CN102955818A - Method for acquiring full names in Chinese from Web page - Google Patents
- Publication number
- CN102955818A CN201110253100A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to a method for acquiring Chinese full names from Web pages. The method comprises: inputting a known abbreviation; selecting a query mode and constructing a query term; submitting the query term to Google to obtain anchor texts; extracting a corpus of full-name/abbreviation sentences from the anchor texts; extracting candidate full names with an extraction algorithm; and ranking the candidate full names with a composite priority function. Two query modes are used, each with its own full-name extraction algorithm. The invention also defines an ontology of the relation between full names and abbreviations, comprising a set of constraint axioms, which express the constraints between a full name and its abbreviation qualitatively, and a set of constraint functions, which express them quantitatively. Based on this ontology, a full-name verification method and a full-name classification method are proposed. The method enables large-scale, high-accuracy acquisition of full names and explores the classification of full names by computer, thereby providing effective support for the intelligent acquisition of large-scale knowledge.
Description
Technical field
The present invention relates to full-name acquisition techniques in the fields of Chinese information processing and information retrieval, and in particular to a method for acquiring Chinese full names from Web pages that is multidisciplinary, large-scale, and highly accurate.
Background technology
Natural language processing is a major topic in computer science and artificial intelligence. It studies the theories and methods that allow people and computers to communicate effectively in natural language. With the widespread use of computers and the Internet, the volume of natural-language text available to computers has grown at an unprecedented rate, and demand is growing rapidly for applications such as text mining over massive data, information extraction, cross-language information processing, and human-computer interaction. The object of natural language processing has accordingly shifted from small-scale restricted language to large-scale real text, and this research will have a far-reaching effect on people's lives.
Chinese information processing studies how to use computers to process Chinese-language information automatically. Chinese is a paratactic language: compared with Western languages it lacks explicit markers, and its grammar, semantics, and pragmatics are more flexible, which makes it harder for computers to understand and process. Many difficulties remain to be overcome before computers can process Chinese information well. At present, Chinese information processing has achieved results in areas such as speech recognition, word segmentation, and machine translation. Raising the degree of automation in Chinese information processing will bring considerable benefits to China's science and technology, culture, economy, and security.
Information retrieval studies techniques for obtaining the needed information quickly and accurately from large volumes of complex information. After many years of development, information retrieval technology is now quite mature, and new retrieval techniques are moving toward intelligence, mobility, diversification, and personalization.
A full name (Full Name, Fn) is the complete designation of a name. An abbreviation (Abbreviation, An) is the designation obtained by simplifying and compressing a full name for the sake of brevity and clarity. If Fn and An stand in a full-name/abbreviation relation, Fn is called the full name of An and An the abbreviation of Fn, written FA(Fn, An). Going from a full name to an abbreviation can be viewed as compressing information, and going from an abbreviation to a full name as decompressing it. For example, compressing c1 = "Institute of Computing Technology, Chinese Academy of Sciences" yields c2 = "Computing Institute of the Chinese Academy of Sciences", and compressing c2 yields c3 = "CAS Computing Institute"; decompressing c3 yields c2, and decompressing c2 yields c1. Full name and abbreviation are both relative concepts: in the example above, c2 is an abbreviation with respect to c1 but a full name with respect to c3, so it is meaningless to call c2 on its own either a full name or an abbreviation.
Acquiring full-name/abbreviation relations is a basic and critical problem in applications such as knowledge acquisition from text (Knowledge Acquisition from Text, KAT) and information retrieval. Acquisition methods fall into two broad classes. The first is pattern-based: it relies mainly on linguistics and natural language processing, extracts relation patterns through lexical and syntactic analysis, and then obtains full-name/abbreviation relations by pattern matching; its accuracy depends on linguistic knowledge and on the pattern base. The second is statistics-based: it relies mainly on corpora and statistical language models and obtains full-name/abbreviation relations by computing the degree of association between concepts; its accuracy and efficiency rarely reach a practical level. The acquisition problem can also be approached from two angles. One is mining, that is, obtaining full-name/abbreviation pairs without any external input. The other is lookup, that is, finding the abbreviation of a known full name or the full name of a known abbreviation.
Unless otherwise stated, "full name" and "abbreviation" in the present invention refer to Chinese full names and Chinese abbreviations.
Summary of the invention
To address the limitations and low accuracy of existing techniques for acquiring full-name/abbreviation relations, the invention provides a highly accurate method, applicable across disciplines and at very large scale, for acquiring Chinese full names from Web pages.
To solve the above problems, the invention provides a method for acquiring Chinese full names from Web pages, comprising the steps of:
Step 1: input a given Chinese abbreviation;
Step 2: select a query pattern and construct a query term, submit the query term to the Google search engine, and save the first N anchor texts as the anchor corpus;
Step 3: using regular expressions, extract from the anchor corpus the sentences that contain the relation expressed by the query term, and save them as the full-name/abbreviation corpus;
Step 4: use the full-name extraction algorithm EFN to extract candidate full names from the full-name/abbreviation corpus, forming the candidate full-name set;
Step 5: verify the candidate full-name set against the full-name/abbreviation relation constraint, forming the full-name set;
Step 6: classify the full-name set based on the full-name/abbreviation relation constraint, forming a full-name set with class labels.
In the above technical scheme, step 2 uses two query patterns: query pattern 1, "abbreviated as An", and query pattern 2, "An full name". In an experiment with 4000 Chinese abbreviations, query pattern 1 returned an anchor corpus for 88.75% of them, query pattern 2 for 24.76%, and at least one of the two patterns for 91.07%. To improve search efficiency, query pattern 1 is therefore tried first and query pattern 2 second.
In the above technical scheme, the full-name extraction algorithm EFN of step 4 comprises two algorithms, EFN1 and EFN2, which correspond to the two query patterns of step 2: when query pattern 1 is selected in step 2, EFN1 is used to extract Fn in step 4; when query pattern 2 is selected in step 2, EFN2 is used to extract Fn in step 4.
In the above technical scheme, if the full-name set obtained in step 5 is empty and another query pattern is still available in step 2, steps 2-6 are re-executed; if the full-name set is empty and no alternative query pattern remains in step 2, the method exits, indicating that no full name of the given abbreviation can be found on the Web.
In the above technical scheme, the full-name/abbreviation relation constraint of step 5 is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Both kinds of constraint are explained further below.
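The control flow of steps 1-5, including the fallback from query pattern 1 to query pattern 2, can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `acquire_full_names` and the callables `search`, `keep_sentence`, and `verify` are hypothetical stand-ins for the Google query, the regular-expression filter of step 3, and the constraint-based verification of step 5. The default N = 100 follows the definition of GoogleArchSet(An) later in the description, and step 6 (classification) is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases; none of these names come from the patent text.
BuildQuery = Callable[[str], str]                # An -> query term
Extract = Callable[[str, List[str]], List[str]]  # (An, corpus) -> candidate full names


def acquire_full_names(
    an: str,
    modes: Sequence[Tuple[BuildQuery, Extract]],
    search: Callable[[str], List[str]],
    keep_sentence: Callable[[str, str], bool],
    verify: Callable[[str, str], bool],
    n: int = 100,
) -> List[str]:
    """Try each query mode in priority order until one yields verified full names."""
    for build_query, extract in modes:
        anchors = search(build_query(an))[:n]                   # step 2: first N anchor texts
        corpus = [s for s in anchors if keep_sentence(s, an)]   # step 3: filter sentences
        candidates = extract(an, corpus)                        # step 4: EFN1 or EFN2
        full_names = [c for c in candidates if verify(c, an)]   # step 5: verification
        if full_names:
            return full_names
    return []  # no query mode left: no full name found on the Web
```

Passing the two (query builder, extractor) pairs in priority order keeps the pairing of query pattern 1 with EFN1 and query pattern 2 with EFN2 explicit.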
Beneficial effect: the present invention obtains the full name corresponding to a known abbreviation from the Web, that is, it acquires full-name/abbreviation relations from the lookup angle. It uses a pattern-based method to obtain candidate full names from Google and a statistics-based method to verify them. The method is multidisciplinary, large-scale, and highly accurate, and it explores the classification of full names by computer, providing effective support for the intelligent acquisition of large-scale knowledge.
Description of drawings
Fig. 1 is an overall schematic diagram of obtaining a full name from an abbreviation;
Fig. 2 is a flow chart of obtaining a full name with query pattern 1;
Fig. 3 is a flow chart of obtaining a full name with query pattern 2;
Fig. 4 is a flow chart of the post-processing of the candidate full-name set;
Fig. 5 is the verification decision tree generated from the full-name/abbreviation constraint function set.
Embodiment
The invention will be further described below in conjunction with the drawings and specific embodiments:
Before the method of the present invention is described, the formation rules and word-formation patterns of abbreviations in the full-name/abbreviation relation are first summarized. In this relation, going from a full name to an abbreviation can be viewed as compressing information. The compression is sometimes accompanied by a semantically equivalent substitution or by an adjustment of word order, so full-name/abbreviation relations are divided into the ordinary type, the variant-character type, and the variant-order type.
Ordinary type: every character of the abbreviation appears in the full name, and the characters keep the order they have in the full name, for example Fn = "People's Republic of China", An = "China";
Variant-character type: some characters of the abbreviation do not appear in the full name, that is, going from the full name to the abbreviation involves not only compression but also a semantically equivalent substitution, for example Fn = "Wa Huang Shengmumiao", An = "Chinese mythology goddess mausoleum";
Variant-order type: the order of the characters in the abbreviation differs from the order of the corresponding components in the full name, for example Fn = "Harbin No. 6 Pharmaceutical Factory", An = "Ha-Medicine Six Factory".
The present invention defines a full-name/abbreviation relation constraint to express the constraints between Fn and An. The constraint is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Before the constraint function set and the constraint axiom set are described in detail, the basic symbols used below are listed:
An denotes an abbreviation;
Cfn denotes a candidate full name of An;
Fn denotes a full name of An;
GoogleArchSet(An) denotes the Google anchor text set of An, i.e., the set of the first 100 anchor texts returned when Google is searched for the full name corresponding to An; if the total number N of anchor texts returned is less than 100, GoogleArchSet(An) contains only those N anchor texts;
CfnSet(An) denotes the candidate full-name set of An, i.e., the set of candidate full names of An extracted from GoogleArchSet(An);
N_CfnSet(An) denotes the number of candidate full names in CfnSet(An);
FnSet(An) denotes the full-name set of An, i.e., the set formed by the elements of CfnSet(An) that pass verification;
AnSet(Fn) denotes the abbreviation set of Fn, i.e., the set of abbreviations obtained from Google for a given Fn;
FA(Fn, An) denotes that Fn and An stand in a full-name/abbreviation relation;
length(str) denotes the length of the Chinese character string str, i.e., the number of Chinese characters in str;
N_word(Fn, An) denotes the number of Chinese characters that appear in both Fn and An;
N_Clas(Fn) denotes the number of segments obtained after Fn is word-segmented;
N_Cover(Fn, An) denotes the number of segments of Fn covered by An;
CoverSet(Fn, An) denotes the set of segments of Fn covered by An;
p denotes a segment contained in a full name;
p1/p2/.../pm denotes the segmentation sequence formed by segments p1, p2, ..., pm, where "/" is the separator between segments;
centre(Fn) denotes the position of the segmentation centre of Fn, i.e., after Fn is segmented, the position of the middle segment or the mean position of the two middle segments: centre(Fn) = (N_Clas(Fn) + 1)/2;
d_i(Fn) denotes the centre offset of the i-th segment of Fn, i.e., the displacement between the position of the segmentation centre of Fn and the position of the i-th segment of Fn: d_i(Fn) = i - centre(Fn);
The maximum centre offset of Fn, i.e., the maximum of the centre offsets of all segments of Fn, equals (N_Clas(Fn) - 1)/2;
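The three position formulas just given can be evaluated directly. The short sketch below does so for a full name with four segments; the function names are hypothetical and chosen only for illustration, and segment positions are numbered from 1 as in the text.

```python
from typing import List


def centre(n_clas: int) -> float:
    """centre(Fn) = (N_Clas(Fn) + 1) / 2, with segments numbered from 1."""
    return (n_clas + 1) / 2


def centre_offsets(n_clas: int) -> List[float]:
    """d_i(Fn) = i - centre(Fn) for i = 1 .. N_Clas(Fn)."""
    c = centre(n_clas)
    return [i - c for i in range(1, n_clas + 1)]


def max_centre_offset(n_clas: int) -> float:
    """Maximum centre offset: (N_Clas(Fn) - 1) / 2."""
    return (n_clas - 1) / 2


print(centre(4))             # 2.5
print(centre_offsets(4))     # [-1.5, -0.5, 0.5, 1.5]
print(max_centre_offset(4))  # 1.5
```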
Len_i(Fn, An) denotes the number of segments in the i-th uncovered segment string. After Fn is segmented, the segments not covered by An form one uncovered segment string if they are adjacent in Fn, and each forms a string of its own if they are not; the number of segments in the i-th uncovered segment string is written Len_i(Fn, An);
freq(Fn, An) denotes the number of times Fn is extracted from GoogleArchSet(An);
Represent an infinitesimal number;
loca(Cfn, An) denotes the frequency rank of Cfn in CfnSet(An), i.e., the position of Cfn after the elements of CfnSet(An) are sorted in ascending order of freq(Cfn, An);
NoInclude(s1, Set) denotes that no Chinese character string in the set Set is a substring of the Chinese character string s1;
Interrogative denotes the set of interrogative words, such as "what" and "how";
concat(s1, s2) denotes the Chinese character string obtained by concatenating the strings s1 and s2;
concat(s1, ..., sn) denotes the Chinese character string obtained by concatenating the strings s1, ..., sn in order;
Contain(s1, s2) denotes that every character of the Chinese character string s2 appears in the Chinese character string s1;
Include(s1, s2) denotes that the Chinese character string s2 is a proper substring of the Chinese character string s1;
Prefix (s1, s2) expression s1 is with respect to the prefix of s2, and prefix (s1, s2) be sky, i.e. s1=concat (prefix (s1, s2), s2, s3), and wherein s3 can be empty string;
The meaning of each function in the constraint function set is described below in 11 parts:
Constraint function 1: the proportion of the characters of An that come from Fn.
In general, a full name contains every Chinese character of its abbreviation. For example, with An = "Beijing University" and Fn = "Peking University", every Chinese character of An comes from Fn. Within a candidate full-name set, a candidate that contains a higher proportion of the characters of An has a higher priority.
The formal definition and calculation of constraint function 1 are as follows (note: this function is an improvement on the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation", patent No. ZL200710119513.4):
For example, An = "Eight Trigram Palm", Cfn1 = "eight-diagram palm", Cfn2 = "a chain of fist of Eight Diagrams". According to constraint function 1, the priority of Cfn1 is higher than that of Cfn2.
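The formula for constraint function 1 does not survive in this text, so the sketch below is only an assumed reading of the prose: the share of the characters of An that also occur in the candidate, i.e. N_word(Fn, An) / length(An). The function names and the Chinese sample strings are illustrative assumptions, not taken from the patent.

```python
def n_word(fn: str, an: str) -> int:
    """Number of characters of An that also appear in Fn (assumed reading of N_word)."""
    return sum(1 for ch in an if ch in fn)


def char_ratio(fn: str, an: str) -> float:
    """Assumed form of constraint function 1: N_word(Fn, An) / length(An)."""
    return n_word(fn, an) / len(an) if an else 0.0


# Illustrative strings only (assumed, not quoted from the patent).
print(char_ratio("北京大学", "北大"))  # 1.0: both characters of An occur in the candidate
print(char_ratio("北京交通", "北大"))  # 0.5: only one of the two characters occurs
```

Under this reading a candidate that contains more of the abbreviation's characters scores closer to 1, which matches the stated priority rule.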
Constraint function 2: the character order of Fn and An.
In the abbreviation process, most abbreviations keep the character order of the full name. For example, with An = "Olympic Games" and Fn = "Olympic Games" (the abbreviated and full Chinese forms translate identically), the three characters of An appear in strictly the same order as they do in Fn.
The formal definition and calculation of constraint function 2 are as follows (note: this function is the same as in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation", patent No. ZL200710119513.4):
Note: for Fn and An to have the same character order, every character of An must appear in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0.
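The order condition described for constraint function 2 amounts to checking that An is a subsequence of Fn. The sketch below shows only that check; the numeric value the patent assigns is not reproduced in this text, so the function returns a boolean, and the name `same_order` and the sample strings are assumptions made for illustration.

```python
def same_order(fn: str, an: str) -> bool:
    """True if every character of An occurs in Fn, in the same relative order.

    A single iterator over Fn is consumed left to right, so each character
    of An must be found after the previous match (subsequence test).
    A character of An that is missing from Fn makes the result False,
    which corresponds to the value 0 mentioned in the note above.
    """
    remaining = iter(fn)
    return all(ch in remaining for ch in an)


# Illustrative strings only (assumed, not quoted from the patent).
print(same_order("奥林匹克运动会", "奥运会"))    # True: order preserved
print(same_order("哈尔滨第六制药厂", "哈药六厂"))  # False: two characters are swapped
```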
Constraint function 3: the segment coverage rate of Fn by An.
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments generally do not exceed half of the segments of the full name. The more segments of a candidate full name are covered by the abbreviation, the more likely the candidate is to be the full name.
The formal definition and calculation of constraint function 3 are as follows (note: this function is an improvement on the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation", patent No. ZL200710119513.4):
For example, An = "Beijing University", Cfn1 = "Beijing/university", Cfn2 = "Beijing/traffic/university". According to constraint function 3, the priority of Cfn1 is higher than that of Cfn2.
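Since the formula is not reproduced here, the following sketch assumes the simplest reading of the coverage rate, N_Cover(Fn, An) / N_Clas(Fn), and assumes that a segment counts as covered when it shares at least one character with An. That coverage rule is a simplification that can over-count when a character of An occurs in more than one segment. The names and sample strings are illustrative assumptions.

```python
from typing import List


def cover_set(segments: List[str], an: str) -> List[str]:
    """Segments of Fn sharing at least one character with An (assumed coverage rule)."""
    return [p for p in segments if any(ch in an for ch in p)]


def coverage(segments: List[str], an: str) -> float:
    """Assumed form of constraint function 3: N_Cover(Fn, An) / N_Clas(Fn)."""
    return len(cover_set(segments, an)) / len(segments) if segments else 0.0


# Illustrative segmentations only (assumed, not quoted from the patent).
print(coverage(["北京", "大学"], "北大"))        # 1.0
print(coverage(["北京", "交通", "大学"], "北大"))  # two of three segments covered
```

With these inputs the two-segment candidate scores higher than the three-segment one, in line with the example above.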
Constraint function 4: the centre of gravity of the segments of Fn covered by An.
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments should be distributed evenly over the full name rather than all concentrated in its front part or its rear part. For example, with An = "Guihang Group" and Fn = "China/Guizhou/aviation/industry/group/company", the omitted segments "China", "industry", and "company" lie in the front, middle, and rear parts of Fn respectively.
The formal definition and calculation of constraint function 4 are as follows:
For example, An = "Shanda", Cfn1 = "Shandong/university", Cfn2 = "Shandong/university/Weihai/branch school". The segments of Cfn1 covered by An, "Shandong" and "university", are distributed evenly over Cfn1, whereas the segments of Cfn2 covered by An, "Shandong" and "university", both lie in the first half of Cfn2. According to constraint function 4, the priority of Cfn1 is higher than that of Cfn2.
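The formula for constraint function 4 is missing from this text. One way to quantify "evenly distributed" with the symbols defined earlier is to average the centre offsets d_i(Fn) of the covered segments and normalise by the maximum centre offset; a value near 0 means balanced coverage. This is an assumed measure, not the patent's definition, and the function name is hypothetical. The input is one boolean per segment saying whether An covers it.

```python
from typing import List, Optional


def coverage_centre_of_gravity(covered: List[bool]) -> Optional[float]:
    """Mean centre offset of covered segments, scaled to [-1, 1] (assumed measure).

    Returns None when no segment is covered.
    """
    n = len(covered)
    centre = (n + 1) / 2                      # centre(Fn)
    offsets = [i - centre for i, c in enumerate(covered, start=1) if c]
    if not offsets:
        return None
    max_offset = (n - 1) / 2                  # maximum centre offset
    mean = sum(offsets) / len(offsets)
    return mean / max_offset if max_offset else 0.0


print(coverage_centre_of_gravity([True, True]))                # 0.0: balanced
print(coverage_centre_of_gravity([True, True, False, False]))  # negative: skewed to the front
```

The two calls mirror the example above: both segments covered in a two-segment candidate, versus only the first two covered in a four-segment candidate.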
Constraint function 5: the largest number of consecutive segments of Fn not covered by An.
A candidate full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments do not usually occur consecutively in the full name; that is, consecutive segments of the full name are unlikely to be omitted together in the abbreviation.
The formal definition and calculation of constraint function 5 are as follows:
where N denotes the number of uncovered segment strings contained in Fn.
For example, An = "Communist Youth League", Cfn1 = "common property/doctrine/Communist Youth League", Cfn2 = "China/people/republic/common property/doctrine/Communist Youth League". In Cfn1 the only segment not covered by An is "doctrine", whereas in Cfn2 the uncovered segments "China", "people", and "republic" are adjacent. According to constraint function 5, the priority of Cfn1 is higher than that of Cfn2.
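The quantities Len_i(Fn, An) and their maximum can be computed from a per-segment coverage flag. The sketch below assumes the flags are already known, because the rule the patent uses to decide coverage is not spelled out in this text; it shows only the run-length step and does not reproduce the final formula of constraint function 5. Function names are hypothetical.

```python
from typing import List


def uncovered_runs(covered: List[bool]) -> List[int]:
    """Lengths Len_1, Len_2, ... of maximal runs of uncovered segments."""
    runs: List[int] = []
    current = 0
    for is_covered in covered:
        if is_covered:
            if current:
                runs.append(current)  # a run of uncovered segments just ended
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)          # run that reaches the end of Fn
    return runs


def longest_uncovered_run(covered: List[bool]) -> int:
    """Largest Len_i, or 0 when every segment is covered."""
    return max(uncovered_runs(covered), default=0)


# Coverage flags matching the two candidates in the example above.
print(uncovered_runs([True, False, True]))                                # [1]
print(longest_uncovered_run([False, False, False, True, False, True]))   # 3
```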
Constraint function 6: the length relation between Fn and An.
A standard abbreviation is usually not shortened excessively, so that most people can still infer its meaning from its form. The length of the full name corresponding to most abbreviations therefore falls within a range, generally 1.5 to 5 times the length of the abbreviation, and a full name whose length falls outside this range is less likely.
The formal definition and calculation of constraint function 6 are as follows (note: this function is an improvement on the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation", patent No. ZL200710119513.4):
For example, An = "CAS Computing Institute", Cfn1 = "Institute of Computing Technology, Chinese Academy of Sciences", Cfn2 = "Institute of Computing Technology, Chinese Academy of Sciences residential building". According to constraint function 6, the priority of Cfn1 is higher than that of Cfn2.
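The 1.5x to 5x range stated above can be checked with a simple ratio. The sketch below implements only that range test; the scoring formula of constraint function 6 is not reproduced in this text, so how candidates inside the range are ranked against each other is left open. Function names are hypothetical.

```python
def length_ratio(fn: str, an: str) -> float:
    """length(Fn) / length(An), counting characters."""
    return len(fn) / len(an)


def in_typical_range(fn: str, an: str, low: float = 1.5, high: float = 5.0) -> bool:
    """True if Fn is between `low` and `high` times as long as An (bounds from the text)."""
    return low <= length_ratio(fn, an) <= high


print(in_typical_range("abcd", "ab"))    # True: ratio 2.0
print(in_typical_range("ab", "ab"))      # False: ratio 1.0 is below 1.5
print(in_typical_range("a" * 11, "ab"))  # False: ratio 5.5 is above 5
```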
Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An).
When a full name is looked up on Google from an abbreviation, a candidate full name that occurs more often in GoogleArchSet(An) has a higher priority.
The formal definition and calculation of constraint function 7 are as follows:
For example, An = "lithium battery", Cfn1 = "lithium ion battery", Cfn2 = "lithium-ion power battery", Freq(Cfn1) = 42, Freq(Cfn2) = 12. According to constraint function 7, the priority of Cfn1 is higher than that of Cfn2.
When Fn is looked up from An, several candidate full names are sometimes obtained; together they form the candidate full-name set Set_CFN. For any candidate Cfn_i in Set_CFN, FA(Cfn_i, An) can be analysed by comparison with the index values of the other candidates in Set_CFN.
The following four constraint functions are defined over the candidate full-name set.
Constraint function 8: the relative proportion of the characters of An that come from Cfn.
Compared with constraint function 1, constraint function 8 emphasizes the relative standing of a candidate full name within Set_CFN. For example, the abbreviation of some transliterated foreign word shares no character with its full name, and some abbreviations undergo a synonym substitution when they are restored to the full name.
The formal definition and calculation of constraint function 8 are as follows:
For example, An = "acquired immune deficiency syndrome (AIDS)", Cfn1 = "aids", Cfn2 = "acquired immunodeficiency syndrome". Although An and Cfn1 share no character, An and Cfn2 share no character either, so Cfn1 cannot be ruled out as a full name merely because its value of constraint function 1 is 0.
Constraint function 9: the relative coverage rate of Fn within the candidate full-name set.
Compared with constraint function 3, constraint function 9 emphasizes the relative standing of a candidate full name within Set_CFN. For example, when an abbreviation covers none of its candidate full names well, the candidate with the relatively higher coverage rate has the higher priority.
The formal definition and calculation of constraint function 9 are as follows:
For example, An = "Tsing Hua Tong Fang", Cfn1 = "Tsing-Hua University/with side/share/limited/company", Cfn2 = "Tsing-Hua University/with side/CD/share/limited/company". Although the segment coverage rate of both Cfn1 and Cfn2 by An is low, the coverage rate of Cfn1 is relatively higher, so Cfn1 has a higher priority than Cfn2.
Constraint function 10: the frequency of Fn within the candidate full-name set.
When Fn is looked up from An, the frequencies of all candidates in the candidate full-name set are sometimes very low, which weakens the effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate: within the candidate full-name set, a candidate with a relatively higher frequency has a higher priority.
The formal definition and calculation of constraint function 10 are as follows:
For example, An = "eel connection", Cfn1 = "world's eel vegetative propagation joint conference", Cfn2 = "Shantou eel community of stock part company limited", Freq(Cfn1) = 3, Freq(Cfn2) = 1. Although the frequencies of Cfn1 and Cfn2 are both low according to constraint function 7, their frequencies within the candidate full-name set are both relatively high according to constraint function 10.
Constraint function 11: the relative position of Fn after the elements of the candidate full-name set are sorted in ascending order of frequency.
When the candidate full-name set has many elements, the candidates with lower frequency are relatively less important.
The formal definition and calculation of constraint function 11 are as follows:
The lower the value of constraint function 11, the less important the candidate full name.
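The frequency-based quantities used by constraint functions 7, 10, and 11 can be illustrated as below. Counting extractions gives freq(Cfn, An); the relative frequency is assumed here to be each count divided by the total over the candidate set; and loca(Cfn, An) is the 1-based rank after an ascending sort, as defined in the symbol list. The exact formulas do not survive in this text, so the normalisation is an assumption, the function names are hypothetical, and ties in `loca` are broken arbitrarily by input order.

```python
from collections import Counter
from typing import Dict, Iterable


def frequencies(extracted: Iterable[str]) -> Counter:
    """freq(Cfn, An): how many times each candidate was extracted."""
    return Counter(extracted)


def relative_frequency(freq: Dict[str, int]) -> Dict[str, float]:
    """Share of each candidate in the candidate set (assumed normalisation)."""
    total = sum(freq.values())
    return {cfn: n / total for cfn, n in freq.items()}


def loca(freq: Dict[str, int]) -> Dict[str, int]:
    """1-based rank of each candidate after sorting by ascending frequency."""
    ordered = sorted(freq, key=lambda cfn: freq[cfn])
    return {cfn: rank for rank, cfn in enumerate(ordered, start=1)}


# Counts taken from the example for constraint function 10 (3 and 1).
freq = frequencies(["Cfn1", "Cfn1", "Cfn2", "Cfn1"])
print(relative_frequency(freq))  # {'Cfn1': 0.75, 'Cfn2': 0.25}
print(loca(freq))                # {'Cfn2': 1, 'Cfn1': 2}
```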
The above has described, in 11 parts, the meaning of the constraint functions in the constraint function set. They express the constraints between Fn (or Cfn) and An quantitatively, whereas the constraint axioms express the constraints between Fn (or Cfn) and An qualitatively. The constraint axioms are described below:
Constraint axiom 1: unequal-length axiom
Intuitive meaning: in a full-name/abbreviation relation, the number of characters in Fn must be greater than the number of characters in An.
Constraint axiom 2: declarative-mood axiom
Formal representation:
Intuitive meaning: neither Fn nor An contains an interrogative word such as "what" or "how".
Constraint axiom 3: no-form-repetition axiom
Intuitive meaning: in a full-name/abbreviation relation, neither Fn nor An can be a Chinese character string of the form ss, where s is a Chinese character string.
For example, An = "Hainan Island", Cfn = "Jade Flowery Islet, Jade Flowery Islet". Cfn has the form ss with s = "Jade Flowery Islet", so Cfn should be corrected to s. This arises because no punctuation mark separates the two occurrences of "Jade Flowery Islet" in the corpus.
Constraint axiom 4: no-semantic-repetition axiom
Formal representation:
Intuitive meaning: Fn must not repeat itself semantically.
For example, An = "Hainan Island", Cfn = "Jade Flowery Islet Hainan Island". Cfn has the form s1s2 with s1 = "Jade Flowery Islet" and s2 = "Hainan Island", so Cfn is incorrect. This arises because no punctuation mark separates s1 and s2 in the corpus.
Constraint axiom 5: full-name/abbreviation equivalence axiom
Intuitive meaning: in a full-name/abbreviation relation, Fn must belong to the full-name set of An, and An must belong to the abbreviation set of Fn.
Constraint axiom 5 is not used to verify full-name/abbreviation relations; it is used to expand the full-name/abbreviation knowledge base.
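Constraint axioms 1 to 3 are simple string checks and can be sketched as follows. The interrogative list is a small hypothetical sample (the patent's full Interrogative set is not reproduced in this text), the function names are assumptions, and axiom 4 is left out because detecting semantic repetition needs knowledge beyond string comparison.

```python
from typing import Iterable

# Hypothetical sample of interrogative words; the patent's own set is not listed here.
INTERROGATIVES = ("什么", "怎么", "如何")


def axiom1_length(fn: str, an: str) -> bool:
    """Axiom 1: Fn must have more characters than An."""
    return len(fn) > len(an)


def axiom2_declarative(fn: str, an: str,
                       interrogatives: Iterable[str] = INTERROGATIVES) -> bool:
    """Axiom 2: neither Fn nor An contains an interrogative word."""
    return not any(w in fn or w in an for w in interrogatives)


def collapse_ss(s: str) -> str:
    """Axiom 3 repair: a string of the form ss is corrected to s; others are unchanged."""
    half, odd = divmod(len(s), 2)
    if half > 0 and not odd and s[:half] == s[half:]:
        return s[:half]
    return s


# Illustrative strings only (assumed, not quoted from the patent).
print(collapse_ss("琼岛琼岛"))  # 琼岛
print(collapse_ss("海南岛"))    # 海南岛 (unchanged)
```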
Now that the full-name/abbreviation relation constraint defined by the present invention has been described in detail, embodiments of the method are introduced with reference to Fig. 1.
The method of identifying a Chinese full name from a Chinese abbreviation according to the present invention comprises two major steps, namely producing the candidate full-name set and post-processing the candidate full-name set, which are described in turn below. Because query pattern 1 and query pattern 2 produce the candidate full-name set in different ways, they are introduced separately.
As shown in Figure 2, utilize the specific implementation step of query pattern 1 generation candidate full name collection as follows:
Step 1-1, user input known Chinese abbreviation An;
Step 1-2, according to query pattern 1: " being called for short An ", construct concrete query term.
Step 1-3, query term is submitted in the Google search engine searches for, preserve front 100 anchor texts as the anchor language material.
Step 1-4, by regular expression, from the anchor language material, obtain the full abbreviation sentence that comprises query term, preserve as the full language material that is called for short.
The full sentence that is called for short mainly is divided into three types, that is: label is to type, without the suffix type with the suffix type is arranged.Label is to type: the An back is without Chinese character, and Cfn is paired label and marks, and need not to determine the border of Cfn, directly extraction.Without the suffix type: the An back is without Chinese character, and Cfn is not paired label and marks, and Cfn need decide left margin.The suffix type is arranged: there is Chinese character the An back, shows that An is the first half of another abbreviation " An* ", so also this is the first half of full name " Cfn* " corresponding to " An* " to Cfn, so Cfn need determine border, the left and right sides.
Step 1-5: the benchmark (formal) candidate full-name set is extracted with algorithm FCFNEA.
Algorithm for extracting the benchmark candidate full-name set (formal candidate fullname extract algorithm, FCFNEA):
Input: the tag-pair-type sentence set […], the no-suffix-type sentence set […], and the suffix-type sentence set […]
Step4: […] is segmented with ICTCLAS, and the first segment […] and the last segment […] form […]; […] → […]
Step5: for […], if […] contains an entry […] whose prefix is pre and whose suffix is suf, then […] → […] and […] is deleted from […]; the priority sorting strategy PSCF is used to obtain the best candidate of […], […] → […];
The priority sorting strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:
Priority sorting strategy (priority sort comparison function, PSCF)
1). […];
2). […],
Step 1-6: the non-benchmark (informal) candidate full-name set is extracted with algorithm ICFNEA.
Algorithm for extracting non-benchmark candidate full names (informal candidate fullname extract algorithm, ICFNEA):
Input: the phrase or short sentence to be extracted, Co-referent; the known concept word Inputitem = {C1 C2 … Cn};
Output: the extracted full-name candidate, Candidate;
Step1: segment Co-referent and tag parts of speech; the segmentation result is {P1 P2 … Pm}
Step2: define the position variables left_flag ← k, left ← 1
Step3: for each Ci ∈ {Cn Cn-1 …… C1}
    for each Pj ∈ {Pleft_flag Pleft_flag-1 …… P1}
      if Ci appears in Pj then
        left_flag ← j
        break;
      end if
    end for each
  end for each
Step4: for each Pk ∈ {P1 P2 …… Pm}
    if the part of speech of Pk ∈ {conjunction, preposition, auxiliary word, verb, measure word, label} and k < left_flag then
      left ← k+1;
    end if
  end for each
Step5: return Candidate ← {Pleft …… Pm};
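The left-boundary search of ICFNEA can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: segmentation is assumed to have been done already (ICTCLAS is not called), the part-of-speech names in STOP_POS are hypothetical English labels, and left_flag is assumed to start at the last segment because its printed initial value is unclear.

```python
# Hypothetical part-of-speech labels standing for the categories listed in Step4.
STOP_POS = {"conjunction", "preposition", "auxiliary", "verb", "measure", "label"}

def icfnea(tokens, abbreviation):
    """tokens: list of (word, pos) pairs for Co-referent; abbreviation: characters C1..Cn.

    Returns the words from the determined left boundary to the end (0-based indices
    are used internally, whereas the pseudocode is 1-based).
    """
    left_flag = len(tokens) - 1  # assumption: start the backward scan at the last segment
    # Step3: scan the abbreviation from its last character to its first, moving
    # left_flag to the nearest segment at or left of it that contains the character.
    for ch in reversed(abbreviation):
        for j in range(left_flag, -1, -1):
            if ch in tokens[j][0]:
                left_flag = j
                break
    # Step4: the left boundary is placed just after the last function-like word
    # that occurs strictly before left_flag.
    left = 0
    for k in range(left_flag):
        if tokens[k][1] in STOP_POS:
            left = k + 1
    # Step5: the candidate runs from the left boundary to the end of the sentence.
    return [word for word, _ in tokens[left:]]
```

For example, with segments 他/毕业/于/北京/大学 tagged pronoun, verb, preposition, noun, noun and the abbreviation 北大, the sketch keeps 北京 and 大学.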
Step 1-7: the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined by analogy.
The analogy method is given by method 1 and method 2 below.
Method 1, formal representation: […]
Intuitive meaning of method 1: for any two candidates […] and […] in the candidate full-name set, if the following preconditions hold simultaneously:
1)-3) […]
4) the prefix of […] relative to […] is not a prefix of any other candidate in the candidate full-name set
then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set.
Method 2, formal representation: […]
Intuitive meaning of method 2: for any two candidates […] and […] in the candidate full-name set, if the following preconditions hold simultaneously: […]
then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set.
Step 1-8: the left boundary vocabulary LBV and the right boundary vocabulary RBV are read in, and the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined with LBV and RBV. The algorithm is as follows:
Algorithm for re-determining the left and right boundaries with LBV and RBV (RDLRB):
Input: candidate full name Cfn, abbreviation An, left boundary vocabulary LBV, right boundary vocabulary RBV
Output: the candidate full name CFN after its left and right boundaries are re-determined
Step1: segment Cfn with ICTCLAS; the result is Cfn_clas = P1 P2 … Pn
Step2: determine the segments Pi and Pj of Cfn that correspond to the first and last characters of An respectively
Step3: define the left boundary of Cfn: left ← 1;
  for each Pk ∈ {Pi-1 …… P1}
    if Pk is in the left boundary vocabulary then
      left ← k+1;
      break;
    end if
  end for each
Step4: define the right boundary of Cfn: right ← n;
  for each Pk ∈ {Pj+1 …… Pn}
    if Pk is in the right boundary vocabulary then
      right ← k-1;
      break;
    end if
  end for each
Step5: CFN ← {Pleft …… Pright};
  return CFN;
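A compact Python sketch of RDLRB follows. It is a sketch under assumptions rather than the patent's code: the candidate is passed in already segmented, Pi is taken to be the first segment containing the first character of An and Pj the last segment containing the last character of An (the text does not say how they are located), and the vocabularies are plain sets.

```python
def rdlrb(segments, an, lbv, rbv):
    """Trim a segmented candidate full name using boundary vocabularies.

    segments: list of words P1..Pn; an: the abbreviation; lbv / rbv: sets of
    left and right boundary words. Assumes the first and last characters of
    an each occur in some segment.
    """
    # Step2 (assumed lookup rule): locate the segments covering An's ends.
    i = next(k for k, s in enumerate(segments) if an[0] in s)
    j = max(k for k, s in enumerate(segments) if an[-1] in s)
    # Step3: walk leftwards from Pi-1; cut just after the first boundary word met.
    left = 0
    for k in range(i - 1, -1, -1):
        if segments[k] in lbv:
            left = k + 1
            break
    # Step4: walk rightwards from Pj+1; cut just before the first boundary word met.
    right = len(segments) - 1
    for k in range(j + 1, len(segments)):
        if segments[k] in rbv:
            right = k - 1
            break
    # Step5: keep the segments between the two boundaries, inclusive.
    return segments[left:right + 1]
```

With empty vocabularies the candidate is returned unchanged, which matches the default boundaries left ← 1 and right ← n in the pseudocode.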
Step 1-9: the nearest prefix words and nearest suffix words that exist are extracted from the benchmark full-name set and added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively. The nearest left-part words and nearest right-part words that exist are extracted from the non-benchmark full-name set and likewise added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively.
The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step 1-9 are defined and generated as follows:
Definition 1: for Cfni and Cfnj that satisfy the conditions of method 1 and method 2 above, write Cfnj = left + Cfni + right, where left (if not empty) is called the left part of Cfnj and right (if not empty) is called the right part of Cfnj. left and right are each segmented with ICTCLAS; the last segment of left is called the nearest left-part word of Cfnj, and the first segment of right is called the nearest right-part word of Cfnj.
Definition 2: for each candidate full name Cfnk in the benchmark candidate full-name set, if every character of An appears in Cfnk, Cfnk is segmented with ICTCLAS into […]. Let the segments Fi and Fj be those corresponding in Cfnk to the first and last characters of An respectively. Then […] is called the prefix of Cfnk and Fi-1 the nearest prefix word of Cfnk; […] is called the suffix of Cfnk and Fj+1 the nearest suffix word of Cfnk.
Definition 3: dubious left boundary vocabulary (DLBV):
Formal definition:
map<key,map_value> dubious_left_boundary;
Key: string prefix: a nearest left-part word or nearest prefix word
Map_value: int qu: the frequency of prefix as a nearest left-part word
int liu: the frequency of prefix as a nearest prefix word
bool flag: whether prefix needs manual verification
Definition 4: dubious right boundary vocabulary (DRBV):
map<key,map_value> dubious_right_boundary;
Key: string suffix: a nearest right-part word or nearest suffix word
Map_value: int qu: the frequency of suffix as a nearest right-part word
int liu: the frequency of suffix as a nearest suffix word
bool flag: whether suffix needs manual verification
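The map structure of Definitions 3 and 4 can be mirrored in Python roughly as below. The field names qu, liu and flag follow the definitions; the class name BoundaryEntry, the helper build_dubious_vocabulary and the default flag value are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class BoundaryEntry:
    qu: int = 0        # frequency as a nearest left-part (or right-part) word
    liu: int = 0       # frequency as a nearest prefix (or suffix) word
    flag: bool = True  # whether manual verification is needed (assumed default)

def build_dubious_vocabulary(part_words, affix_words):
    """Count how often each word occurs in the two roles.

    part_words: nearest left-part (or right-part) words from the non-benchmark set.
    affix_words: nearest prefix (or suffix) words from the benchmark set.
    """
    table = {}
    for word in part_words:
        entry = table.setdefault(word, BoundaryEntry())
        entry.qu += 1
    for word in affix_words:
        entry = table.setdefault(word, BoundaryEntry())
        entry.liu += 1
    return table
```

The same helper serves for both the left and the right vocabulary, since the two definitions differ only in which words are counted.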
Step 1-10: the dubious left boundary vocabulary and the dubious right boundary vocabulary are manually verified to generate the left boundary vocabulary and the right boundary vocabulary.
The dubious left boundary vocabulary is manually verified as follows. If prefix satisfies:
1) prefix is flagged for manual verification
2) the frequency of prefix as a nearest left-part word > 2
3) the frequency of prefix as a nearest prefix word < 2
4) the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word
then prefix is manually verified to determine whether it is a left boundary word; if it is, it is added to the left boundary vocabulary.
Definition 5: left boundary vocabulary (LBV):
Formal definition:
map<key,map_value> left_boundary;
Key: string prefix: a left boundary word
Map_value: int num: the number of Cfn whose left boundary was determined with prefix
The dubious right boundary vocabulary is manually verified as follows. If suffix satisfies:
1) suffix is flagged for manual verification
2) the frequency of suffix as a nearest right-part word > 2
3) the frequency of suffix as a nearest suffix word < 2
4) the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word
then suffix is manually verified to determine whether it is a right boundary word; if it is, it is added to the right boundary vocabulary.
Definition 6: right boundary vocabulary (RBV):
map<key,map_value> right_boundary_cfn;
Key: string suffix: a right boundary word
Map_value: int num: the number of Cfn whose right boundary was determined with suffix
Step 1-11: the benchmark candidate full-name set and the non-benchmark candidate full-name set are merged to generate the candidate full-name set.
In an experiment with 4000 Chinese abbreviations An, source corpus could be obtained with query pattern 1 for 88.75% of them and with query pattern 2 for 24.76%; abbreviations for which query pattern 2 yields source corpus but query pattern 1 does not account for only 2.33%. In the present invention, query pattern 2 is therefore used only as a supplement to query pattern 1, that is, only when query pattern 1 yields no candidate full name.
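The fallback order just described can be sketched as a small driver in Python. The callables search and extract are hypothetical stand-ins for the search-engine query and for the pattern-specific extraction algorithms; their names and signatures are assumptions, and only the ordering and the cap of 100 anchor texts come from the text.

```python
def candidate_full_names(an, search, extract, cap=100):
    """Try query pattern 1 first and fall back to pattern 2 only if it yields nothing.

    search(pattern, an) -> list of anchor texts (hypothetical callable).
    extract(pattern, anchors, an) -> list of candidate full names (hypothetical callable).
    Returns (pattern_used, candidates), or (None, []) if both patterns fail.
    """
    for pattern in (1, 2):
        # Keep at most the first `cap` anchor texts as the anchor corpus.
        anchors = search(pattern, an)[:cap]
        candidates = extract(pattern, anchors, an)
        if candidates:
            return pattern, candidates
    return None, []
```

Keeping the two patterns in one loop makes the supplementary role of query pattern 2 explicit: it is consulted only when the first pass returns an empty candidate list.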
As shown in Figure 3, the candidate full-name set is generated with query pattern 2 as follows:
Step 2-1: the user inputs a known Chinese abbreviation An.
Step 2-2: a concrete query term is constructed according to query pattern 2, "An full name".
Step 2-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus.
Step 2-4: by constructing regular expressions, the full-name/abbreviation sentences that contain the query term are obtained from the anchor corpus and saved as the full-name/abbreviation corpus.
Step 2-5: the candidate full-name set is extracted with algorithm CFNEA.
Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA):
Input: the prefix Prefix, the known abbreviation Inputitem, and the phrase or short sentence to be extracted, Co-referent
Output: the extracted full-name candidate, Candidate
Step1: define the marker flag ← 0; segment Co-referent (the open-source version apparently cannot be used for commercial purposes), denoted {P1 P2 … Pn}
Step2: for each Pi ∈ {P1 P2 …… Pn}
    if flag = 0 and Pi shares a character with Prefix and Pi shares no character with Inputitem then
      flag ← 1;
    end if
    if flag = 1 and Pi shares no character with Prefix then
      break;
    end if
    if Pi shares a character with Inputitem then
      break;
    end if
  end for each
Step3: if flag = 0 then i ← 0;
Step4: Candidate ← {Pi …… Pn}
  return Candidate
The candidate full-name set is obtained by the above operations and is then post-processed to obtain the final result. Post-processing comprises verifying, classifying and sorting the candidate full names. With reference to Figure 4, the steps are as follows:
Step C-1: each candidate full name in the candidate full-name set is verified against constraint axioms 1-4 of the constraint axiom set.
Step C-2: a decision tree (see Figure 5) is generated from the constraint function set and used to classify the candidate full names in the candidate full-name set; candidates classified as "N" are removed, and candidates classified as "Y" are retained to form the full-name set.
In Figure 5, "N1" denotes a low-frequency variant-character error, "N2" a high-frequency variant-character error, "N3" a low-frequency variant-order error, and "Y" a correct full name.
Step C-3: the full-name set is classified on the basis of the constraint function set.
In the present invention, full names are divided, according to whether they contain variant characters or a variant order, into the ordinary type, the variant-character type and the variant-order type. The ordinary type is further divided, according to whether it is context-related, into the strongly context-independent type, the weakly context-independent type and the context-related type. The context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of FN in the full-name set, and the context-related type is divided into the forward type, the centered type and the backward type according to the centroid of the coverage of FN by An (see Table 1).
The concrete classification criteria and the conditions each class of full name must satisfy are given in Table 2.
Class | Conditions to be satisfied |
High-frequency, strongly context-independent | f1=1 f2=1 f3=1 f11=1 |
Low-frequency, strongly context-independent | f1=1 f2=1 f3=1 f11<1 |
High-frequency, weakly context-independent | f1=1 f2=1 0.823 f3<1 f9=1 f11=1 |
Low-frequency, weakly context-independent | f1=1 f2=1 0.823 f3<1 f9=1 f11<1 |
Forward type, context-related | f1=1 f2=1 f3 1 f4 0.5 |
Centered type, context-related | f1=1 f2=1 0.5 f4 0.5 (f3 0.823 f9 1) |
Backward type, context-related | f1=1 f2=1 f3 1 f4 0.5 |
Variant-order type | f1=1 f2=0 f11=1 |
Variant-character type | f1 1 f7 f10 f7 0.05 f9=1 f11=1)) |
Note that context is a semantic-level concept, so it is difficult for a computer to judge intelligently whether an FN is context-related; the present invention therefore makes an approximate judgement from the word-formation rules by means of the constraint functions.
In Table 2, the intuitive meaning of the high-frequency, strongly context-independent class is: FN contains all characters of An and preserves their order, every segment of FN has a counterpart in An, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the low-frequency, strongly context-independent class is: FN contains all characters of An and preserves their order, every segment of FN has a counterpart in An, and FN does not have the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the high-frequency, weakly context-independent class is: FN contains all characters of An and preserves their order, most segments of FN have a counterpart in An, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the low-frequency, weakly context-independent class is: FN contains all characters of An and preserves their order, most segments of FN have a counterpart in An, and FN does not have the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the forward-type, context-related class is: FN contains all characters of An and preserves their order, and the segments of FN that are omitted lie mostly in the latter half of FN.
In Table 2, the intuitive meaning of the centered-type, context-related class is: FN contains all characters of An and preserves their order, and the numbers of omitted segments in the front and rear parts of FN are similar.
In Table 2, the intuitive meaning of the backward-type, context-related class is: FN contains all characters of An and preserves their order, and the segments of FN that are omitted lie mostly in the first half of FN.
In Table 2, the intuitive meaning of the variant-order type is: FN contains all characters of An but their order is changed, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the variant-character type is: FN does not contain all characters of An, but the frequency of FN is very high or its relative frequency in the full-name set is very high.
Step C-4: the full names of the same class in the full-name set are sorted according to the priority comprehensive function PRI(Cfn, An).
The priority comprehensive function PRI(Cfn, An) used in step C-4 is defined as follows: […]
where […] is the weight each function takes in the comprehensive evaluation, and the correspondence between Fi and […] is given in Table 3. The magnitude of […] was obtained experimentally according to the degree to which each function constrains the full-name/abbreviation relation.
Table 3
No. | Function content | Function weight |
F1 | Proportion of the characters of An that come from Fn | 0.12 |
F2 | Word order of Fn and An | 0.08 |
F3 | Word coverage of Fn by An | 0.06 |
F4 | Centroid of the segment coverage of Fn by An | 0.08 |
F5 | Longest run of consecutive segments in Fn not covered by An | 0.04 |
F6 | Length relation between Fn and An | 0.06 |
F7 | Frequency of Fn in GoogleArchSet(An) | 0.10 |
F8 | Relative proportion of the characters of An that come from Cfn | 0.12 |
F9 | Relative coverage ratio of Fn in the candidate full-name set | 0.10 |
F10 | Frequency of Fn in the candidate full-name set | 0.12 |
F11 | Relative position of Fn after the elements of the candidate full-name set are sorted by frequency in ascending order | 0.14 |
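The defining formula of PRI is not reproduced in this text, so the following Python sketch assumes the simplest reading consistent with the description of weights: a weighted sum of the eleven function values. That form, the function names pri and rank, and the dictionary layout are assumptions; only the weights are taken from Table 3 as printed.

```python
# Weights from Table 3, keyed by function number.
WEIGHTS = {1: 0.12, 2: 0.08, 3: 0.06, 4: 0.08, 5: 0.04, 6: 0.06,
           7: 0.10, 8: 0.12, 9: 0.10, 10: 0.12, 11: 0.14}

def pri(f_values):
    """Assumed form of PRI: sum over i of weight_i * F_i.

    f_values: dict mapping function number (1..11) to its value for one candidate.
    """
    return sum(weight * f_values[i] for i, weight in WEIGHTS.items())

def rank(candidates):
    """Sort candidate names of one class by PRI, highest first.

    candidates: dict mapping a full name to its dict of function values.
    """
    return sorted(candidates, key=lambda name: pri(candidates[name]), reverse=True)
```

Under this assumed form a function with a larger weight, such as F11, moves a candidate further in the ranking than one with a smaller weight, such as F5.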
To illustrate the practical effect of the present invention, extensive experiments were carried out in which the method was used to find the full names of abbreviations from many disciplines. 3910 Chinese abbreviations An were drawn at random from multiple disciplines, and the invention was used to retrieve their Fn; the results are given in Table 4.
Number of An | Number of An for which Fn was obtained | Percentage of An for which Fn was obtained | Total number of Fn | Precision of the retrieved Fn (sampled) |
3910 | 3561 | 91.07% | 9305 | 94.77% |
From the above experiment, 3188 full names were drawn at random and verified with the decision tree; Table 5 gives the results of the decision-tree verification.
The experiments lead to the following conclusion: the present invention identifies Chinese full names well and is widely applicable, and it can remedy the shortcomings of previous methods for identifying Chinese full names.
Claims (10)
1. A method for acquiring Chinese full names from Web pages, characterized by comprising the steps of:
Step 1: inputting a given Chinese abbreviation;
Step 2: selecting a query pattern and constructing a query term, submitting the query term to the Google search engine, and saving the first N anchor texts as the anchor corpus;
Step 3: obtaining, by regular expressions, the sentences containing the relation of the query term from the anchor corpus, and saving them as the full-name/abbreviation corpus;
Step 4: extracting candidate full names from the full-name/abbreviation corpus with the full-name extraction algorithm EFN to form the candidate full-name set;
Step 5: verifying the candidate full-name set on the basis of the full-name/abbreviation relation constraint to form the full-name set;
Step 6: classifying the full-name set on the basis of the full-name/abbreviation relation constraint, thereby forming a full-name set with class labels.
2. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 2, if the number of query results returned by Google is greater than 100, N is 100; otherwise N is the number of query results returned by Google.
3. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 2, the query pattern is one of two: query pattern 1, "called An for short", and query pattern 2, "An full name"; query pattern 1 is selected first and query pattern 2 second.
4. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in said step 4, the full-name extraction algorithm EFN comprises two algorithms, CFNEA1 and CFNEA2, corresponding respectively to the two query patterns of step 2; that is, when query pattern 1 is selected in step 2, CFNEA1 is used to extract Fn in step 4, and when query pattern 2 is selected in step 2, CFNEA2 is used to extract Fn in step 4.
5. The method for acquiring Chinese full names from Web pages according to claim 4, characterized in that: when query pattern 1 is selected in step 2, step 4 performs the following steps:
Full-name/abbreviation sentences fall mainly into three types: the tag-pair type, the no-suffix type and the suffix type; tag-pair type: no Chinese character follows An, and Cfn is enclosed in paired tags, so Cfn can be extracted directly without determining its boundaries; no-suffix type: no Chinese character follows An, and Cfn is not enclosed in paired tags, so the left boundary of Cfn must be determined; suffix type: Chinese characters follow An, which indicates that An is the front part of another abbreviation "An*"; Cfn is then likewise the front part of the full name "Cfn*" corresponding to "An*", so both the left and right boundaries of Cfn must be determined;
Step A-1: the benchmark (formal) candidate full-name set is extracted with algorithm FCFNEA;
Algorithm for extracting the benchmark candidate full-name set (formal candidate fullname extract algorithm, FCFNEA):
Input: the tag-pair-type sentence set […], the no-suffix-type sentence set […], and the suffix-type sentence set […]
Output: the benchmark candidate full-name set […]
for […], if […] contains an entry […] whose prefix is pre and whose suffix is suf, then […] → […] and […] is deleted from […]; the priority sorting strategy PSCF is used to obtain the best candidate of […], […] → […];
The priority sorting strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:
Priority sorting strategy (priority sort comparison function, PSCF)
1). […];
Step A-2: the non-benchmark (informal) candidate full-name set is extracted with algorithm ICFNEA;
Algorithm for extracting non-benchmark candidate full names (informal candidate fullname extract algorithm, ICFNEA):
Input: the phrase or short sentence to be extracted, Co-referent; the known concept word Inputitem = {C1 C2 … Cn};
Output: the extracted full-name candidate, Candidate;
segment Co-referent and tag parts of speech; the segmentation result is {P1 P2 … Pm}
define the position variables left_flag ← k, left ← 1
for each Ci ∈ {Cn Cn-1 …… C1}
  for each Pj ∈ {Pleft_flag Pleft_flag-1 …… P1}
    if Ci appears in Pj then
      left_flag ← j
      break;
    end if
  end for each
end for each
for each Pk ∈ {P1 P2 …… Pm}
  if the part of speech of Pk ∈ {conjunction, preposition, auxiliary word, verb, measure word, label} and k < left_flag then
    left ← k+1;
  end if
end for each
return Candidate ← {Pleft …… Pm};
Step A-3: the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined by analogy;
The analogy method is given by method 1 and method 2 below;
Method 1, formal representation: […]
Intuitive meaning of method 1: for any two candidates […] and […] in the candidate full-name set, if the following preconditions hold simultaneously:
all Chinese characters of An appear in […];
the prefix of […] relative to […] is not a prefix of any other candidate in the candidate full-name set;
then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set;
Method 2, formal representation: […]
Intuitive meaning of method 2: for any two candidates […] and […] in the candidate full-name set, if the following preconditions hold simultaneously: […]
then the frequency of […] is changed to the sum of the frequencies of […] and […], and […] is deleted from the candidate full-name set;
Step A-4: the left boundary vocabulary LBV and the right boundary vocabulary RBV are read in, and the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined with LBV and RBV; the algorithm is as follows:
Algorithm for re-determining the left and right boundaries with LBV and RBV (RDLRB):
Input: candidate full name Cfn, abbreviation An, left boundary vocabulary LBV, right boundary vocabulary RBV
Output: the candidate full name CFN after its left and right boundaries are re-determined
segment Cfn with ICTCLAS; the result is Cfn_clas = P1 P2 … Pn
determine the segments Pi and Pj of Cfn that correspond to the first and last characters of An respectively
define the left boundary of Cfn: left ← 1;
for each Pk ∈ {Pi-1 …… P1}
  if Pk is in the left boundary vocabulary then
    left ← k+1;
    break;
  end if
end for each
define the right boundary of Cfn: right ← n;
for each Pk ∈ {Pj+1 …… Pn}
  if Pk is in the right boundary vocabulary then
    right ← k-1;
    break;
  end if
end for each
CFN ← {Pleft …… Pright};
return CFN;
Step A-5: the nearest prefix words and nearest suffix words that exist are extracted from the benchmark full-name set and added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively; the nearest left-part words and nearest right-part words that exist are extracted from the non-benchmark full-name set and likewise added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively;
The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step A-5 are defined and generated as follows:
Definition 1: for Cfni and Cfnj that satisfy the conditions of method 1 and method 2 above, write Cfnj = left + Cfni + right, where left (if not empty) is called the left part of Cfnj and right (if not empty) is called the right part of Cfnj; left and right are each segmented with ICTCLAS; the last segment of left is called the nearest left-part word of Cfnj, and the first segment of right is called the nearest right-part word of Cfnj;
Definition 2: for each candidate full name Cfnk in the benchmark candidate full-name set, if every character of An appears in Cfnk, Cfnk is segmented with ICTCLAS into […]; let the segments Fi and Fj be those corresponding in Cfnk to the first and last characters of An respectively; then […] is called the prefix of Cfnk and Fi-1 the nearest prefix word of Cfnk; […] is called the suffix of Cfnk and Fj+1 the nearest suffix word of Cfnk;
Definition 3: dubious left boundary vocabulary (DLBV):
Formal definition:
map<key,map_value> dubious_left_boundary;
Key: string prefix: a nearest left-part word or nearest prefix word
Map_value: int qu: the frequency of prefix as a nearest left-part word
int liu: the frequency of prefix as a nearest prefix word
bool flag: whether prefix needs manual verification
Definition 4: dubious right boundary vocabulary (DRBV):
map<key,map_value> dubious_right_boundary;
Key: string suffix: a nearest right-part word or nearest suffix word
Map_value: int qu: the frequency of suffix as a nearest right-part word
int liu: the frequency of suffix as a nearest suffix word
bool flag: whether suffix needs manual verification
Step A-6: the dubious left boundary vocabulary and the dubious right boundary vocabulary are manually verified to generate the left boundary vocabulary and the right boundary vocabulary;
The dubious left boundary vocabulary is manually verified as follows:
If prefix satisfies:
prefix is flagged for manual verification
the frequency of prefix as a nearest left-part word > 2
the frequency of prefix as a nearest prefix word < 2
the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word
then prefix is manually verified to determine whether it is a left boundary word; if it is, it is added to the left boundary vocabulary;
Definition 5: left boundary vocabulary (LBV):
Formal definition:
map<key,map_value> left_boundary;
Key: string prefix: a left boundary word
Map_value: int num: the number of Cfn whose left boundary was determined with prefix
The dubious right boundary vocabulary is manually verified as follows:
If suffix satisfies:
suffix is flagged for manual verification
the frequency of suffix as a nearest right-part word > 2
the frequency of suffix as a nearest suffix word < 2
the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word
then suffix is manually verified to determine whether it is a right boundary word; if it is, it is added to the right boundary vocabulary;
Definition 6: right boundary vocabulary (RBV):
map<key,map_value> right_boundary_cfn;
Key: string suffix: a right boundary word
Map_value: int num: the number of Cfn whose right boundary was determined with suffix
Step A-7: the benchmark candidate full-name set and the non-benchmark candidate full-name set are merged to generate the candidate full-name set;
Steps A-1 to A-2 form algorithm CFNEA1.
6. The method for acquiring Chinese full names from Web pages according to claim 4, characterized in that: when query pattern 2 is selected in step 2, step 4 performs the following steps:
Step B-1: the candidate full-name set is extracted with algorithm CFNEA2;
Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA2):
Input: the prefix Prefix, the known abbreviation Inputitem, and the phrase or short sentence to be extracted, Co-referent
Output: the extracted full-name candidate, Candidate
define the marker flag ← 0; segment Co-referent (the open-source version apparently cannot be used for commercial purposes), denoted {P1 P2 … Pn}
for each Pi ∈ {P1 P2 …… Pn}
  if flag = 0 and Pi shares a character with Prefix and Pi shares no character with Inputitem then
    flag ← 1;
  end if
  if flag = 1 and Pi shares no character with Prefix then
    break;
  end if
  if Pi shares a character with Inputitem then
    break;
  end if
end for each
if flag = 0 then i ← 0;
Candidate ← {Pi …… Pn}
return Candidate
The candidate full-name set is obtained by the above operations.
7. a kind of method of obtaining the Chinese full name from the Web webpage according to claim 1 is characterized in that: in described step 5, if the full name set is sky, and also have query pattern available in the step 2, then re-execute step 2-6; If full name set does not have alternative query pattern in the step 2 simultaneously for empty, then withdraw from, show can not from Web search the full name of given abbreviation.
8. a kind of method of from the Web webpage, obtaining the Chinese full name according to claim 1, it is characterized in that: in described step 5, full abbreviation relation constraint is four-tuple R=(Fn, An, a F, A), wherein, Fn is the full name of object, and An is the abbreviation of object, F is the constraint function collection between Fn and the An, and A is the axiom of constraint collection that Fn and An must satisfy; The constraint function collection represents the constraint between Fn and the An quantitatively, and the axiom of constraint collection represents the constraint between Fn and the An qualitatively.
9. The method for acquiring Chinese full names from Web pages according to claim 8, characterized in that steps 5 and 6 are implemented as follows:
Step C-1: verify each candidate full name in the candidate full name set using constraint axioms 1-4 of the constraint axiom set;
Step C-2: generate a decision tree from the constraint function set and use it to classify the candidate full names in the candidate full name set; remove the candidates classified as "F1", "F2" or "F3" and retain those classified as "T", thereby generating the full name set;
"F1" denotes a low-frequency variant-character error, "F2" a high-frequency variant-character error, "F3" a low-frequency variant-order error, and "T" correct;
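The filtering in step C-2 can be sketched as follows. The decision tree itself is not reproduced here; the sketch assumes some classifier callable that returns one of the labels named in step C-2, and the lookup table standing in for it is illustrative data only.

```python
from typing import Callable, Iterable, List

KEPT_LABEL = "T"  # label for a correct candidate, as named in step C-2


def build_full_name_set(
    candidates: Iterable[str], classify: Callable[[str], str]
) -> List[str]:
    """Keep only the candidates that the classifier labels as correct."""
    return [c for c in candidates if classify(c) == KEPT_LABEL]


# Hypothetical stand-in for the decision tree: a fixed table of labels.
labels = {"北京大学": "T", "北方大学": "F1", "大北京": "F3"}
full_name_set = build_full_name_set(labels, labels.get)
```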
Step C-3: classify the full name set on the basis of the constraint function set;
According to whether the full name has variant characters or variant order, full names are divided into the ordinary type, the variant-character type and the variant-order type; the ordinary type is further divided, according to whether it depends on context, into the strong context-independent type, the weak context-independent type and the context-dependent type; the context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of Fn in the full name set; and the context-dependent type is divided into the forward type, the centered type and the backward type according to the center of gravity of An's coverage of Fn;
The specific classification criteria and the conditions each class of full name must satisfy are:
Intuitive meaning of the high-frequency strong context-independent type: Fn contains all characters of An with the character order unchanged, every word segment of Fn has a counterpart in An, and Fn has the highest frequency in the full name set;
Intuitive meaning of the low-frequency strong context-independent type: Fn contains all characters of An with the character order unchanged, every word segment of Fn has a counterpart in An, and Fn does not have the highest frequency in the full name set;
Intuitive meaning of the high-frequency weak context-independent type: Fn contains all characters of An with the character order unchanged, most word segments of Fn have a counterpart in An, and Fn has the highest frequency in the full name set;
Intuitive meaning of the low-frequency weak context-independent type: Fn contains all characters of An with the character order unchanged, most word segments of Fn have a counterpart in An, and Fn does not have the highest frequency in the full name set;
Intuitive meaning of the forward context-dependent type: Fn contains all characters of An with the character order unchanged, and the omitted word segments of Fn lie mostly in the second half of Fn;
Intuitive meaning of the centered context-dependent type: Fn contains all characters of An with the character order unchanged, and the numbers of word segments omitted from the front and rear parts of Fn are similar;
Intuitive meaning of the backward context-dependent type: Fn contains all characters of An with the character order unchanged, and the omitted word segments of Fn lie mostly in the first half of Fn;
Intuitive meaning of the variant-order type: Fn contains all characters of An but the character order is changed, and Fn has the highest frequency in the full name set;
Intuitive meaning of the variant-character type: Fn does not contain all characters of An, but the frequency of Fn is very high or its relative frequency in the full name set is very high;
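The top-level split into three types depends only on whether the characters of An occur in Fn and in what order, so it can be sketched directly. This is a partial illustration under stated assumptions: the frequency conditions and the context-related subtypes are not modeled, and the type names and sample strings are hypothetical.

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


def coarse_type(fn: str, an: str) -> str:
    """Top-level type of a full name Fn with respect to an abbreviation An."""
    if not all(ch in fn for ch in an):
        return "variant-character"  # some character of An is missing from Fn
    if is_subsequence(an, fn):
        return "ordinary"  # all characters present, order kept
    return "variant-order"  # all characters present, order changed
```

For example, `coarse_type("北京大学", "北大")` yields `"ordinary"`, because both characters of the abbreviation occur in the full name in the same order.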
Step C-4: sort the full names of the same class in the full name set according to the priority synthetic function PRI(Cfn, An);
The priority synthetic function PRI(Cfn, An) used in step C-4 is defined as follows:
10. The method for acquiring Chinese full names from Web pages according to claim 8 or 9, characterized in that the constraint function set has the following specific meaning:
Constraint function 1: the ratio of characters of An that come from Fn
A full name contains all the Chinese characters of its abbreviation, that is, each Chinese character of An comes from Fn; within the candidate full name set, a candidate full name containing a higher ratio of the characters of An has higher priority;
The formal definition and calculation of constraint function 1 are as follows:
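The formula itself does not appear in this text. Purely as an illustration, one reading consistent with the description is the share of the characters of An that also occur in Fn; the function name and the sample strings are hypothetical, and the patent's actual definition may differ.

```python
def char_source_ratio(fn: str, an: str) -> float:
    """Share of the characters of An that also occur in Fn (assumed reading)."""
    if not an:
        raise ValueError("the abbreviation must not be empty")
    return sum(1 for ch in an if ch in fn) / len(an)


ratio_full = char_source_ratio("北京大学", "北大")     # both characters found
ratio_partial = char_source_ratio("北京大学", "北师")  # only one character found
```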
Constraint function 2: the character order of Fn and An
During abbreviation, most abbreviations preserve the character order of the full name: the characters of An appear strictly in the order in which they occur in Fn;
The formal definition and calculation of constraint function 2 are as follows:
Fn and An have the same character order, and every character of An appears in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0;
Constraint function 3: the word-segment coverage rate of An over Fn
A full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments generally do not exceed half of the word segments of the full name; the more word segments of a candidate full name are covered by the abbreviation, the more likely the candidate is the full name;
The formal definition and calculation of constraint function 3 are as follows:
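The formula is not reproduced in this text. As a hedged sketch, the coverage rate can be read as the fraction of word segments of Fn that the abbreviation covers. Two assumptions are made here that the source does not state: Fn arrives already segmented, and a segment counts as covered when it shares at least one character with An.

```python
from typing import Sequence


def segment_coverage(fn_segments: Sequence[str], an: str) -> float:
    """Fraction of the word segments of Fn that share a character with An."""
    if not fn_segments:
        raise ValueError("the full name must have at least one segment")
    covered = sum(1 for seg in fn_segments if any(ch in an for ch in seg))
    return covered / len(fn_segments)


# Illustrative segmentation: three of the four segments are covered.
coverage = segment_coverage(["中国", "科学", "技术", "大学"], "中科大")
```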
Constraint function 4: the center of gravity of An's word-segment coverage of Fn
A full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments should be evenly distributed over the full name and should not all be concentrated in its front part or its rear part;
The formal definition and calculation of constraint function 4 are as follows:
Constraint function 5: the longest run of consecutive word segments of Fn not covered by An
A candidate full name usually consists of several word segments. In some cases one or more word segments of the full name are omitted in the abbreviation, but the omitted segments usually do not occur consecutively in the full name; that is, the probability that consecutive word segments of the full name are omitted in the abbreviation is small;
The formal definition and calculation of constraint function 5 are as follows:
where N denotes the number of uncovered word-segment strings contained in Fn;
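The two quantities named for constraint function 5, the longest uncovered run and the number N of uncovered segment strings, can be computed in one pass. This sketch assumes a pre-segmented Fn and treats a segment as covered when it shares a character with An; how the patent combines the two values into a score is not shown in this text and is not attempted here.

```python
from typing import Sequence, Tuple


def uncovered_runs(fn_segments: Sequence[str], an: str) -> Tuple[int, int]:
    """Return (length of the longest uncovered run, number of uncovered runs)."""
    longest = current = run_count = 0
    for seg in fn_segments:
        if any(ch in an for ch in seg):
            current = 0  # a covered segment ends the current run
        else:
            if current == 0:
                run_count += 1  # a new uncovered run starts here
            current += 1
            longest = max(longest, current)
    return longest, run_count


# Illustrative segmentation: the two middle segments form one uncovered run.
longest_run, n_runs = uncovered_runs(["中国", "科学", "技术", "大学"], "中大")
```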
Constraint function 6: the length relation between Fn and An
A standard abbreviation is usually not shortened excessively, so that most people can grasp the meaning from the name; the length of the full name corresponding to most abbreviations therefore falls within a range, generally 1.5 to 5 times the length of the abbreviation, and the probability that the full name length falls outside this range is small;
The formal definition and calculation of constraint function 6 are as follows:
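The formula is not reproduced in this text, but the 1.5 to 5 range stated above can be checked directly. The sketch below only computes the length ratio and tests it against that range; the function names are hypothetical, and how the patent turns the ratio into a score is not assumed.

```python
def length_ratio(fn: str, an: str) -> float:
    """Length of the full name divided by the length of the abbreviation."""
    if not an:
        raise ValueError("the abbreviation must not be empty")
    return len(fn) / len(an)


def length_in_typical_range(
    fn: str, an: str, low: float = 1.5, high: float = 5.0
) -> bool:
    """True if the full name is 1.5 to 5 times as long as the abbreviation."""
    return low <= length_ratio(fn, an) <= high


in_range = length_in_typical_range("北京大学", "北大")  # ratio 2.0
```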
Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An)
When a full name is searched on Google by its abbreviation, a candidate full name that occurs more frequently in GoogleArchSet(An) has higher priority;
The formal definition and calculation of constraint function 7 are as follows:
When Fn is searched by An, several candidate full names are sometimes obtained; they form the candidate full name set Set_CFN. For any candidate full name Cfn_i in Set_CFN, the expected values of the other candidate full names in Set_CFN can be used for comparison when analyzing FA(Cfn_i, An);
The following 4 constraint functions are defined on the candidate full name set:
Constraint function 8: the relative ratio of characters of An that come from Cfn
Compared with constraint function 1, constraint function 8 emphasizes the relativity of a candidate full name within Set_CFN;
The formal definition and calculation of constraint function 8 are as follows:
Constraint function 9: the relative coverage rate of Fn in the candidate full name set
Compared with constraint function 3, constraint function 9 emphasizes the relativity of a candidate full name within Set_CFN; if an abbreviation's coverage rate over the candidate full names is not high, the candidate full name with the relatively higher coverage rate has higher priority;
The formal definition and calculation of constraint function 9 are as follows:
Constraint function 10: the frequency of Fn within the candidate full name set
When Fn is searched by An, the frequencies of all candidate full names in the set are sometimes very low, which weakens the effect of constraint function 7; constraint function 10 therefore considers the relative frequency of each candidate full name: within the candidate full name set, a candidate full name with a relatively higher frequency has higher priority;
The formal definition and calculation of constraint function 10 are as follows:
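The formula is not reproduced in this text. As one possible reading, offered only as an assumption, the relative frequency of a candidate can be taken as its share of the total frequency over the candidate set; the patent may normalize differently (for instance by the maximum). Names and counts below are illustrative.

```python
from typing import Dict


def relative_frequencies(freq: Dict[str, int]) -> Dict[str, float]:
    """Each candidate's share of the total frequency in the candidate set."""
    total = sum(freq.values())
    if total == 0:
        raise ValueError("at least one candidate must have a non-zero frequency")
    return {name: count / total for name, count in freq.items()}


# Low absolute counts still yield a clear relative ordering.
shares = relative_frequencies({"北京大学": 3, "北方大学": 1})
```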
Constraint function 11: the relative position of Fn after the elements of the candidate full name set are sorted by frequency in ascending order
When the candidate full name set contains many elements, candidates with lower frequency are relatively less important;
The formal definition and calculation of constraint function 11 are as follows:
The lower the value of constraint function 11, the lower the importance of the candidate full name;
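A rank-based value with the stated property (lower value, lower importance) can be sketched by sorting the candidates by ascending frequency and dividing each rank by the set size. The normalization by set size and the handling of ties (insertion order, via Python's stable sort) are assumptions of this sketch, not taken from the patent.

```python
from typing import Dict


def relative_positions(freq: Dict[str, int]) -> Dict[str, float]:
    """Rank of each candidate after an ascending sort by frequency, scaled to (0, 1]."""
    ordered = sorted(freq, key=lambda name: freq[name])
    size = len(ordered)
    return {name: (rank + 1) / size for rank, name in enumerate(ordered)}


# Hypothetical candidates A, B, C with frequencies 5, 1 and 3.
positions = relative_positions({"A": 5, "B": 1, "C": 3})
```

The least frequent candidate receives the smallest value and the most frequent one receives 1.0.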
The specific meaning of the constraint axioms is:
Constraint axiom 1: the unequal character-length axiom
Intuitive meaning: in a full name-abbreviation relation, the number of characters of Fn must be greater than the number of characters of An;
Constraint axiom 2: the declarative mood axiom
Formal representation:
Intuitive meaning: Fn and An do not contain interrogative words;
Constraint axiom 3: the form non-repetition axiom
Intuitive meaning: in a full name-abbreviation relation, Fn and An cannot be Chinese character strings of the form ss, where s is a Chinese character string;
Constraint axiom 4: the semantic non-repetition axiom
Formal representation:
Intuitive meaning: Fn must not be semantically repetitive;
Constraint axiom 5: the full name-abbreviation equivalence axiom
Intuitive meaning: in a full name-abbreviation relation, Fn must be in the full name set of An, and An must be in the abbreviation set of Fn;
Constraint axiom 5 is not used to verify full name-abbreviation relations; it is used to expand the full name-abbreviation relation knowledge base.
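Constraint axioms 1 to 3 lend themselves to a direct check, sketched below. The interrogative word list is a small hypothetical sample rather than the patent's list, and constraint axiom 4 (semantic repetition) is left out because it needs semantic knowledge that a string test cannot supply.

```python
# Hypothetical sample of interrogative words; the patent's list is not given here.
INTERROGATIVES = ("什么", "怎么", "为什么", "吗", "哪")


def is_ss_form(text: str) -> bool:
    """True if the string is some non-empty string s written twice in a row."""
    half, remainder = divmod(len(text), 2)
    return half > 0 and remainder == 0 and text[:half] == text[half:]


def passes_axioms_1_to_3(fn: str, an: str) -> bool:
    """Check the character-length, declarative mood and form non-repetition axioms."""
    if len(fn) <= len(an):  # axiom 1: Fn must be longer than An
        return False
    if any(word in fn or word in an for word in INTERROGATIVES):  # axiom 2
        return False
    if is_ss_form(fn) or is_ss_form(an):  # axiom 3
        return False
    return True
```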
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011102531001A CN102955818A (en) | 2011-08-31 | 2011-08-31 | Method for acquiring full names in Chinese from Web page |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102955818A true CN102955818A (en) | 2013-03-06 |
Family ID: 47764629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011102531001A Pending CN102955818A (en) | 2011-08-31 | 2011-08-31 | Method for acquiring full names in Chinese from Web page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102955818A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463696A (en) * | 2017-08-15 | 2017-12-12 | 中译语通科技(北京)有限公司 | A kind of method of Webpage largest block extraction |
CN108460016A (en) * | 2018-02-09 | 2018-08-28 | 中云开源数据技术(上海)有限公司 | A kind of entity name analysis recognition method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1607493A (en) * | 2003-09-24 | 2005-04-20 | 王子尧 | Chinese character unit whole tone code fetch input method |
Non-Patent Citations (2)
Title |
---|
LI-XING XIE ET AL.: "《EXTRACTING CHINESE ABBREVIATION-DEFINITION PAIRS FROM ANCHOR TEXTS》", 《IEEE:PROCEEDINGS OF THE 2011 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS》 * |
谢丽星等: "《基于用户查询日志和锚文字的汉语缩略语识别》", 《中国计算机语言学研究前沿进展》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN105426539B (en) | A kind of lucene Chinese word cutting method based on dictionary | |
CN103500160B (en) | A kind of syntactic analysis method based on the semantic String matching that slides | |
CN104008092B (en) | Method and system of relation characterizing, clustering and identifying based on the semanteme of semantic space mapping | |
CN102651003B (en) | Cross-language searching method and device | |
CN101118538B (en) | Method and system for recognizing feature lexical item in Chinese naming entity | |
CN103235774A (en) | Extraction method of feature words of science and technology project application form | |
CN103699529A (en) | Method and device for fusing machine translation systems by aid of word sense disambiguation | |
CN103970730A (en) | Method for extracting multiple subject terms from single Chinese text | |
CN109376352A (en) | A kind of patent text modeling method based on word2vec and semantic similarity | |
CN105138514A (en) | Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction | |
CN108549625B (en) | Chinese chapter expression theme analysis method based on syntactic object clustering | |
CN105808711A (en) | System and method for generating model based on semantic text concept | |
CN102929902A (en) | Character splitting method and device based on Chinese retrieval | |
Zhang et al. | Rule-based extraction of spatial relations in natural language text | |
CN109614620A (en) | A kind of graph model Word sense disambiguation method and system based on HowNet | |
CN104391837A (en) | Intelligent grammatical analysis method based on case semantics | |
CN110390022A (en) | A kind of professional knowledge map construction method of automation | |
CN104598441B (en) | A kind of method that computer splits Chinese sentence | |
CN102955819A (en) | Method for acquiring shortened form in Chinese from Web page | |
CN103258032A (en) | Parallel webpage obtaining method and parallel webpage obtaining device | |
CN102955818A (en) | Method for acquiring full names in Chinese from Web page | |
CN102982063A (en) | Control method based on tuple elaboration of relation keywords extension | |
Rondon et al. | Never-ending multiword expressions learning | |
Zamin et al. | A statistical dictionary-based word alignment algorithm: An unsupervised approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20130306 |