CN102955818A - Method for acquiring full names in Chinese from Web page - Google Patents

Method for acquiring full names in Chinese from Web page

Info

Publication number
CN102955818A
Authority
CN
China
Prior art keywords
full name
candidate
word
frequency
cfn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102531001A
Other languages
Chinese (zh)
Inventor
王石
丁远钧
符建辉
王卫民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN2011102531001A
Publication of CN102955818A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for acquiring Chinese full names from Web pages. The method comprises: inputting a known abbreviation; selecting a query pattern to construct a query item; submitting the query item to Google to obtain anchor texts; obtaining a corpus of full names and abbreviations from the anchor texts; extracting candidate full names with an extraction algorithm; and ranking the candidate full names with a comprehensive priority function. Two query patterns are involved, each with a corresponding full name extraction algorithm. The invention also defines an ontology of the relation between full names and abbreviations, comprising a set of constraint axioms and a set of constraint functions: the constraint axioms express the constraints between a full name and its abbreviation qualitatively, and the constraint functions express them quantitatively. Based on this ontology, a full name verification method and a full name classification method are proposed. The method enables large-scale, high-accuracy acquisition of full names and explores computer-based classification of full names, thereby providing effective support for the intelligent acquisition of large-scale knowledge.

Description

A method for acquiring Chinese full names from Web pages
Technical field
The present invention relates to full name acquisition technology in the fields of Chinese information processing and information retrieval, and in particular to a method for acquiring Chinese full names from Web pages that is multidisciplinary, large-scale, and highly accurate.
Background art
Natural language processing is an important problem in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. With the widespread use of computers and the Internet, the amount of natural language text that computers can access has grown at an unprecedented rate, and demand for applications that handle massive information, such as text mining, information extraction, cross-language information processing, and human-computer interaction, is growing rapidly. The object of natural language processing has accordingly shifted from small-scale restricted language to large-scale real text, and this research will have a far-reaching influence on people's lives.
Chinese information processing studies how to use computers to process Chinese information automatically. Chinese is a paratactic language: compared with Western languages it lacks explicit marking, and its grammar, semantics, and pragmatics are more flexible, which makes it harder for computers to understand and process. Many difficulties remain before computers can process Chinese information well. At present, Chinese information processing has achieved some results in fields such as speech recognition, word segmentation, and machine translation. Raising the degree of automation of Chinese information processing will bring considerable benefits to China's science and technology, culture, economy, and security.
Information retrieval studies techniques for obtaining the required information quickly and accurately from large volumes of complex information. After many years of development, information retrieval technology is now quite mature, and new retrieval techniques are developing toward intelligence, mobility, diversity, and personalization.
A full name (Fn) is the complete designation of a name. An abbreviation (An) is the designation obtained by compressing and simplifying a full name so that it can be expressed concisely. If Fn and An stand in the full name-abbreviation relation, Fn is called the full name of An and An the abbreviation of Fn, denoted FA(Fn, An). Going from a full name to an abbreviation can be regarded as compressing information, and going from an abbreviation to a full name as decompressing it. For example, compressing c1 = "Inst. of Computing Techn. Academia Sinica" yields c2 = "institute is calculated by the Chinese Academy of Sciences", and compressing c2 again yields c3 = "Computer Department of the Chinese Academy of Science"; decompressing c3 yields c2, and decompressing c2 yields c1. Full name and abbreviation are relative concepts: in this example c2 is an abbreviation relative to c1 but a full name relative to c3, so it is meaningless to say in isolation that c2 is a full name or an abbreviation.
Acquiring full name-abbreviation relations is a basic and key problem in applications such as knowledge acquisition from text (KAT) and information retrieval. Acquisition methods fall into two broad classes. The first is pattern-based: it mainly uses linguistics and natural language processing techniques, extracts relation patterns through lexical and syntactic analysis, and then obtains full name-abbreviation relations by pattern matching; its accuracy depends on linguistic knowledge and on the pattern base. The second is statistics-based: it relies mainly on corpora and statistical language models and obtains full name-abbreviation relations by computing the degree of association between concepts; its accuracy and efficiency rarely reach a level suitable for practical use. The acquisition problem can also be approached from two angles. One is mining: obtaining full name-abbreviation pairs without any external input. The other is lookup: finding the abbreviation for a known full name, or the full name for a known abbreviation.
Unless otherwise specified, "full name" and "abbreviation" in the present invention refer to Chinese full names and Chinese abbreviations.
Summary of the invention
To address the limitations and low accuracy of existing techniques for acquiring full name-abbreviation relations, the invention provides a highly accurate method for acquiring Chinese full names from Web pages that is applicable across disciplines and at very large scale.
To solve the above problems, the invention provides a method for acquiring Chinese full names from Web pages, comprising the steps of:
Step 1, input a given Chinese abbreviation;
Step 2, select a query pattern and construct a query item, submit the query item to the Google search engine, and save the first N anchor texts as the anchor corpus;
Step 3, using regular expressions, extract from the anchor corpus the sentences that contain the relation expressed by the query item, and save them as the full name-abbreviation corpus;
Step 4, use the full name extraction algorithm EFN to extract candidate full names from the full name-abbreviation corpus, forming the candidate full name set;
Step 5, verify the candidate full name set against the full name-abbreviation relation constraint, forming the full name set;
Step 6, classify the full name set according to the full name-abbreviation relation constraint, forming a full name set with category labels.
In the above technical solution, step 2 uses two query patterns: query pattern 1, "being called for short An", and query pattern 2, "An full name". In an experiment with 4000 Chinese abbreviations, query pattern 1 returned an anchor corpus for 88.75% of them, query pattern 2 for 24.76%, and at least one of the two patterns for 91.07%. Therefore, to improve query efficiency, query pattern 1 is tried first and query pattern 2 second.
In the above technical solution, the full name extraction algorithm EFN in step 4 comprises two algorithms, EFN1 and EFN2, which correspond to the two query patterns of step 2: when query pattern 1 is selected in step 2, EFN1 is used to extract Fn in step 4; when query pattern 2 is selected, EFN2 is used.
In the above technical solution, if the full name set obtained in step 5 is empty and another query pattern is still available in step 2, steps 2 to 6 are executed again; if the full name set is empty and no alternative query pattern remains, the method exits, indicating that the full name of the given abbreviation cannot be found on the Web.
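The retry logic across query patterns can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patent's implementation: the function name `acquire_full_names` and the callables `search`, `extract`, and `verify` are hypothetical stand-ins for the Google query, the EFN1/EFN2 extraction algorithms, and the constraint-based verification, and the substring filter is a simplification of the regular-expression step.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: the patent names the stages but not their signatures.
BuildQuery = Callable[[str], str]               # An -> query item
Extract = Callable[[str, List[str]], List[str]]  # (An, corpus) -> candidate full names
Pattern = Tuple[BuildQuery, Extract]


def acquire_full_names(
    an: str,
    patterns: Sequence[Pattern],
    search: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], List[str]],
    top_n: int = 100,
) -> List[str]:
    """Try each query pattern in priority order until verification succeeds."""
    for build_query, extract in patterns:
        query = build_query(an)                    # step 2: construct the query item
        anchors = search(query)[:top_n]            # step 2: keep the first N anchor texts
        corpus = [a for a in anchors if an in a]   # step 3: simplified stand-in for the regex filter
        candidates = extract(an, corpus)           # step 4: EFN1 or EFN2, depending on the pattern
        full_names = verify(an, candidates)        # step 5: constraint-based verification
        if full_names:
            return full_names                      # step 6 (classification) would follow here
    return []                                      # no pattern yielded a verified full name
```

Passing the patterns as an ordered sequence keeps the priority (pattern 1 before pattern 2) in the caller's hands, and an empty result signals that no full name was found.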
In the above technical solution, the full name-abbreviation relation constraint used in step 5 is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Both kinds of constraint are explained further below.
Beneficial effects: the invention obtains the full name corresponding to a known abbreviation from the Web, that is, it acquires full name-abbreviation relations from the lookup angle. It uses a pattern-based method to obtain candidate full names from Google and a statistics-based method to verify them. The method is multidisciplinary, large-scale, and highly accurate, and it explores computer-based classification of full names, providing effective support for the intelligent acquisition of large-scale knowledge.
Description of drawings
Fig. 1 is an overall schematic diagram of obtaining full names from an abbreviation;
Fig. 2 is a flowchart of obtaining full names using query pattern 1;
Fig. 3 is a flowchart of obtaining full names using query pattern 2;
Fig. 4 is a flowchart of post-processing the candidate full name set;
Fig. 5 is the verification decision tree generated from the full name-abbreviation constraint function set.
Embodiment
The invention is further described below with reference to the drawings and specific embodiments.
Before the method is described, the formation rules and word-formation patterns of abbreviations in the full name-abbreviation relation are first summarized. Going from a full name to an abbreviation can be regarded as compressing information, and this compression sometimes involves semantically equivalent substitution or a change of word order. Full name-abbreviation relations are therefore divided into three types: the ordinary type, the variant-character type, and the variant-order type.
Ordinary type: every character of the abbreviation appears in the full name, in the same order as in the full name. For example, Fn = "People's Republic of China (PRC)", An = "China";
Variant-character type: some characters of the abbreviation do not appear in the full name; that is, going from full name to abbreviation involves not only compression of information but also semantically equivalent substitution. For example, Fn = "Wa Huang Shengmumiao", An = "Chinese mythology goddess mausoleum";
Variant-order type: the order of the characters in the abbreviation differs from the order of their corresponding components in the full name. For example, Fn = "Harbin the 6th pharmaceutical factory", An = "breathes out medicine six factories".
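The three relation types above can be told apart with two character-level checks. The following Python sketch is illustrative only: the function names, the returned labels, and the Chinese example strings in the usage note are choices made for this sketch and are not taken from the source text.

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the match, which enforces order.
    return all(ch in remaining for ch in short)


def classify_relation(fn: str, an: str) -> str:
    """Label a (full name, abbreviation) pair with one of the three types."""
    if any(ch not in fn for ch in an):
        # Some character of the abbreviation is absent from the full name.
        return "variant-character"
    if is_subsequence(an, fn):
        # All characters present and in full-name order.
        return "ordinary"
    # All characters present, but their order differs from the full name.
    return "variant-order"
```

For instance, `classify_relation("北京大学", "北大")` yields `"ordinary"`, while a pair whose abbreviation reorders characters of the full name yields `"variant-order"`.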
The present invention defines the full name-abbreviation relation constraint to represent the constraints between Fn and An. The constraint is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Before the constraint function set and the constraint axiom set are described in detail, the basic symbols used below are listed:
An denotes an abbreviation;
Cfn denotes a candidate full name of An;
Fn denotes a full name of An;
GoogleArchSet(An) denotes the Google anchor text set of An, i.e., the set of the first 100 anchor texts returned when Google is searched for the full name corresponding to An; if the total number N of anchor texts returned is less than 100, GoogleArchSet(An) contains only those N anchor texts;
CfnSet(An) denotes the candidate full name set of An, i.e., the set of candidate full names for An extracted from GoogleArchSet(An);
N_CfnSet(An) denotes the number of candidate full names in CfnSet(An);
FnSet(An) denotes the full name set of An, i.e., the set formed by the elements of CfnSet(An) that pass verification;
AnSet(Fn) denotes the abbreviation set of Fn, i.e., for a given Fn, the set of corresponding abbreviations obtained from Google;
FA(Fn, An) denotes that Fn and An stand in the full name-abbreviation relation;
length(str) denotes the length of the Chinese character string str, i.e., the number of Chinese characters in str;
N_word(Fn, An) denotes the number of Chinese characters that appear in both Fn and An;
N_Clas(Fn) denotes the number of word segments obtained after Fn is segmented;
N_Cover(Fn, An) denotes the number of word segments of Fn covered by An;
CoverSet(Fn, An) denotes the set of word segments of Fn covered by An;
p denotes a word segment contained in a full name;
p1/p2/.../pm denotes the segmentation sequence formed by the word segments p1, p2, ..., pm, where "/" is the separator between segments;
centre(Fn) denotes the position of the segment centre of Fn, i.e., after Fn is segmented, the position of the middle segment, or the mean position of the two middle segments: centre(Fn) = (N_Clas(Fn) + 1) / 2;
d_i(Fn) denotes the centre offset of the i-th word segment of Fn, i.e., the displacement between the position of the i-th segment and the segment centre of Fn: d_i(Fn) = i - centre(Fn);
d_max(Fn) denotes the maximum centre offset of Fn, i.e., the largest centre offset among all word segments of Fn: d_max(Fn) = (N_Clas(Fn) - 1) / 2;
Len_i(Fn, An) denotes the number of word segments in the i-th uncovered segment string. After Fn is segmented, the segments not covered by An form uncovered segment strings: uncovered segments that are adjacent in Fn join into one string, and an uncovered segment with no uncovered neighbour forms a string by itself;
Freq(Fn, An) denotes the number of times Fn is extracted from GoogleArchSet(An);
[Symbol image not recovered] denotes an infinitesimally small number;
loca(Cfn, An) denotes the frequency rank of Cfn in CfnSet(An), i.e., the position of Cfn after the elements of CfnSet(An) are sorted in ascending order of Freq(Cfn, An);
NoInclude(s1, Set) denotes that no Chinese character string in the string set Set is a substring of the Chinese character string s1;
Interrogative denotes the set of interrogative words, including "what", "how", and so on;
concat(s1, s2) denotes the Chinese character string obtained by concatenating s1 and s2;
concat(s1, ..., sn) denotes the Chinese character string obtained by concatenating s1, ..., sn in order;
Contain(s1, s2) denotes that every character of the Chinese character string s2 appears in the Chinese character string s1;
Include(s1, s2) denotes that the Chinese character string s2 is a proper substring of the Chinese character string s1;
prefix(s1, s2) denotes the prefix of s1 relative to s2, i.e., the part of s1 that precedes s2: s1 = concat(prefix(s1, s2), s2, s3), where prefix(s1, s2) and s3 may each be the empty string;
[Notation image not recovered] denotes deleting one Chinese character string from another.
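Several of the symbols listed above map directly onto small helper functions. The Python sketch below is illustrative: the function names are lower-case renderings chosen here, `n_word` counts distinct shared characters (the text does not say whether repeats count), and `uncovered_run_lengths` takes caller-supplied coverage flags because the text does not spell out how a segment is judged to be covered.

```python
from itertools import groupby
from typing import Iterable, List, Sequence


def n_word(fn: str, an: str) -> int:
    """N_word(Fn, An): characters appearing in both strings (distinct, by assumption)."""
    return len(set(fn) & set(an))


def contain(s1: str, s2: str) -> bool:
    """Contain(s1, s2): every character of s2 appears in s1."""
    return all(ch in s1 for ch in s2)


def include(s1: str, s2: str) -> bool:
    """Include(s1, s2): s2 is a proper substring of s1."""
    return s2 != s1 and s2 in s1


def no_include(s1: str, strings: Iterable[str]) -> bool:
    """NoInclude(s1, Set): no string in the set is a substring of s1."""
    return not any(s in s1 for s in strings)


def centre(n_segments: int) -> float:
    """centre(Fn) = (N_Clas(Fn) + 1) / 2, with segments numbered from 1."""
    return (n_segments + 1) / 2


def centre_offset(i: int, n_segments: int) -> float:
    """d_i(Fn) = i - centre(Fn)."""
    return i - centre(n_segments)


def max_centre_offset(n_segments: int) -> float:
    """Maximum centre offset, (N_Clas(Fn) - 1) / 2."""
    return (n_segments - 1) / 2


def uncovered_run_lengths(covered: Sequence[bool]) -> List[int]:
    """Len_i values: lengths of maximal runs of uncovered (False) segments."""
    return [len(list(run)) for is_covered, run in groupby(covered) if not is_covered]
```

With five segments, `centre(5)` is 3.0 and the last segment's offset equals `max_centre_offset(5)`, which is 2.0. Coverage flags of three uncovered segments, one covered, one uncovered, one covered give run lengths `[3, 1]`.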
The specific meaning of each of the 11 constraint functions in the constraint function set is described below.
Constraint function 1: the proportion of the characters of An that come from Fn.
In general, a full name contains all the Chinese characters of its abbreviation. For example, with An = "Beijing University" and Fn = "Peking University", every Chinese character of An comes from Fn. Within the candidate full name set, a candidate full name that contains a higher proportion of the characters of An has higher priority.
The formal definition and calculation of constraint function 1 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
[Formula image not recovered: definition of constraint function 1]
For example, An = "Eight Trigram Palm", Cfn1 = "eight-diagram palm", Cfn2 = "a chain of fist of Eight Diagrams". According to constraint function 1,
[Formula image not recovered: values of constraint function 1 for Cfn1 and Cfn2]
so the priority of Cfn1 is higher than that of Cfn2.
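The idea behind constraint function 1 can be illustrated with a plain character ratio. This is a sketch under an explicit assumption: the patent's own formula is not reproduced in this text and is described as an improved version, so the function below (hypothetical name `char_ratio`) only computes the unmodified proportion named in the heading and may differ from the actual definition.

```python
def char_ratio(cfn: str, an: str) -> float:
    """Share of the abbreviation's characters that also occur in the candidate.

    Assumption: a plain ratio, not the patent's (unrecovered) improved formula.
    """
    if not an:
        return 0.0
    return sum(ch in cfn for ch in an) / len(an)
```

A candidate containing every character of the abbreviation scores 1.0, and one sharing no characters scores 0.0, so ranking by this value prefers candidates that account for more of the abbreviation.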
Constraint function 2: the character order of Fn and An.
In the abbreviating process, most abbreviations keep the character order of the full name. For example, with An = "Olympic Games" and Fn = "Olympic Games", the three characters of An occur strictly in the order in which they appear in Fn.
The formal definition and calculation of constraint function 2 are as follows (note: this function is the same as the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
[Formula image not recovered: definition of constraint function 2]
Note: Fn and An having the same character order implies that all characters of An appear in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0.
Constraint function 3: the segment coverage rate of An over Fn
A full name usually consists of several word segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments generally do not exceed half of the segments of the full name. The more segments of a candidate full name are covered by the abbreviation, the more likely the candidate is to be a full name.
The formal definition and calculation of constraint function 3 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
[Formula image not recovered: definition of constraint function 3]
For example, An = "Beijing University", Cfn1 = "Beijing/university", Cfn2 = "Beijing/traffic/university". According to constraint function 3, the priority of Cfn1 is higher than that of Cfn2.
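The coverage idea can be sketched as the fraction of segments covered, N_Cover / N_Clas. Two assumptions are made here and are not confirmed by the text: a segment counts as covered when it shares at least one character with the abbreviation, and the plain fraction stands in for the patent's improved formula, which is not reproduced. The function names are hypothetical.

```python
from typing import List, Sequence


def covered_segments(segments: Sequence[str], an: str) -> List[str]:
    """Segments sharing at least one character with An (assumed coverage rule)."""
    return [seg for seg in segments if any(ch in an for ch in seg)]


def coverage_rate(segments: Sequence[str], an: str) -> float:
    """N_Cover / N_Clas for a segmented candidate full name."""
    if not segments:
        return 0.0
    return len(covered_segments(segments, an)) / len(segments)
```

Under these assumptions a two-segment candidate with both segments covered scores 1.0, while a three-segment candidate with one uncovered middle segment scores 2/3, which matches the ordering in the example above.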
Constraint function 4: the centre of gravity of the segments of Fn covered by An
A full name usually consists of several word segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments should be evenly distributed across the full name rather than concentrated in its front or rear part. For example, with An = "your boat group" and Fn = "China/Guizhou/aviation/industry/group/company", the segments omitted from Fn, "China", "industry", and "company", lie in the front, middle, and rear parts of Fn respectively.
The formal definition and calculation of constraint function 4 are as follows:
[Formula image not recovered: definition of constraint function 4]
For example, An = "mountain is large", Cfn1 = "Shandong/university", Cfn2 = "Shandong/university/Weihai/branch school". In Cfn1 the segments covered by An, "Shandong" and "university", are evenly distributed, whereas in Cfn2 the covered segments "Shandong" and "university" both lie in the first half. According to constraint function 4,
[Formula image not recovered: values of constraint function 4 for Cfn1 and Cfn2]
so the priority of Cfn1 is higher than that of Cfn2.
Constraint function 5: the largest number of consecutive segments of Fn not covered by An
A candidate full name usually consists of several word segments. In some cases one or more segments are omitted in the abbreviation, but the omitted segments usually are not consecutive in the full name; that is, consecutive segments of the full name are relatively unlikely to all be omitted in the abbreviation.
The formal definition and calculation of constraint function 5 are as follows:
[Formula image not recovered: definition of constraint function 5]
where N denotes the number of uncovered segment strings in Fn.
For example, An = "Communist Youth League", Cfn1 = "common property/doctrine/Communist Youth League", Cfn2 = "China/people/republic/common property/doctrine/Communist Youth League". In Cfn1 the only segment not covered by An is "doctrine", whereas in Cfn2 the uncovered segments "China", "people", and "republic" are adjacent. According to constraint function 5,
[Formula image not recovered: values of constraint function 5 for Cfn1 and Cfn2]
so the priority of Cfn1 is higher than that of Cfn2.
Constraint function 6: the length relation between Fn and An
A well-formed abbreviation is usually not shortened excessively, so that most people can still infer its meaning from the name. The full name corresponding to most abbreviations therefore has a length within a certain range, generally 1.5 to 5 times the length of the abbreviation; a full name whose length falls outside this range is less likely.
The formal definition and calculation of constraint function 6 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
[Formula image not recovered: definition of constraint function 6]
For example, An = "Computer Department of the Chinese Academy of Science", Cfn1 = "Inst. of Computing Techn. Academia Sinica", Cfn2 = "Inst. of Computing Techn. Academia Sinica's residential building". According to constraint function 6,
[Formula image not recovered: values of constraint function 6 for Cfn1 and Cfn2]
so the priority of Cfn1 is higher than that of Cfn2.
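The length relation can be checked with a simple ratio test. The sketch below only encodes the 1.5 to 5 range stated in the text; the function names are hypothetical, treating both bounds as inclusive is an assumption, and the patent's scoring formula itself is not reproduced here.

```python
def length_ratio(fn: str, an: str) -> float:
    """length(Fn) / length(An), counting characters."""
    if not an:
        raise ValueError("abbreviation must be non-empty")
    return len(fn) / len(an)


def in_typical_range(fn: str, an: str, low: float = 1.5, high: float = 5.0) -> bool:
    """True if the full name is 1.5 to 5 times as long as the abbreviation.

    Inclusive bounds are an assumption; the text gives only the range.
    """
    return low <= length_ratio(fn, an) <= high
```

A four-character full name with a two-character abbreviation has ratio 2.0 and falls inside the range, whereas a candidate six times as long as the abbreviation falls outside it.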
Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An)
When Google is searched for the full name of an abbreviation, a candidate full name that occurs more frequently in GoogleArchSet(An) has higher priority.
The formal definition and calculation of constraint function 7 are as follows:
[Formula image not recovered: definition of constraint function 7]
For example, An = "lithium battery", Cfn1 = "lithium ion battery", Cfn2 = "lithium-ion-power cell", Freq(Cfn1) = 42, Freq(Cfn2) = 12. According to constraint function 7,
[Formula image not recovered: values of constraint function 7 for Cfn1 and Cfn2]
so the priority of Cfn1 is higher than that of Cfn2.
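The frequency count and the ascending frequency rank (the loca symbol) can be sketched with a counter. This is illustrative only: the function names are hypothetical, breaking ties by string order is an assumption, and the patent's scoring formula for constraint function 7 is not reproduced here.

```python
from collections import Counter
from typing import Iterable


def freq_table(extracted: Iterable[str]) -> Counter:
    """Count how often each candidate full name was extracted from the anchor texts."""
    return Counter(extracted)


def loca(cfn: str, freqs: Counter) -> int:
    """1-based position of cfn after sorting candidates by ascending frequency.

    Ties are broken by string order, which is an assumption of this sketch.
    """
    ordered = sorted(freqs, key=lambda c: (freqs[c], c))
    return ordered.index(cfn) + 1
```

With the counts from the example above (42 and 12), the candidate seen 12 times has rank 1 and the one seen 42 times has rank 2, so a higher rank corresponds to a more frequent candidate.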
When Fn is looked up from An, several candidate full names are sometimes obtained; they form the candidate full name set Set_CFN. For any candidate full name Cfn_i in Set_CFN, the analysis of FA(Cfn_i, An) can take into account the corresponding values of the other candidate full names in Set_CFN.
The following four constraint functions are defined on the candidate full name set.
Constraint function 8: the relative proportion of the characters of An that come from Cfn
Compared with constraint function 1, constraint function 8 emphasizes the relativity of a candidate full name within Set_CFN. For example, the abbreviation of some transliterated foreign words shares no characters with the full name, and some abbreviations undergo synonym substitution when restored to the full name.
The formal definition and calculation of constraint function 8 are as follows:
[Formula image not recovered: definition of constraint function 8]
For example, An = "acquired immune deficiency syndrome (AIDS)", Cfn1 = "aids", Cfn2 = "acquired immunodeficiency syndrome". Although An and Cfn1 have no characters in common, An and Cfn2 have none in common either, so Cfn1 cannot be judged not to be a full name merely because its value for constraint function 1 is 0.
Constraint function 9: the relative coverage rate of Fn within the candidate full name set
Compared with constraint function 3, constraint function 9 emphasizes the relativity of a candidate full name within Set_CFN. For example, when an abbreviation covers none of its candidate full names well, the candidate full name with the relatively higher coverage rate has higher priority.
The formal definition and calculation of constraint function 9 are as follows:
[Formula image not recovered: definition of constraint function 9]
For example, An = "Tsing Hua Tong Fang", Cfn1 = "Tsing-Hua University/with side/share/limited/company", Cfn2 = "Tsing-Hua University/with side/CD/share/limited/company". Although the segment coverage rate of An over both Cfn1 and Cfn2 is not high, its coverage rate over Cfn1 is relatively higher, so Cfn1 has higher priority than Cfn2.
Constraint function 10: the frequency of Fn within the candidate full name set
When Fn is looked up from An, the frequencies of all candidates in the candidate full name set are sometimes very low, which weakens the constraining effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate full name: within the candidate full name set, a candidate with relatively higher frequency has higher priority.
The formal definition and calculation of constraint function 10 are as follows:
[Formula image not recovered: definition of constraint function 10]
For example, An = "eel connection", Cfn1 = "world's eel vegetative propagation joint conference", Cfn2 = "Shantou eel community of stock part company limited", Freq(Cfn1) = 3, Freq(Cfn2) = 1. Although by constraint function 7 the frequencies of Cfn1 and Cfn2 are both low, by constraint function 10 their frequencies within the candidate full name set are both relatively high.
Constraint function 11: the relative position of Fn after the elements of the candidate full name set are sorted in ascending order of frequency
When the candidate full name set has many elements, candidates with lower frequency are relatively less important.
The formal definition and calculation of constraint function 11 are as follows:
[Formula image not recovered: definition of constraint function 11]
The lower the value of constraint function 11, the less important the candidate full name.
The 11 constraint functions above quantitatively express the constraints between Fn (or Cfn) and An, whereas the constraint axioms express those constraints qualitatively. The constraint axioms are described below:
Constraint axiom 1: length inequality axiom
Formal representation:
[Formula image not recovered: constraint axiom 1]
Intuitive meaning: in a full name-abbreviation relation, the number of characters of Fn must be greater than that of An.
Constraint axiom 2: declarative mood axiom
Formal representation:
[Formula image not recovered: constraint axiom 2]
Intuitive meaning: neither Fn nor An contains interrogative words such as "what" or "how".
Constraint axiom 3: no form repetition axiom
Formal representation:
[Formula image not recovered: constraint axiom 3]
Intuitive meaning: in a full name-abbreviation relation, neither Fn nor An may be a Chinese character string of the form ss, where s is a Chinese character string.
For example, An = "Hainan Island", Cfn = "Jade Flowery Islet Jade Flowery Islet". Cfn has the form ss with s = "Jade Flowery Islet", so Cfn should be corrected to s. This occurs because no punctuation separates the two occurrences of "Jade Flowery Islet" in the corpus.
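Constraint axioms 1 to 3 are stated precisely enough in their intuitive meanings to sketch as checks. The Python below is an illustration, not the patent's formalization: the function names are hypothetical, and the interrogative list holds a few Chinese words chosen here as examples, since the source names only "what", "how", and similar words.

```python
from typing import Sequence

# Illustrative interrogatives only; the actual Interrogative set is not enumerated in the text.
INTERROGATIVES = ("什么", "怎么", "如何")


def is_ss_form(s: str) -> bool:
    """True if s is some non-empty string repeated exactly twice."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and half > 0 and s[:half] == s[half:]


def repair_ss_form(s: str) -> str:
    """Correct an ss-form candidate to s, as in the axiom 3 example."""
    return s[: len(s) // 2] if is_ss_form(s) else s


def satisfies_axioms(fn: str, an: str, interrogatives: Sequence[str] = INTERROGATIVES) -> bool:
    """Check constraint axioms 1 to 3 for a (full name, abbreviation) pair."""
    if len(fn) <= len(an):
        return False  # axiom 1: Fn must be longer than An
    if any(w in fn or w in an for w in interrogatives):
        return False  # axiom 2: no interrogative words in either string
    if is_ss_form(fn) or is_ss_form(an):
        return False  # axiom 3: neither string may have the form ss
    return True
```

A candidate that fails only axiom 3 can be passed through `repair_ss_form` and checked again, which mirrors the correction to s described in the example above.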
Constraint axiom 4: no semantic repetition axiom
Formal representation:
[Formula image not recovered: constraint axiom 4]
Intuitive meaning: Fn must not repeat itself semantically.
For example, An = "Hainan Island", Cfn = "Jade Flowery Islet Hainan Island". Cfn has the form s1s2 with s1 = "Jade Flowery Islet" and s2 = "Hainan Island", so Cfn is incorrect. This occurs because no punctuation separates s1 and s2 in the corpus.
Constraint axiom 5: full name-abbreviation equivalence axiom
Formal representation:
[Formula image not recovered: constraint axiom 5]
Intuitive meaning: in a full name-abbreviation relation, Fn must be in the full name set of An, and An must be in the abbreviation set of Fn.
Constraint axiom 5 is not used to verify full name-abbreviation relations; it is used to expand the full name-abbreviation relation knowledge base.
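One way to apply constraint axiom 5 when expanding a knowledge base is to store every verified pair in both directions. The class below is a hypothetical sketch; the patent does not describe the structure of its knowledge base, and the class and attribute names are chosen here for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class FullAbbrKnowledgeBase:
    """Keeps FnSet(An) and AnSet(Fn) consistent with each other."""

    def __init__(self) -> None:
        self.fn_set: DefaultDict[str, Set[str]] = defaultdict(set)  # An -> full names
        self.an_set: DefaultDict[str, Set[str]] = defaultdict(set)  # Fn -> abbreviations

    def add(self, fn: str, an: str) -> None:
        """Record FA(fn, an) in both indexes, as axiom 5 requires."""
        self.fn_set[an].add(fn)
        self.an_set[fn].add(an)

    def holds(self, fn: str, an: str) -> bool:
        """True if the pair is recorded; .get avoids creating empty entries."""
        return fn in self.fn_set.get(an, set()) and an in self.an_set.get(fn, set())
```

Because `add` always updates both maps, a full name found for an abbreviation immediately becomes available when looking up the abbreviations of that full name.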
In that the full abbreviation relation constraint of the present invention's definition has been done on the basis that describes in detail, with reference to figure 1, specifically introduce the embodiment of the inventive method.
Method according to Chinese abbreviation identification Chinese full name of the present invention comprises two large steps, is respectively to produce candidate's full name collection and candidate's full name collection is carried out aftertreatment, and the below describes them respectively.Because utilize the method for query pattern 1 and query pattern 2 generation candidate full name collection different, so separate introduction.
As shown in Fig. 2, the candidate full-name set is generated with query pattern 1 as follows:
Step 1-1: the user inputs a known Chinese abbreviation An.
Step 1-2: a concrete query term is constructed according to query pattern 1: "abbreviated as An".
Step 1-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus.
Step 1-4: using regular expressions, the full-name/abbreviation sentences containing the query term are obtained from the anchor corpus and saved as the full-name/abbreviation corpus.
Full-name/abbreviation sentences fall into three main types: the mark-pair type, the no-suffix type and the suffixed type. Mark-pair type: An is not followed by a Chinese character and Cfn is enclosed in paired punctuation marks, so the boundaries of Cfn need not be determined and Cfn is extracted directly. No-suffix type: An is not followed by a Chinese character and Cfn is not enclosed in paired marks, so the left boundary of Cfn must be determined. Suffixed type: An is followed by a Chinese character, which shows that An is the front part of another abbreviation "An*", so Cfn is likewise the front part of the full name "Cfn*" corresponding to "An*"; both the left and right boundaries of Cfn must therefore be determined.
Step 1-5: the benchmark candidate full-name set is extracted with algorithm FCFNEA.
Algorithm for extracting the benchmark candidate full-name set (formal candidate fullname extract algorithm, FCFNEA)
Input: the set of mark-pair-type full-name/abbreviation sentences, the set of no-suffix-type full-name/abbreviation sentences, and the set of suffixed-type full-name/abbreviation sentences
Output: the benchmark candidate full-name set
Step1: for each sentence in the mark-pair-type set, extract the entry enclosed by the mark pair, add it to the benchmark candidate full-name set, and count its frequency;
Step2: for each sentence in the no-suffix-type set and each entry in the benchmark candidate full-name set, if the entry is contained in the sentence, increase the frequency of the entry by 1 and delete the sentence from the no-suffix-type set;
Step3: for each sentence in the suffixed-type set and each entry in the benchmark candidate full-name set, if the entry is contained in the sentence, increase the frequency of the entry by 1;
Step4: for each entry in the benchmark candidate full-name set, segment it with ICTCLAS, form a pair from its first token and its last token, and add the pair to the set of prefix-suffix pairs;
Step5: for each sentence in the no-suffix-type set and each pair (pre, suf) in the set of prefix-suffix pairs, if the sentence contains an entry whose prefix is pre and whose suffix is suf, add that entry to a temporary candidate set and delete the sentence from the no-suffix-type set; then use the prioritization strategy PSCF to obtain the best candidate of the temporary candidate set and add it to the benchmark candidate full-name set;
Step6: return the benchmark candidate full-name set.
The prioritization strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:
Prioritization strategy (priority sort comparison function, PSCF)
[Formula images: two ordering relations between candidates, each holding if and only if its conditions 1) and 2) are satisfied]
If [formula image] holds, Cfn_k is called the best candidate in [formula image].
Step 1-6: the non-benchmark candidate full-name set is extracted with algorithm ICFNEA.
Algorithm for extracting non-benchmark candidate full names (informal candidate fullname extract algorithm, ICFNEA)
Input: the phrase or short sentence to extract from, Co-referent; the known concept word Inputitem = {C1 C2 … Cn};
Output: the extracted full-name candidate, Candidate;
Step1: segment Co-referent and tag parts of speech; the segmentation result is {P1 P2 … Pm}
Step2: define the position variables left_flag ← k, left ← 1
Step3: for each Ci ∈ {Cn Cn-1 …… C1}
    for each Pj ∈ {Pleft_flag Pleft_flag-1 …… P1}
        if Ci appears in Pj
            then left_flag ← j
            break;
        end if
    end for each
end for each
Step4: for each Pk ∈ {P1 P2 …… Pm}
    if the part of speech of Pk ∈ {conjunction, preposition, auxiliary word, verb, measure word, punctuation mark} and k < left_flag
        then left ← k+1;
    end if
end for each
Step5: return Candidate ← {Pleft …… Pm};
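The left-boundary search of ICFNEA can be illustrated with a short Python sketch. It assumes the sentence has already been segmented and POS-tagged (the source uses ICTCLAS, which is not called here), that the scan starts from the last token, and that the lost comparison in Step4 is "k < left_flag". The tag names in `BOUNDARY_POS` and the function name `icfnea` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical POS labels standing for: conjunction, preposition,
# auxiliary word, verb, measure word, punctuation mark.
BOUNDARY_POS = {"conj", "prep", "aux", "verb", "measure", "punct"}


def icfnea(tagged: List[Tuple[str, str]], abbreviation: str) -> List[str]:
    """Return the tokens from the re-determined left boundary to the end."""
    if not tagged:
        return []
    left_flag = len(tagged) - 1  # assumption: start scanning at the last token
    # Step3: match abbreviation characters right to left against tokens right to left.
    for ch in reversed(abbreviation):
        for j in range(left_flag, -1, -1):
            if ch in tagged[j][0]:
                left_flag = j
                break
    # Step4: the boundary moves just past the last function word before left_flag.
    left = 0
    for k, (_, pos) in enumerate(tagged):
        if pos in BOUNDARY_POS and k < left_flag:
            left = k + 1
    return [token for token, _ in tagged[left:]]


sentence = [("他", "pron"), ("在", "prep"), ("北京", "noun"), ("大学", "noun")]
print(icfnea(sentence, "北大"))  # ['北京', '大学']
```

Indices are 0-based here, whereas the pseudocode counts tokens from 1.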
Step 1-7: the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined by analogy.
The analogy is carried out by method 1 and method 2 below.
Method 1: formal representation:
[Formula images]
Intuitive meaning of method 1: for any two candidates Cfn_i and Cfn_j in the candidate full-name set, if the following preconditions hold simultaneously:
1) all Chinese characters of An appear in Cfn_i;
2) Cfn_i is a proper substring of Cfn_j;
3) the frequency of Cfn_i [comparison operator missing in the source] 2, or the frequency of Cfn_j < 10;
4) the prefix of Cfn_j relative to Cfn_i is not a prefix of any other candidate in the candidate full-name set;
then the frequency of Cfn_i is changed to the sum of the frequencies of Cfn_i and Cfn_j, and Cfn_j is deleted from the candidate full-name set.
Method 2: formal representation:
[Formula images]
Intuitive meaning of method 2: for any two candidates Cfn_i and Cfn_j in the candidate full-name set, if the following preconditions hold simultaneously:
1) the frequency of Cfn_i [comparison operator image] 10;
2) the frequency of Cfn_i [comparison operator image] 5 times the frequency of Cfn_j;
3) Cfn_i is a proper substring of Cfn_j;
4) the number of characters of An contained in Cfn_i equals the number of characters of An contained in Cfn_j;
then the frequency of Cfn_i is changed to the sum of the frequencies of Cfn_i and Cfn_j, and Cfn_j is deleted from the candidate full-name set.
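The merging rule of method 2 might be sketched as follows. The comparison operators in the first two preconditions survive only as images in the source, so the sketch assumes "greater than or equal to" for both; the names `merge_method2` and `covered` and the sample frequencies are hypothetical.

```python
from typing import Dict


def covered(an: str, candidate: str) -> int:
    """Number of characters of the abbreviation that occur in the candidate."""
    return sum(1 for ch in an if ch in candidate)


def merge_method2(freq: Dict[str, int], an: str) -> Dict[str, int]:
    """Fold a longer candidate into a frequent candidate that is its proper substring."""
    merged = dict(freq)
    for short in sorted(freq, key=len):
        if short not in merged:
            continue  # already absorbed by a shorter candidate
        for long in [c for c in merged if c != short]:
            if (merged[short] >= 10                        # assumed operator: >=
                    and merged[short] >= 5 * merged[long]  # assumed operator: >=
                    and short in long                      # proper substring, since short != long
                    and covered(an, short) == covered(an, long)):
                merged[short] += merged[long]
                del merged[long]
    return merged


counts = {"北京大学": 50, "在北京大学": 4, "北京大学医学部": 30}
print(merge_method2(counts, "北大"))  # {'北京大学': 54, '北京大学医学部': 30}
```

In this made-up example the rare, over-long candidate is absorbed, while the longer candidate with a comparable frequency is kept as a separate entry.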
Step 1-8: the left boundary vocabulary LBV and the right boundary vocabulary RBV are read in, and the left and right boundaries of the candidate full names in the non-benchmark candidate full-name set are re-determined with LBV and RBV. The algorithm is as follows:
Algorithm for re-determining the left and right boundaries with LBV and RBV (RDLRB):
Input: candidate full name Cfn, abbreviation An, left boundary vocabulary LBV, right boundary vocabulary RBV
Output: the candidate full name CFN after its left and right boundaries have been re-determined
Step1: segment Cfn with ICTCLAS; the result is Cfn_clas = P1 P2 … Pn
Step2: determine the tokens Pi and Pj of Cfn that correspond to the first and the last character of An respectively
Step3: define the left boundary of Cfn, left ← 1;
    for each Pk ∈ {Pi-1 …… P1}
        if Pk is in the left boundary vocabulary
            left ← k+1;
            break;
        end if
    end for each
Step4: define the right boundary of Cfn, right ← n;
    for each Pk ∈ {Pj+1 …… Pn}
        if Pk is in the right boundary vocabulary
            right ← k-1;
            break;
        end if
    end for each
Step5: CFN ← {Pleft …… Pright};
    return CFN;
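A minimal Python sketch of RDLRB follows. It assumes the candidate is already segmented, that Pi is the first token containing the first character of An and Pj is the last token containing its last character, and that the candidate is returned unchanged when either token is missing. The function name and the vocabularies in the example are hypothetical.

```python
from typing import List, Set


def rdlrb(tokens: List[str], an: str, lbv: Set[str], rbv: Set[str]) -> List[str]:
    """Trim a segmented candidate at the nearest left and right boundary words."""
    n = len(tokens)
    # Step2: locate the tokens holding the first and last abbreviation characters.
    i = next((k for k in range(n) if an[0] in tokens[k]), None)
    j = next((k for k in range(n - 1, -1, -1) if an[-1] in tokens[k]), None)
    if i is None or j is None:
        return list(tokens)  # assumption: leave the candidate untouched
    # Step3: scan leftwards from the token before Pi.
    left = 0
    for k in range(i - 1, -1, -1):
        if tokens[k] in lbv:
            left = k + 1
            break
    # Step4: scan rightwards from the token after Pj.
    right = n - 1
    for k in range(j + 1, n):
        if tokens[k] in rbv:
            right = k - 1
            break
    return tokens[left:right + 1]


print(rdlrb(["位于", "北京", "大学", "附近"], "北大", {"位于"}, {"附近"}))  # ['北京', '大学']
```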
Step 1-9: the nearest prefix words and nearest suffix words that exist are extracted from the benchmark full-name set and added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively. The nearest left-part words and nearest right-part words that exist are extracted from the non-benchmark full-name set and likewise added to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively.
The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step 1-9 are defined and generated as follows:
Definition 1: for Cfni and Cfnj satisfying the conditions of method 1 or method 2 above, write Cfnj = left + Cfni + right, where left (if not empty) is called the left part of Cfnj and right (if not empty) is called the right part of Cfnj. left and right are each segmented with ICTCLAS; the last token of left is called the nearest left-part word of Cfnj, and the first token of right is called the nearest right-part word of Cfnj.
Definition 2: for each candidate full name Cfnk in the benchmark candidate full-name set, if every character of An appears in Cfnk, Cfnk is segmented with ICTCLAS into a token sequence. Let the tokens Fi and Fj be the tokens of Cfnk corresponding to the first and the last character of An respectively. The token sequence before Fi is called the prefix of Cfnk, and Fi-1 is called the nearest prefix word of Cfnk; the token sequence after Fj is called the suffix of Cfnk, and Fj+1 is called the nearest suffix word of Cfnk.
Definition 3: dubious left boundary vocabulary (DLBV):
Formal definition:
map<key, map_value> dubious_left_boundary;
key: string prefix: a nearest left-part word or nearest prefix word
map_value: int qu: frequency of prefix as a nearest left-part word
int liu: frequency of prefix as a nearest prefix word
bool flag: whether prefix needs manual verification
Definition 4: dubious right boundary vocabulary (DRBV):
map<key, map_value> dubious_right_boundary;
key: string suffix: a nearest right-part word or nearest suffix word
map_value: int qu: frequency of suffix as a nearest right-part word
int liu: frequency of suffix as a nearest suffix word
bool flag: whether suffix needs manual verification
Step 1-10: the dubious left boundary vocabulary and the dubious right boundary vocabulary are verified manually to generate the left boundary vocabulary and the right boundary vocabulary.
The dubious left boundary vocabulary is verified manually as follows:
[Formula image]
If the following hold:
1) prefix is marked as needing manual verification;
2) the frequency of prefix as a nearest left-part word > 2;
3) the frequency of prefix as a nearest prefix word < 2;
4) the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word;
then prefix is verified manually to decide whether it is a left boundary word; if it is, it is added to the left boundary vocabulary.
Definition 5: left boundary vocabulary (LBV):
Formal definition:
map<key, map_value> left_boundary;
key: string prefix: a left boundary word
map_value: int num: the number of Cfn whose left boundary was determined with prefix
The dubious right boundary vocabulary is verified manually as follows:
[Formula image]
If the following hold:
1) suffix is marked as needing manual verification;
2) the frequency of suffix as a nearest right-part word > 2;
3) the frequency of suffix as a nearest suffix word < 2;
4) the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word;
then suffix is verified manually to decide whether it is a right boundary word; if it is, it is added to the right boundary vocabulary.
Definition 6: right boundary vocabulary (RBV):
map<key, map_value> right_boundary_cfn;
key: string suffix: a right boundary word
map_value: int num: the number of Cfn whose right boundary was determined with suffix
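The selection of words for manual verification in step 1-10 can be sketched in Python. The fields `qu`, `liu` and `flag` follow definitions 3 and 4; reading the four conditions as a conjunction and reading condition 1 as "the flag is set" are assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BoundaryStats:
    qu: int     # frequency as nearest left-part (or right-part) word
    liu: int    # frequency as nearest prefix (or suffix) word
    flag: bool  # True while the word still needs manual verification


def review_queue(vocab: Dict[str, BoundaryStats], ratio: int) -> List[str]:
    """Words to hand to a human reviewer; ratio is 5 for DLBV and 9 for DRBV."""
    return [
        word
        for word, s in vocab.items()
        if s.flag and s.qu > 2 and s.liu < 2 and s.qu > ratio * s.liu
    ]


dubious_left = {
    "位于": BoundaryStats(qu=6, liu=1, flag=True),
    "中国": BoundaryStats(qu=3, liu=1, flag=True),
    "的": BoundaryStats(qu=8, liu=0, flag=False),
}
print(review_queue(dubious_left, ratio=5))  # ['位于']
```

Words confirmed by the reviewer would then be copied into the left or right boundary vocabulary; that manual step is outside the sketch.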
Step 1-11: the benchmark candidate full-name set and the non-benchmark candidate full-name set are merged to generate the candidate full-name set.
We experimented with 4000 Chinese abbreviations An. Query pattern 1 obtained a source corpus for 88.75% of them and query pattern 2 for 24.76%; abbreviations for which query pattern 2 obtained a source corpus but query pattern 1 did not account for only 2.33%. In the present invention query pattern 2 therefore serves only as a supplement to query pattern 1: it is used only when query pattern 1 yields no candidate full name.
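The fallback from query pattern 1 to query pattern 2 amounts to the control flow below. The Chinese query strings are an assumed rendering of the two patterns, and `extract` is a hypothetical callable standing for the whole search-and-extract pipeline; no real search API is invoked.

```python
from typing import Callable, List

# Assumed Chinese renderings of query pattern 1 and query pattern 2.
QUERY_PATTERNS = {1: "简称{an}", 2: "{an}全称"}


def build_query(pattern: int, an: str) -> str:
    """Construct the concrete query term for the given pattern."""
    return QUERY_PATTERNS[pattern].format(an=an)


def candidates_with_fallback(an: str, extract: Callable[[str], List[str]]) -> List[str]:
    """Try pattern 1 first; use pattern 2 only if pattern 1 yields nothing."""
    for pattern in (1, 2):
        found = extract(build_query(pattern, an))
        if found:
            return found
    return []


# Stub pipeline that only answers the pattern-2 query.
stub = lambda query: ["北京大学"] if query == "北大全称" else []
print(candidates_with_fallback("北大", stub))  # ['北京大学']
```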
As shown in Fig. 3, the candidate full-name set is generated with query pattern 2 as follows:
Step 2-1: the user inputs a known Chinese abbreviation An.
Step 2-2: a concrete query term is constructed according to query pattern 2: "An full name".
Step 2-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus.
Step 2-4: by constructing regular expressions, the full-name/abbreviation sentences containing the query term are obtained from the anchor corpus and saved as the full-name/abbreviation corpus.
Step 2-5: the candidate full-name set is extracted with algorithm CFNEA.
Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA)
Input: the prefix Prefix, the known abbreviation Inputitem, the phrase or short sentence to extract from, Co-referent
Output: the extracted full-name candidate, Candidate
Step1: define the flag flag ← 0 and segment Co-referent (the open-source segmenter apparently cannot be used for commercial purposes); the result is {P1 P2 … Pn}
Step2: for each Pi ∈ {P1 P2 …… Pn}
    if flag = 0 and Pi shares a character with Prefix and Pi shares no character with Inputitem
        then flag ← 1;
    end if
    if flag = 1 and Pi shares no character with Prefix
        then break;
    end if
    if Pi shares a character with Inputitem
        then break;
    end if
end for each
Step3: if flag = 0 then i ← 0;
Step4: Candidate ← {Pi …… Pn}
    return Candidate
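The scan performed by CFNEA can be sketched as below, working on tokens that are already segmented. Two points are assumptions because the pseudocode leaves them open: if the loop never breaks, nothing is kept, and "i ← 0" is read as "keep the whole token list". The names `cfnea` and `shares_char` and the example prefix are hypothetical.

```python
from typing import List


def shares_char(a: str, b: str) -> bool:
    """True if the two strings have at least one character in common."""
    return any(ch in b for ch in a)


def cfnea(prefix: str, inputitem: str, tokens: List[str]) -> List[str]:
    """Skip the leading tokens that belong to the prefix and return the rest."""
    flag = 0
    stop = len(tokens)  # assumption: keep nothing if the scan never breaks
    for i, token in enumerate(tokens):
        if flag == 0 and shares_char(token, prefix) and not shares_char(token, inputitem):
            flag = 1
        if flag == 1 and not shares_char(token, prefix):
            stop = i
            break
        if shares_char(token, inputitem):
            stop = i
            break
    if flag == 0:
        stop = 0  # no prefix token seen: keep the whole sentence
    return tokens[stop:]


print(cfnea("全称是", "北大", ["全称", "是", "北京", "大学"]))  # ['北京', '大学']
```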
The candidate full-name set obtained by the above operations is then post-processed to produce the final result. Post-processing comprises verifying, classifying and sorting the candidate full names; with reference to Fig. 4, it is carried out as follows:
Step C-1: each candidate full name in the candidate full-name set is verified with constraint axioms 1-4 of the constraint axiom set.
Step C-2: a decision tree (see Fig. 5) is generated from the constraint function set and used to classify the candidate full names in the candidate full-name set; candidates classified as "N" are removed, and candidates classified as "Y" are retained to form the full-name set.
In Fig. 5, "N1" denotes a low-frequency variant-character error, "N2" a high-frequency variant-character error, "N3" a low-frequency variant-order error, and "Y" a correct full name.
Step C-3: the full-name set is classified on the basis of the constraint function set.
In the present invention, full names are divided into the ordinary type, the variant-character type and the variant-order type according to whether they contain variant characters or a changed character order. The ordinary type is further divided by context dependence into the strong context-independent type, the weak context-independent type and the context-dependent type. The context-independent types are further divided into high-frequency and low-frequency types according to the relative frequency of FN in the full-name set, and the context-dependent type is divided into forward, centered and backward types according to the center of gravity of the coverage of FN by An (see Table 1).
Table 1. Types of full name
[Table image]
The concrete classification criteria and the conditions that each class of full name must satisfy are given in Table 2.
Table 2. Classification criteria for full names
([?] marks an operator that appears only as an image in the original.)
Class | Conditions to satisfy
High-frequency strong context-independent | f1=1 ∧ f2=1 ∧ f3=1 ∧ f11=1
Low-frequency strong context-independent | f1=1 ∧ f2=1 ∧ f3=1 ∧ f11<1
High-frequency weak context-independent | f1=1 ∧ f2=1 ∧ 0.823 [?] f3<1 ∧ f9=1 ∧ f11=1
Low-frequency weak context-independent | f1=1 ∧ f2=1 ∧ 0.823 [?] f3<1 ∧ f9=1 ∧ f11<1
Forward context-dependent | f1=1 ∧ f2=1 ∧ f3 [?] 1 ∧ f4 [?] 0.5
Centered context-dependent | f1=1 ∧ f2=1 [?] 0.5 [?] f4 [?] 0.5 (f3 [?] 0.823 [?] f9 [?] 1)
Backward context-dependent | f1=1 ∧ f2=1 ∧ f3 [?] 1 ∧ f4 [?] 0.5
Variant-order type | f1=1 ∧ f2=0 ∧ f11=1
Variant-character type | f1 [?] 1 [?] f7 [?] f10 [?] f7 [?] 0.05 ∧ f9=1 ∧ f11=1))
Note that because context is a semantic-level concept, it is difficult for a computer to judge intelligently whether an FN is context-dependent; in the present invention the judgement is approximated with constraint functions at the level of word-formation rules.
In Table 2, the intuitive meaning of high-frequency strong context-independent is: FN contains all characters of An in the same order, every token of FN has a counterpart in An, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of low-frequency strong context-independent is: FN contains all characters of An in the same order, every token of FN has a counterpart in An, and FN does not have the highest frequency in the full-name set.
In Table 2, the intuitive meaning of high-frequency weak context-independent is: FN contains all characters of An in the same order, most tokens of FN have a counterpart in An, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of low-frequency weak context-independent is: FN contains all characters of An in the same order, most tokens of FN have a counterpart in An, and FN does not have the highest frequency in the full-name set.
In Table 2, the intuitive meaning of forward context-dependent is: FN contains all characters of An in the same order, and the tokens of FN omitted from An lie mostly in the latter half of FN.
In Table 2, the intuitive meaning of centered context-dependent is: FN contains all characters of An in the same order, and roughly equal numbers of tokens are omitted from the front and rear parts of FN.
In Table 2, the intuitive meaning of backward context-dependent is: FN contains all characters of An in the same order, and the tokens of FN omitted from An lie mostly in the first half of FN.
In Table 2, the intuitive meaning of the variant-order type is: FN contains all characters of An but in a changed order, and FN has the highest frequency in the full-name set.
In Table 2, the intuitive meaning of the variant-character type is: FN does not contain all characters of An, but the frequency of FN is very high or its relative frequency in the full-name set is very high.
Step C-4: full names of the same class in the full-name set are sorted according to the comprehensive priority function PRI(Cfn, An).
The comprehensive priority function PRI(Cfn, An) used in step C-4 is defined as follows:
where [formula image], and the weight [formula image] is the weight given to each function in the comprehensive evaluation. The correspondence between F_i and its weight is given in Table 3; the size of each weight was obtained experimentally according to how strongly the function constrains the full-name/abbreviation relation:
Table 3
No. | Function content | Function weight
F1 | Ratio of the characters of An that come from Fn | 0.12
F2 | Character order of Fn and An | 0.08
F3 | Character-to-token coverage of Fn by An | 0.06
F4 | Center of gravity of the token coverage of Fn by An | 0.08
F5 | Longest run of consecutive tokens in Fn not covered by An | 0.04
F6 | Length relation between Fn and An | 0.06
F7 | Frequency of Fn in GoogleArchSet(An) | 0.10
F8 | Relative ratio of the characters of An that come from Cfn | 0.12
F9 | Relative coverage ratio of Fn within the candidate full-name set | 0.10
F10 | Frequency of Fn within the candidate full-name set | 0.12
F11 | Relative position of Fn after the elements of the candidate full-name set are sorted by frequency in ascending order | 0.14
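The formula for PRI survives only as an image, so the sketch below assumes the common reading of a weighted sum of the eleven function values with the weights of Table 3. The names `WEIGHTS`, `pri` and `rank` are hypothetical, and the function values in the example are made up.

```python
from typing import Dict, List

# Weights of F1..F11 as listed in Table 3.
WEIGHTS = {1: 0.12, 2: 0.08, 3: 0.06, 4: 0.08, 5: 0.04, 6: 0.06,
           7: 0.10, 8: 0.12, 9: 0.10, 10: 0.12, 11: 0.14}


def pri(f_values: Dict[int, float]) -> float:
    """Assumed form of PRI: the weighted sum of F1..F11."""
    return sum(weight * f_values[i] for i, weight in WEIGHTS.items())


def rank(candidates: Dict[str, Dict[int, float]]) -> List[str]:
    """Sort candidate full names of one class by descending priority."""
    return sorted(candidates, key=lambda name: pri(candidates[name]), reverse=True)


strong = {i: 1.0 for i in WEIGHTS}
weak = {i: 0.5 for i in WEIGHTS}
print(rank({"weak": weak, "strong": strong}))  # ['strong', 'weak']
```

Note that the listed weights add up to 1.02 rather than 1, so under this reading PRI is not normalized to the unit interval.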
To demonstrate the practical effect of the invention, the method was used in extensive experiments on finding full names for abbreviations from many disciplines. We randomly selected 3910 Chinese abbreviations An from multiple disciplines and used the invention to look up their Fn; the results are shown in Table 4.
Table 4. Experimental results of looking up FN from An
Number of An | Number of An for which Fn was obtained | Percentage of An for which Fn was obtained | Total number of Fn | Accuracy of the retrieved Fn (sampled)
3910 | 3561 | 91.07% | 9305 | 94.77%
From the above experiment we randomly selected 3188 full names and verified them with the decision tree; Table 5 shows the result of the decision-tree verification.
Table 5. Results of the decision tree
[Table image]
The experiments lead to the following conclusion: the invention recognizes Chinese full names well, is widely applicable, and can remedy the shortcomings of previous methods for Chinese full-name recognition.

Claims (10)

1. A method for obtaining Chinese full names from Web pages, characterized by comprising the steps of:
Step 1: inputting a given Chinese abbreviation;
Step 2: selecting a query pattern to construct a query term, submitting the query term to the Google search engine, and saving the first N anchor texts as the anchor corpus;
Step 3: obtaining, by regular expressions, the sentences containing the query-term relation from the anchor corpus and saving them as the full-name/abbreviation corpus;
Step 4: extracting candidate full names from the full-name/abbreviation corpus with the full-name extraction algorithm EFN to form the candidate full-name set;
Step 5: verifying the candidate full-name set on the basis of the full-name/abbreviation relation constraints to form the full-name set;
Step 6: classifying the full-name set on the basis of the full-name/abbreviation relation constraints, thereby forming a full-name set with class labels.
2. The method for obtaining Chinese full names from Web pages according to claim 1, characterized in that: in step 2, if Google returns more than 100 query results, N is 100; otherwise N is the number of query results returned by Google.
3. The method for obtaining Chinese full names from Web pages according to claim 1, characterized in that: in step 2, there are two query patterns, query pattern 1: "abbreviated as An" and query pattern 2: "An full name"; query pattern 1 is selected first and query pattern 2 second.
4. The method for obtaining Chinese full names from Web pages according to claim 1, characterized in that: in step 4, the full-name extraction algorithm EFN comprises two algorithms, CFNEA1 and CFNEA2, corresponding respectively to the two query patterns of step 2; when query pattern 1 is selected in step 2, CFNEA1 is used to extract Fn in step 4, and when query pattern 2 is selected in step 2, CFNEA2 is used to extract Fn in step 4.
5. The method for obtaining Chinese full names from Web pages according to claim 4, characterized in that: when query pattern 1 is selected in step 2, step 4 performs the following steps:
Full-name/abbreviation sentences fall into three main types: the mark-pair type, the no-suffix type and the suffixed type. Mark-pair type: An is not followed by a Chinese character and Cfn is enclosed in paired punctuation marks, so the boundaries of Cfn need not be determined and Cfn is extracted directly. No-suffix type: An is not followed by a Chinese character and Cfn is not enclosed in paired marks, so the left boundary of Cfn must be determined. Suffixed type: An is followed by a Chinese character, which shows that An is the front part of another abbreviation "An*", so Cfn is likewise the front part of the full name "Cfn*" corresponding to "An*"; both the left and right boundaries of Cfn must therefore be determined;
Step A-1: the benchmark candidate full-name set is extracted with algorithm FCFNEA;
Algorithm for extracting the benchmark candidate full-name set (formal candidate fullname extract algorithm, FCFNEA)
Input: the set of mark-pair-type full-name/abbreviation sentences, the set of no-suffix-type full-name/abbreviation sentences, and the set of suffixed-type full-name/abbreviation sentences
Output: the benchmark candidate full-name set
Step1: for each sentence in the mark-pair-type set, extract the entry enclosed by the mark pair, add it to the benchmark candidate full-name set, and count its frequency;
Step2: for each sentence in the no-suffix-type set and each entry in the benchmark candidate full-name set, if the entry is contained in the sentence, increase the frequency of the entry by 1 and delete the sentence from the no-suffix-type set;
Step3: for each sentence in the suffixed-type set and each entry in the benchmark candidate full-name set, if the entry is contained in the sentence, increase the frequency of the entry by 1;
Step4: for each entry in the benchmark candidate full-name set, segment it with ICTCLAS, form a pair from its first token and its last token, and add the pair to the set of prefix-suffix pairs;
Step5: for each sentence in the no-suffix-type set and each pair (pre, suf) in the set of prefix-suffix pairs, if the sentence contains an entry whose prefix is pre and whose suffix is suf, add that entry to a temporary candidate set and delete the sentence from the no-suffix-type set; then use the prioritization strategy PSCF to obtain the best candidate of the temporary candidate set and add it to the benchmark candidate full-name set;
Step6: return the benchmark candidate full-name set.
The prioritization strategy PSCF used in Step5 of algorithm FCFNEA is defined as follows:
Prioritization strategy (priority sort comparison function, PSCF)
[Formula images: two ordering relations between candidates, each holding if and only if its conditions 1) and 2) are satisfied]
If [formula image] holds, Cfn_k is called the best candidate in [formula image] and is denoted [formula image].
Step A-2: the non-benchmark candidate full-name set is extracted with algorithm ICFNEA;
Algorithm for extracting non-benchmark candidate full names (informal candidate fullname extract algorithm, ICFNEA)
Input: the phrase or short sentence to extract from, Co-referent; the known concept word Inputitem = {C1 C2 … Cn};
Output: the extracted full-name candidate, Candidate;
Step1: segment Co-referent and tag parts of speech; the segmentation result is {P1 P2 … Pm}
Step2: define the position variables left_flag ← k, left ← 1
Step3: for each Ci ∈ {Cn Cn-1 …… C1}
    for each Pj ∈ {Pleft_flag Pleft_flag-1 …… P1}
        if Ci appears in Pj
            then left_flag ← j
            break;
        end if
    end for each
end for each
Step4: for each Pk ∈ {P1 P2 …… Pm}
    if the part of speech of Pk ∈ {conjunction, preposition, auxiliary word, verb, measure word, punctuation mark} and k < left_flag
        then left ← k+1;
    end if
end for each
Step5: return Candidate ← {Pleft …… Pm};
Border, the left and right sides decided again in candidate's full name that steps A-3, the method for utilizing analogy are concentrated non-benchmark candidate full name;
The method of analogy is specifically seen following method 1 and method 2;
Form represents:
Figure 715706DEST_PATH_IMAGE033
Figure 444628DEST_PATH_IMAGE034
Figure 992284DEST_PATH_IMAGE035
Figure 83737DEST_PATH_IMAGE036
Figure 855384DEST_PATH_IMAGE037
Intuitive meaning of Method 1: for any two candidates Cfn_i and Cfn_j in the candidate full name set, if all of the following preconditions hold:
every Chinese character of An appears in [symbol missing in the source];
Cfn_i is a proper substring of Cfn_j;
the frequency of Cfn_i [operator missing in the source] 2, or the frequency of Cfn_j < 10;
the prefix of Cfn_j relative to Cfn_i is not a prefix of any other candidate in the candidate full name set;
then the frequency of Cfn_i is changed to the sum of the frequencies of Cfn_i and Cfn_j, and Cfn_j is deleted from the candidate full name set.
Formal representation:
Figure 613310DEST_PATH_IMAGE040
Figure 263734DEST_PATH_IMAGE041
Figure 46883DEST_PATH_IMAGE042
Figure 946705DEST_PATH_IMAGE043
Intuitive meaning of Method 2: for any two candidates Cfn_i and Cfn_j in the candidate full name set, if all of the following preconditions hold:
the frequency of Cfn_i [operator missing in the source] 10;
the frequency of Cfn_i
Figure 950138DEST_PATH_IMAGE045
5 times the frequency;
Cfn_i is a proper substring of Cfn_j;
the number of An's characters contained in Cfn_i equals the number of An's characters contained in Cfn_j;
then the frequency of Cfn_i is changed to the sum of the frequencies of Cfn_i and Cfn_j, and Cfn_j is deleted from the candidate full name set.
Step A-4: read in the left boundary vocabulary LBV and the right boundary vocabulary RBV, and use them to re-determine the left and right boundaries of the candidate full names in the non-benchmark candidate full name set. The algorithm is as follows:
Algorithm for re-determining the left and right boundaries using LBV and RBV (RDLRB):
Input: the candidate full name Cfn, the abbreviation An, the left boundary vocabulary LBV, the right boundary vocabulary RBV
Output: the candidate full name CFN after its left and right boundaries are re-determined
Segment Cfn with ICTCLAS; the result is Cfn_clas = P_1 P_2 … P_n
Determine the segments P_i and P_j of Cfn that contain the first and the last character of An, respectively
Define the left boundary of Cfn: left ← 1;
for each P_k ∈ {P_i-1, …, P_1}
    if P_k is in the left boundary vocabulary
        left ← k+1;
        break;
    end if
end for each
Define the right boundary of Cfn: right ← n;
for each P_k ∈ {P_j+1, …, P_n}
    if P_k is in the right boundary vocabulary
        right ← k-1;
        break;
    end if
end for each
CFN ← {P_left … P_right};
return CFN;
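A minimal Python sketch of the RDLRB listing follows. It assumes the candidate is already segmented (the patent uses ICTCLAS) and that the two vocabularies are plain sets of words. Choosing the first segment that contains An's first character and the last segment that contains An's last character is an assumption of this sketch, and the function name rdlrb is hypothetical.

```python
from typing import List, Set


def rdlrb(segments: List[str], an: str, lbv: Set[str], rbv: Set[str]) -> List[str]:
    """Trim a segmented candidate full name at known boundary words."""
    n = len(segments)
    # 1-based indices of the segments holding An's first and last character.
    # next() raises StopIteration if the character is absent from every segment.
    i = next(k for k in range(1, n + 1) if an[0] in segments[k - 1])
    j = next(k for k in range(n, 0, -1) if an[-1] in segments[k - 1])

    left = 1
    for k in range(i - 1, 0, -1):      # scan leftwards from P_(i-1)
        if segments[k - 1] in lbv:
            left = k + 1
            break

    right = n
    for k in range(j + 1, n + 1):      # scan rightwards from P_(j+1)
        if segments[k - 1] in rbv:
            right = k - 1
            break

    return segments[left - 1:right]


print(rdlrb(["位于", "北京", "大学", "的", "网站"], "北大", {"位于"}, {"的"}))
```

The nearest boundary word on each side wins because both scans stop at the first hit.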
Step A-5: extract the nearest prefix words and nearest suffix words, where they exist, from the benchmark full name set, and add them to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively; extract the nearest left-part words and nearest right-part words, where they exist, from the non-benchmark full name set, and add them to the dubious left boundary vocabulary and the dubious right boundary vocabulary respectively.
The nearest prefix word, nearest suffix word, nearest left-part word, nearest right-part word, dubious left boundary vocabulary and dubious right boundary vocabulary mentioned in step A-5 are defined and generated as follows:
Definition 1: for Cfn_i and Cfn_j that satisfy the conditions of Method 1 and Method 2 above, write Cfn_j = left + Cfn_i + right, where left (if non-empty) is called the left part of Cfn_j and right (if non-empty) is called the right part of Cfn_j. left and right are each segmented with ICTCLAS; the last segment of left is called the nearest left-part word of Cfn_j, and the first segment of right is called the nearest right-part word of Cfn_j.
Definition 2: for each candidate full name Cfn_k in the benchmark candidate full name set, if every character of An appears in Cfn_k, then Cfn_k is segmented with ICTCLAS, giving
Figure 913415DEST_PATH_IMAGE046
. Let F_i and F_j be the segments of Cfn_k that contain the first and the last character of An, respectively. Then
Figure 496843DEST_PATH_IMAGE047
is called the prefix of Cfn_k and F_i-1 is called the nearest prefix word of Cfn_k; its counterpart on the right side is called the suffix of Cfn_k, and F_j+1 is called the nearest suffix word of Cfn_k.
Definition 3: dubious left boundary vocabulary (DLBV)
Formal definition:
map<key, map_value> dubious_left_boundary;
key: string prefix: a nearest left-part word or a nearest prefix word
map_value: int qu: the frequency of prefix as a nearest left-part word
    int liu: the frequency of prefix as a nearest prefix word
    bool flag: whether prefix needs manual verification
Definition 4: dubious right boundary vocabulary (DRBV)
map<key, map_value> dubious_right_boundary;
key: string suffix: a nearest right-part word or a nearest suffix word
map_value: int qu: the frequency of suffix as a nearest right-part word
    int liu: the frequency of suffix as a nearest suffix word
    bool flag: whether suffix needs manual verification
Step A-6: manually verify the dubious left boundary vocabulary and the dubious right boundary vocabulary to generate the left boundary vocabulary and the right boundary vocabulary.
The dubious left boundary vocabulary is manually verified as follows:
Figure 794149DEST_PATH_IMAGE049
If all of the following hold:
prefix is marked as requiring manual verification;
the frequency of prefix as a nearest left-part word > 2;
the frequency of prefix as a nearest prefix word < 2;
the frequency of prefix as a nearest left-part word > 5 × the frequency of prefix as a nearest prefix word;
then prefix is manually verified to decide whether it is a left boundary word; if it is, it is added to the left boundary vocabulary.
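The pre-filter that decides which dubious words are handed to a human reviewer can be sketched as below. The field names qu, liu and flag follow Definition 3; the class name DubiousEntry, the function name words_to_review and the ratio parameter are hypothetical, and reading the first condition as "flag is set" is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DubiousEntry:
    qu: int     # frequency as a nearest left-part (or right-part) word
    liu: int    # frequency as a nearest prefix (or suffix) word
    flag: bool  # True while the word still needs manual verification


def words_to_review(vocab: Dict[str, DubiousEntry], ratio: int = 5) -> List[str]:
    """Return the words that meet all four conditions and so go to a reviewer."""
    return [
        word
        for word, e in vocab.items()
        if e.flag and e.qu > 2 and e.liu < 2 and e.qu > ratio * e.liu
    ]


dlbv = {
    "位于": DubiousEntry(qu=6, liu=1, flag=True),
    "中国": DubiousEntry(qu=3, liu=1, flag=True),
    "关于": DubiousEntry(qu=8, liu=0, flag=False),
}
print(words_to_review(dlbv))
```

The same function covers the dubious right boundary vocabulary by passing ratio=9, the multiplier the text gives for suffixes.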
Definition 5: left boundary vocabulary (LBV)
Formal definition:
map<key, map_value> left_boundary;
key: string prefix: a left boundary word
map_value: int num: the number of Cfn whose left boundary was determined using prefix
The dubious right boundary vocabulary is manually verified as follows:
If all of the following hold:
suffix is marked as requiring manual verification;
the frequency of suffix as a nearest right-part word > 2;
the frequency of suffix as a nearest suffix word < 2;
the frequency of suffix as a nearest right-part word > 9 × the frequency of suffix as a nearest suffix word;
then suffix is manually verified to decide whether it is a right boundary word; if it is, it is added to the right boundary vocabulary.
Definition 6: right boundary vocabulary (RBV)
map<key, map_value> right_boundary_cfn;
key: string suffix: a right boundary word
map_value: int num: the number of Cfn whose right boundary was determined using suffix
Step A-7: merge the benchmark candidate full name set and the non-benchmark candidate full name set to generate the candidate full name set.
Steps A-1 to A-2 form algorithm CFNEA1.
6. The method for acquiring Chinese full names from Web pages according to claim 4, characterized in that: when query mode 2 is selected in step 2, step 4 performs the following steps:
Step B-1: use algorithm CFNEA2 to extract the candidate full name set.
Algorithm for extracting candidate full names (candidate fullname extract algorithm, CFNEA2):
Input: the prefix Prefix, the known abbreviation Inputitem, the phrase or short sentence to be processed Co-referent
Output: the extracted full name candidate, Candidate
Define the marker flag ← 0; segment Co-referent (the open-source segmenter apparently cannot be used for commercial purposes), giving {P_1 P_2 … P_n}
for each P_i ∈ {P_1, P_2, …, P_n}
    if flag = 0 and P_i shares a character with Prefix and P_i shares no character with Inputitem
    then flag ← 1;
    end if
    if flag = 1 and P_i shares no character with Prefix
    then break;
    end if
    if P_i shares a character with Inputitem
    then break;
    end if
end for each
if flag = 0 then i ← 0;
Candidate ← {P_i … P_n}
return Candidate
The candidate full name set is obtained by the above operations.
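A Python sketch of one reading of CFNEA2 is given below. It assumes pre-segmented input and treats "i ← 0" as "start from the first segment". Returning an empty list when the loop ends without a break while flag = 1 is an assumption, because the listing does not cover that case. The names cfnea2 and shares_char are hypothetical.

```python
from typing import List, Optional


def shares_char(a: str, b: str) -> bool:
    """True if strings a and b have at least one character in common."""
    return any(ch in b for ch in a)


def cfnea2(segments: List[str], prefix: str, inputitem: str) -> List[str]:
    flag = 0
    start: Optional[int] = None  # 0-based index where the candidate begins
    for idx, seg in enumerate(segments):
        # Entering the prefix region: overlaps Prefix but not the abbreviation.
        if flag == 0 and shares_char(seg, prefix) and not shares_char(seg, inputitem):
            flag = 1
        # Leaving the prefix region: first segment with no Prefix character.
        if flag == 1 and not shares_char(seg, prefix):
            start = idx
            break
        # A segment that overlaps the abbreviation also ends the scan.
        if shares_char(seg, inputitem):
            start = idx
            break
    if flag == 0:
        start = 0  # no prefix segment was seen: keep the whole string
    if start is None:
        return []  # assumption: nothing follows the prefix region
    return segments[start:]


print(cfnea2(["中国", "北京", "大学", "简介"], "中国", "北大"))
```

With the prefix "中国" the scan skips the prefix segment and keeps everything from the first segment that no longer overlaps it.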
7. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in step 5, if the full name set is empty and step 2 still has an unused query mode, steps 2 to 6 are executed again; if the full name set is empty and step 2 has no alternative query mode, the method exits, indicating that the full name of the given abbreviation cannot be found on the Web.
8. The method for acquiring Chinese full names from Web pages according to claim 1, characterized in that: in step 5, the full name-abbreviation relation constraint is a four-tuple R = (Fn, An, F, A), where Fn is the full name of an object, An is the abbreviation of the object, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy; the constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively.
9. The method for acquiring Chinese full names from Web pages according to claim 8, characterized in that steps 5 and 6 are implemented as follows:
Step C-1: use constraint axioms 1 to 4 of the constraint axiom set to verify each candidate full name in the candidate full name set.
Step C-2: generate a decision tree from the constraint function set and use it to classify the candidate full names in the candidate full name set; candidates classified as "F1", "F2" or "F3" are removed and candidates classified as "T" are kept, which yields the full name set.
"F1" denotes a low-frequency variant-character error, "F2" a high-frequency variant-character error, and "F3" a low-frequency variant-order error; "Y" denotes a correct full name.
Step C-3: classify the full name set on the basis of the constraint function set.
Full names are divided into the ordinary type, the variant-character type and the variant-order type according to whether they contain variant characters or a changed character order. The ordinary type is divided, according to whether it depends on context, into the strongly context-independent type, the weakly context-independent type and the context-dependent type. The context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of Fn in the full name set, and the context-dependent type is divided into the forward type, the centered type and the backward type according to the center of gravity of An's coverage of Fn.
The classification criteria and the conditions each class of full name must satisfy are:
Intuitive meaning of high-frequency strongly context-independent: Fn contains all characters of An in the same order, every segment of Fn has a counterpart in An, and Fn has the highest frequency in the full name set.
Intuitive meaning of low-frequency strongly context-independent: Fn contains all characters of An in the same order, every segment of Fn has a counterpart in An, and Fn does not have the highest frequency in the full name set.
Intuitive meaning of high-frequency weakly context-independent: Fn contains all characters of An in the same order, most segments of Fn have a counterpart in An, and Fn has the highest frequency in the full name set.
Intuitive meaning of low-frequency weakly context-independent: Fn contains all characters of An in the same order, most segments of Fn have a counterpart in An, and Fn does not have the highest frequency in the full name set.
Intuitive meaning of forward-type context-dependent: Fn contains all characters of An in the same order, and the omitted segments of Fn lie mostly in the second half of Fn.
Intuitive meaning of centered-type context-dependent: Fn contains all characters of An in the same order, and roughly the same number of segments is omitted from the front and rear parts of Fn.
Intuitive meaning of backward-type context-dependent: Fn contains all characters of An in the same order, and the omitted segments of Fn lie mostly in the first half of Fn.
Intuitive meaning of the variant-order type: Fn contains all characters of An but in a changed order, and Fn has the highest frequency in the full name set.
Intuitive meaning of the variant-character type: Fn does not contain all characters of An, but the frequency of Fn is very high or its relative frequency in the full name set is very high.
Step C-4: sort the full names of the same class in the full name set according to the comprehensive priority function PRI(Cfn, An).
The comprehensive priority function PRI(Cfn, An) used in step C-4 is defined as follows:
where
Figure 821514DEST_PATH_IMAGE052
,
Figure 130135DEST_PATH_IMAGE053
are the weights assigned to each function in the comprehensive evaluation.
10. The method for acquiring Chinese full names from Web pages according to claim 8 or 9, characterized in that the constraint function set has the following meaning:
Constraint function 1: the proportion of An's characters that come from Fn
A full name contains all the Chinese characters of its abbreviation, that is, every Chinese character of An comes from Fn. Within the candidate full name set, a candidate that contains a higher proportion of An's characters has higher priority.
The formal definition and calculation of constraint function 1 are as follows:
Constraint function 2: the character order of Fn and An
In the abbreviation process, most abbreviations keep the character order of the full name: the characters of An are arranged strictly in the order in which they occur in Fn.
The formal definition and calculation of constraint function 2 are as follows:
Fn and An have the same character order, and all characters of An appear in Fn; if An contains a character that does not appear in Fn, the value of constraint function 2 is 0.
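The order condition behind constraint function 2 amounts to asking whether An is a subsequence of Fn. The sketch below only tests that condition and returns a boolean; it is not the patent's formula, which is not reproduced in the text, and the function name keeps_character_order is hypothetical.

```python
def keeps_character_order(fn: str, an: str) -> bool:
    """True if every character of an occurs in fn, in the same relative order."""
    remaining = iter(fn)
    # "ch in remaining" consumes the iterator up to the first match, so each
    # later character must be found after the previous one.
    return all(ch in remaining for ch in an)


print(keeps_character_order("北京大学", "北大"))
print(keeps_character_order("大学北京", "北大"))
```

A character of An that is absent from Fn also makes the test fail, which corresponds to the case where the text says the function takes the value 0.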
Constraint function 3: An's segment coverage of Fn
A full name usually consists of several segments. In some cases the abbreviation omits one or more segments of the full name, but in general the omitted segments do not exceed one half of the number of segments in the full name. The more segments of a candidate full name the abbreviation covers, the more likely the candidate is the full name.
The formal definition and calculation of constraint function 3 are as follows:
Figure 822651DEST_PATH_IMAGE056
Constraint function 4: the center of gravity of An's segment coverage of Fn
A full name usually consists of several segments. In some cases the abbreviation omits one or more segments of the full name, but the omitted segments should be distributed evenly over the full name and should not all be concentrated in its front part or its rear part.
The formal definition and calculation of constraint function 4 are as follows:
Figure 516937DEST_PATH_IMAGE057
Constraint function 5: the longest run of consecutive segments of Fn that are not covered by An
A candidate full name usually consists of several segments. In some cases the abbreviation omits one or more segments of the full name, but the omitted segments do not usually occur consecutively in the full name; that is, the probability that consecutive segments of the full name are omitted in the abbreviation is rather small.
The formal definition and calculation of constraint function 5 are as follows:
Figure 761974DEST_PATH_IMAGE058
where N denotes the number of uncovered segment strings contained in Fn.
Constraint function 6: the length relation between Fn and An
A standard abbreviation is usually not shortened excessively, so that most people can still infer the meaning from the name. The length of the full name corresponding to most abbreviations therefore lies within a range, generally 1.5 to 5 times the length of the abbreviation; the probability that the full name length falls outside this range is small.
The formal definition and calculation of constraint function 6 are as follows:
Figure 807290DEST_PATH_IMAGE059
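As a rough illustration of the length relation only, the check below tests whether a candidate falls inside the 1.5 to 5 times range stated in the text. It is not the formula shown in the figure: measuring length in characters, the inclusive bounds, and the function name length_ratio_in_typical_range are all assumptions.

```python
def length_ratio_in_typical_range(fn: str, an: str,
                                  low: float = 1.5, high: float = 5.0) -> bool:
    """True if len(fn) is between low and high times len(an), inclusive."""
    if not an:
        return False  # an empty abbreviation has no meaningful ratio
    return low <= len(fn) / len(an) <= high


print(length_ratio_in_typical_range("北京大学", "北大"))
```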
Constraint function 7: the frequency with which Fn occurs in GoogleArchSet(An)
When a full name is searched for on Google by its abbreviation, a candidate full name that occurs more often in GoogleArchSet(An) has higher priority.
The formal definition and calculation of constraint function 7 are as follows:
Figure 671341DEST_PATH_IMAGE060
When Fn is searched for by An, several candidate full names are sometimes obtained; they form the candidate full name set Set_CFN. For any candidate full name Cfn_i in Set_CFN, the analysis of FA(Cfn_i, An) can be compared with the corresponding values of the other candidate full names in Set_CFN.
The following 4 constraint functions are defined on the candidate full name set:
Constraint function 8: the relative proportion of An's characters that come from Cfn
Compared with constraint function 1, constraint function 8 emphasizes the relative standing of a candidate full name within Set_CFN.
The formal definition and calculation of constraint function 8 are as follows:
Figure 810680DEST_PATH_IMAGE061
Constraint function 9: the relative coverage of Fn within the candidate full name set
Compared with constraint function 3, constraint function 9 emphasizes the relative standing of a candidate full name within Set_CFN: if an abbreviation covers none of its candidate full names well, the candidate with the relatively higher coverage has higher priority.
The formal definition and calculation of constraint function 9 are as follows:
Figure 898722DEST_PATH_IMAGE062
Constraint function 10: the relative frequency of Fn within the candidate full name set
When Fn is searched for by An, the frequencies of all candidates in the candidate full name set are sometimes very low, which weakens the constraining effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate full name: within the candidate full name set, a candidate with a relatively higher frequency has higher priority.
The formal definition and calculation of constraint function 10 are as follows:
Constraint function 11: the relative position of Fn after the elements of the candidate full name set are sorted in ascending order of frequency
When the candidate full name set contains many elements, candidates with lower frequency are relatively less important.
The formal definition and calculation of constraint function 11 are as follows:
Figure 833497DEST_PATH_IMAGE064
The lower the value of constraint function 11, the less important the candidate full name.
The constraint axioms have the following meaning:
Constraint axiom 1: unequal length axiom
Formal representation:
Figure 830271DEST_PATH_IMAGE065
Intuitive meaning: in a full name-abbreviation relation, the number of characters of Fn must be greater than the number of characters of An.
Constraint axiom 2: declarative mood axiom
Formal representation:
Figure 558056DEST_PATH_IMAGE066
Intuitive meaning: Fn and An do not contain interrogative words.
Constraint axiom 3: formal non-repetition axiom
Formal representation:
Figure 905861DEST_PATH_IMAGE067
Intuitive meaning: in a full name-abbreviation relation, neither Fn nor An may be a Chinese character string of the form ss, where s is a Chinese character string.
Constraint axiom 4: semantic non-repetition axiom
Formal representation:
Figure 642873DEST_PATH_IMAGE068
Intuitive meaning: Fn must not be semantically repetitive.
Constraint axiom 5: full name-abbreviation equivalence axiom
Formal representation:
Figure 635099DEST_PATH_IMAGE069
Intuitive meaning: in a full name-abbreviation relation, Fn must belong to the full name set of An, and An must belong to the abbreviation set of Fn.
Constraint axiom 5 is not used to verify full name-abbreviation relations; it is used to extend the knowledge base of full name-abbreviation relations.
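The purely string-level axioms can be checked mechanically. The sketch below covers constraint axioms 1 to 3 under stated assumptions: the interrogative word list is a small illustrative sample rather than the patent's list, and constraint axiom 4 (semantic non-repetition) is left out because it needs semantic knowledge. The function names is_repeated_form and passes_basic_axioms are hypothetical.

```python
from typing import Sequence


def is_repeated_form(s: str) -> bool:
    """True if s has the form ss for some non-empty string."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and half > 0 and s[:half] == s[half:]


def passes_basic_axioms(fn: str, an: str,
                        interrogatives: Sequence[str] = ("什么", "谁", "哪", "吗")) -> bool:
    # Axiom 1: the full name must be strictly longer than the abbreviation.
    if len(fn) <= len(an):
        return False
    # Axiom 2: neither string may contain an interrogative word (sample list).
    if any(q in fn or q in an for q in interrogatives):
        return False
    # Axiom 3: neither string may be a doubled string of the form ss.
    if is_repeated_form(fn) or is_repeated_form(an):
        return False
    return True


print(passes_basic_axioms("北京大学", "北大"))
print(passes_basic_axioms("北大北大", "北大"))
```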
CN2011102531001A 2011-08-31 2011-08-31 Method for acquiring full names in Chinese from Web page Pending CN102955818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102531001A CN102955818A (en) 2011-08-31 2011-08-31 Method for acquiring full names in Chinese from Web page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102531001A CN102955818A (en) 2011-08-31 2011-08-31 Method for acquiring full names in Chinese from Web page

Publications (1)

Publication Number Publication Date
CN102955818A true CN102955818A (en) 2013-03-06

Family

ID=47764629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102531001A Pending CN102955818A (en) 2011-08-31 2011-08-31 Method for acquiring full names in Chinese from Web page

Country Status (1)

Country Link
CN (1) CN102955818A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463696A (en) * 2017-08-15 2017-12-12 中译语通科技(北京)有限公司 A kind of method of Webpage largest block extraction
CN108460016A (en) * 2018-02-09 2018-08-28 中云开源数据技术(上海)有限公司 A kind of entity name analysis recognition method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1607493A (en) * 2003-09-24 2005-04-20 王子尧 Chinese character unit whole tone code fetch input method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI-XING XIE ET AL.: "《EXTRACTING CHINESE ABBREVIATION-DEFINITION PAIRS FROM ANCHOR TEXTS》", 《IEEE:PROCEEDINGS OF THE 2011 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS》 *
谢丽星等: "《基于用户查询日志和锚文字的汉语缩略语识别》", 《中国计算机语言学研究前沿进展》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130306