CN102955819A - Method for acquiring shortened form in Chinese from Web page - Google Patents


Info

Publication number
CN102955819A
Authority
CN
China
Prior art keywords
short
called
candidate
abbreviation
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102531213A
Other languages
Chinese (zh)
Inventor
王石
丁远钧
符建辉
王卫民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd filed Critical KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN2011102531213A priority Critical patent/CN102955819A/en
Publication of CN102955819A publication Critical patent/CN102955819A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for acquiring Chinese short forms (abbreviations) from Web pages. The method comprises the steps of: inputting a known full name; selecting a query mode and constructing a query item; submitting the query item to Google to obtain anchor texts; extracting a corpus of full names and short forms from the anchor texts; extracting candidate short forms with extraction algorithms; and ranking the candidate short forms with a comprehensive priority function. Three query modes are involved, with two corresponding extraction algorithms. The invention also defines the constraint on the relation between a full name and its short form, which includes a set of constraint axioms and a set of constraint functions: the constraint axioms express the constraint between the full name and the short form qualitatively, and the constraint functions express it quantitatively. A classification method for full names and short forms is provided based on this constraint. The invention also defines a full name-short form relation graph and provides a joint verification method based on this graph and on the relation constraint between the full name and the short form.

Description

A method for obtaining Chinese abbreviations from Web pages
Technical field
The present invention relates to abbreviation acquisition technology in the fields of Chinese information processing and information retrieval, and in particular to a method for obtaining Chinese abbreviations from Web pages that covers multiple disciplines, works at large scale, and achieves high accuracy.
Background art
Natural language processing is an important topic in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. With the widespread use of computers and the Internet, the amount of natural language text that computers can access has grown at an unprecedented rate, and demand for applications such as text mining over massive information, information extraction, cross-language information processing, and human-computer interaction is growing rapidly. The object of natural language processing has accordingly shifted from small-scale restricted language to large-scale real text, and this research will have a far-reaching influence on people's lives.
Chinese information processing studies how to use computers to process Chinese information automatically. Chinese is a paratactic language: compared with Western languages, it lacks explicit markers, and its grammar, semantics, and pragmatics are more flexible, which makes it harder for computers to understand and process. Many difficulties remain to be overcome before computers can process Chinese information. At present, Chinese information processing has achieved some results in fields such as speech recognition, word segmentation, and machine translation. Raising the degree of automation in Chinese information processing will bring considerable benefits to China's science and technology, culture, economy, and security.
Information retrieval studies how to obtain the required information quickly and accurately from large volumes of complex information. After many years of development, information retrieval technology is now quite mature, and new retrieval techniques are developing toward intelligence, mobility, diversity, and personalization.
A full name (Full Name, Fn) is the complete designation of a name. An abbreviation (Abbreviation, An) is the designation obtained by simplifying and compressing the full name for the sake of brevity. If Fn and An stand in a full name-abbreviation relation, Fn is called the full name of An and An the abbreviation of Fn, denoted FA(Fn, An). Going from a full name to an abbreviation can be regarded as compressing information, and going from an abbreviation to a full name as decompressing it. For example, compressing c1 = "Inst. of Computing Techn. Academia Sinica" yields c2 = "institute is calculated by the Chinese Academy of Sciences", and compressing c2 yields c3 = "Computer Department of the Chinese Academy of Science"; decompressing c3 yields c2, and decompressing c2 yields c1. Full name and abbreviation are relative concepts: in the example above, c2 is an abbreviation with respect to c1 but a full name with respect to c3, so it is meaningless to say that c2 on its own is a full name or an abbreviation.
Acquiring full name-abbreviation relations is a basic and key problem in applications such as knowledge acquisition from text (Knowledge Acquisition from Text, KAT) and information retrieval. Acquisition methods fall into two broad classes. The first is pattern-based: it mainly uses linguistics and natural language processing techniques, extracts relation patterns through lexical and syntactic analysis, and then obtains full name-abbreviation relations by pattern matching; its accuracy depends on linguistic knowledge and the pattern base. The second is statistics-based: it relies mainly on corpora and statistical language models and obtains full name-abbreviation relations by computing the degree of association between concepts; its accuracy and efficiency rarely reach a satisfactory practical level. The acquisition problem can also be viewed from two angles: mining, which obtains full name-abbreviation pairs without any external input; and lookup, which finds the abbreviation for a known full name or the full name for a known abbreviation.
Unless otherwise specified, "full name" and "abbreviation" in the present invention refer to Chinese full names and Chinese abbreviations.
Summary of the invention
To address the limitations and low accuracy of existing techniques for acquiring full name-abbreviation relations, the present invention provides a highly accurate method for obtaining Chinese abbreviations from Web pages that is suitable for multiple disciplines and very large scale.
To solve the above problem, the invention provides a method for obtaining Chinese abbreviations from Web pages, comprising the steps of:
Step 1: input a given Chinese full name Fn;
Step 2: select a query pattern and construct a query item, submit the query item to the Google search engine, and save the first N anchor texts as the anchor corpus;
Step 3: using regular expressions, extract from the anchor corpus the sentences that contain the full name-abbreviation relation of the query item, and save them as the full name-abbreviation corpus;
Step 4: use the abbreviation extraction algorithm EAN to extract candidate abbreviations from the full name-abbreviation corpus, forming the candidate abbreviation set;
Step 5: classify the candidate abbreviation set based on the full name-abbreviation relation constraint, forming a candidate abbreviation set with type labels;
Step 6: jointly verify the candidate abbreviation set based on the full name-abbreviation relation constraint and the full name-abbreviation relation graph, forming the abbreviation set;
Step 7: rank by priority the abbreviations of the same type in the abbreviation set, forming an ordered abbreviation set with type labels.
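Step 3 can be pictured as a regular-expression filter over the anchor corpus. The following is a minimal sketch, not the patent's actual expression: the function name `relation_sentences`, the Chinese keyword `简称` (assumed to be the original of the word translated here as "abbreviation"), the sentence delimiters, and the 10-character gap are all assumptions made for illustration.

```python
import re
from typing import List


def relation_sentences(fn: str, anchor_texts: List[str], keyword: str = "简称") -> List[str]:
    """Keep the sentences in which the full name fn is followed, within a
    short gap, by the abbreviation keyword (gap size is an assumption)."""
    pattern = re.compile(re.escape(fn) + r".{0,10}?" + re.escape(keyword))
    hits = []
    for text in anchor_texts:
        # Split each anchor text into sentences on common delimiters.
        for sentence in re.split(r"[。！？!?；;\n]", text):
            if pattern.search(sentence):
                hits.append(sentence.strip())
    return hits
```

Only sentences that mention both the full name and the keyword survive, so the later extraction step works on a much smaller corpus.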
In the above technical scheme, step 2 uses three query patterns: query pattern 1, "Fn abbreviation"; query pattern 2, "Fn* abbreviation"; and query pattern 3, "full name Fn". Query pattern 2 extends query pattern 1 by adding a "*" between "Fn" and "abbreviation"; in a Google query, "*" can match any single word. Web pages often contain text such as "sinus rhythm (hereinafter referred to as hole rule)", which query pattern 1 cannot retrieve but query pattern 2 can. In an experiment with 4000 Chinese full names, An could be obtained for 64.65% of them with query pattern 1, 61.18% with query pattern 2, 21.02% with query pattern 3, 82.51% with query pattern 1 or 2, and 84.10% with query patterns 1, 2, and 3 together. Therefore, to improve search efficiency, query pattern 1 is tried first, then query pattern 2, and finally query pattern 3.
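The three query patterns can be sketched as simple string templates. This is an illustration only: the function name `build_queries`, the surrounding double quotes (exact-phrase search), and the Chinese keywords `简称` and `全称` (assumed originals of the translated words "abbreviation" and "full name") are assumptions, not details confirmed by the text.

```python
from typing import List


def build_queries(fn: str, abbr_kw: str = "简称", full_kw: str = "全称") -> List[str]:
    """Return the query strings for full name fn in priority order 1, 2, 3."""
    return [
        f'"{fn}{abbr_kw}"',   # query pattern 1: Fn directly followed by the keyword
        f'"{fn}*{abbr_kw}"',  # query pattern 2: wildcard between Fn and the keyword
        f'"{full_kw}{fn}"',   # query pattern 3: full-name keyword followed by Fn
    ]
```

Returning the queries as an ordered list keeps the priority (pattern 1, then 2, then 3) explicit for whatever code issues the searches.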
In the above technical scheme, the abbreviation extraction algorithm (EAN) of step 4 comprises two algorithms, CAEA1 and CAEA2. When query pattern 1 or query pattern 2 is selected in step 2, step 4 uses CAEA1 to extract An; when query pattern 3 is selected in step 2, step 4 uses CAEA2.
In the above technical scheme, if the abbreviation set obtained in step 6 is empty and a query pattern is still available in step 2, steps 2 to 7 are executed again. If the abbreviation set is empty and no alternative query pattern remains in step 2, the method exits, indicating that no abbreviation of the given full name could be found on the Web.
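The retry logic across query patterns, together with the choice between CAEA1 and CAEA2, can be sketched as a loop. The internals of search, extraction, and verification are not given at this point in the text, so they are passed in as callables; every name below (`acquire_abbreviations`, `search`, `extractors`, `verify`) is hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Extractor = Callable[[str, List[str]], List[str]]


def acquire_abbreviations(
    fn: str,
    patterns: Sequence[int],
    search: Callable[[str, int], List[str]],
    extractors: Dict[int, Extractor],
    verify: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Try each query pattern in priority order until verification
    leaves a non-empty abbreviation set; return [] if none succeeds."""
    for pattern in patterns:
        corpus = search(fn, pattern)                  # steps 2-3
        candidates = extractors[pattern](fn, corpus)  # step 4: CAEA1 or CAEA2
        verified = verify(fn, candidates)             # steps 5-6
        if verified:
            return verified                           # step 7 would rank these
    return []
```

A caller would map patterns 1 and 2 to the same CAEA1 implementation and pattern 3 to CAEA2, for example `extractors = {1: caea1, 2: caea1, 3: caea2}`.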
In the above technical scheme, the full name-abbreviation relation constraint used in step 6 is a four-tuple R = (Fn, An, F, A), where Fn is the full name, An is the abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraint between Fn and An quantitatively, and the constraint axiom set expresses it qualitatively. Both kinds of constraint are explained further below.
In the above technical scheme, the full name-abbreviation relation graph FAG (Fullname and Abbreviation Graph) used in step 6 is a four-tuple FAG = (F, A, E, f), where $F$ is the full-name set, $A$ is the abbreviation set, $F \cup A$ is the vertex set, $E$ is the undirected edge set, and $f$ is a mapping from $E$ to $F \times A$: for every $e \in E$ there always exist vertices $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$ holds; that is, $e$ is the undirected edge connecting $fn$ and $an$.
Beneficial effects: the present invention obtains the abbreviation corresponding to a known full name from the Web, that is, it acquires full name-abbreviation relations from the lookup angle. It uses a pattern-based method to obtain candidate abbreviations from Google and a statistics-based method to verify them. The method is multidisciplinary, large-scale, and highly accurate; it also explores the classification of abbreviations and implements it by computer, providing effective support for the intelligent acquisition of large-scale knowledge.
Description of drawings
Fig. 1 is an example of a full name-abbreviation relation graph;
Fig. 2 is a flow chart of obtaining abbreviations with query pattern 1 or query pattern 2;
Fig. 3 is a flow chart of obtaining abbreviations with query pattern 3;
Fig. 4 is a flow chart of the joint verification of the candidate abbreviation set;
Fig. 5 is the verification decision tree generated from the full name-abbreviation types and the constraint function set.
Embodiment
The invention is further described below with reference to the drawings and specific embodiments.
Before the method of the invention is described, the formation rules and word-formation patterns of abbreviations in the full name-abbreviation relation are first organized and summarized. In a full name-abbreviation relation, going from the full name to the abbreviation can be regarded as compressing information; this compression is sometimes accompanied by semantically equivalent substitution and by changes of character order. Full name-abbreviation relations are therefore divided into the plain type, the different-character type, and the different-order type.
Plain type: every character of the abbreviation appears in the full name, in the same order as in the full name. For example, Fn = "People's Republic of China (PRC)", An = "China".
Different-character type: some characters of the abbreviation do not appear in the full name; that is, going from the full name to the abbreviation involves not only compression of information but also semantically equivalent substitution. For example, Fn = "Wa Huang Shengmumiao", An = "Chinese mythology goddess mausoleum".
Different-order type: the order of the characters in the abbreviation differs from the order of the corresponding elements in the full name. For example, Fn = "Harbin the 6th pharmaceutical factory", An = "breathes out medicine six factories".
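The three relation types (plain, different-character, different-order) follow from two character-level checks, which can be sketched as below. The function names are hypothetical, and the Chinese strings in the usage note are assumed back-translations of the machine-translated examples, not text confirmed by the source.

```python
def is_subsequence(short: str, full: str) -> bool:
    """True if the characters of short occur in full in the same order."""
    remaining = iter(full)
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


def relation_type(fn: str, an: str) -> str:
    """Classify a (full name, abbreviation) pair by the three types."""
    if any(ch not in fn for ch in an):
        return "different-character"  # some character of An is absent from Fn
    if is_subsequence(an, fn):
        return "plain"                # all characters present, order kept
    return "different-order"          # all characters present, order changed
```

Under the assumed back-translations, `relation_type("中华人民共和国", "中国")` gives `"plain"` and `relation_type("哈尔滨第六制药厂", "哈药六厂")` gives `"different-order"`.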
The definitions related to the full name-abbreviation relation graph and the full name-abbreviation relation constraint are introduced in detail below.
A batch of full name-abbreviation pairs forms a bipartite graph, constructed as follows: all full names form the full-name set $F$, all abbreviations form the abbreviation set $A$, and $F$ and $A$ together form the vertex set $F \cup A$ of the graph. For $fn \in F$ and $an \in A$, if fn and an form a full name-abbreviation pair, an undirected edge connecting fn and an is constructed.
In the present invention, the full name-abbreviation relation graph is defined to represent the connection between Fn and An. The full name-abbreviation relation graph FAG (Fullname and Abbreviation Graph) is a four-tuple FAG = (F, A, E, f), where $F$ is the full-name set, $A$ is the abbreviation set, $F \cup A$ is the vertex set, $E$ is the undirected edge set, and $f$ is a mapping from $E$ to $F \times A$: for every $e \in E$ there always exist vertices $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$ holds; that is, $e$ is the undirected edge connecting $fn$ and $an$.
Fig. 1 is a full name-abbreviation relation graph, in which the full-name set is [formula image] and the abbreviation set is [formula image].
Given a full name-abbreviation relation graph FAG = (F, A, E, f), for every $e \in E$ there always exist $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$. The vertices $fn$ and $an$ are then said to be incident with the edge $e$, and the vertices $fn$ and $an$ are said to be adjacent.
Given a full name-abbreviation relation graph FAG = (F, A, E, f), for any vertex $v \in F \cup A$, all vertices adjacent to $v$ form the adjacent point set of $v$, denoted Adj($v$); the number of vertices adjacent to $v$ is called the degree of $v$, denoted $\deg(v)$.
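The bipartite graph, adjacent point set, and degree defined above can be sketched with an adjacency map. This is an illustrative data structure rather than the patent's implementation; the names `build_fag`, `adjacent`, and `degree`, and the `"F"`/`"A"` vertex tags, are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Vertex = Tuple[str, str]  # ("F", full name) or ("A", abbreviation)


def build_fag(pairs: Iterable[Tuple[str, str]]) -> Dict[Vertex, Set[Vertex]]:
    """Build an undirected bipartite graph from (full name, abbreviation) pairs."""
    adj: Dict[Vertex, Set[Vertex]] = defaultdict(set)
    for fn, an in pairs:
        u, v = ("F", fn), ("A", an)  # tags keep the two vertex sets disjoint
        adj[u].add(v)
        adj[v].add(u)
    return adj


def adjacent(adj: Dict[Vertex, Set[Vertex]], vertex: Vertex) -> Set[Vertex]:
    """Adjacent point set of a vertex (empty if the vertex is unknown)."""
    return set(adj.get(vertex, set()))


def degree(adj: Dict[Vertex, Set[Vertex]], vertex: Vertex) -> int:
    """Number of vertices adjacent to the given vertex."""
    return len(adj.get(vertex, ()))
```

An abbreviation vertex with degree greater than 1 is shared by several full names, which is the kind of structure a joint verification over the graph can exploit.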
In the present invention, the full name-abbreviation relation constraint is defined to represent the constraint between Fn and An. It is a four-tuple R = (Fn, An, F, A), where Fn is the full name, An is the abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraint between Fn and An quantitatively, and the constraint axiom set expresses it qualitatively. Before the constraint function set and the constraint axiom set are described in detail, the basic symbols used below are listed:
Fn: a full name;
An: an abbreviation of Fn;
Can: a candidate abbreviation of Fn;
GoogleArchSet(Fn): the Google anchor text set of Fn, that is, the set of the first 100 anchor texts returned when searching Google for the abbreviation corresponding to Fn; if the total number N of anchor texts returned is less than 100, GoogleArchSet(Fn) contains only those N anchor texts;
CanSet(Fn): the candidate abbreviation set of Fn, that is, the set of candidate abbreviations corresponding to Fn extracted from GoogleArchSet(Fn);
N_CanSet(Fn): the number of candidate abbreviations in CanSet(Fn);
FnSet(Can): the full-name set corresponding to the candidate abbreviation Can, that is, the candidate abbreviation set of every Fn in FnSet(Can) contains Can;
N_FnSet(Can): the number of full names in FnSet(Can);
FA(Fn, An): Fn and An stand in a full name-abbreviation relation;
length(str): the number of Chinese characters in the Chinese character string str;
N_word(Fn, An): the number of Chinese characters that appear in both Fn and An;
N_Clas(Fn): the number of segments (words) obtained after word segmentation of Fn;
N_Cover(Fn, An): the number of segments of Fn covered by An;
CoverSet(Fn, An): the set of segments of Fn covered by An;
p_i: the i-th segment of the full name;
p_1/p_2/.../p_m: the segmentation sequence formed by the segments p_1, p_2, ..., p_m, where "/" is the separator between segments;
centre(Fn): the position of the segment centre of Fn, that is, after Fn is segmented, the position of the middle segment or the mean position of the two middle segments: centre(Fn) = (N_Clas(Fn) + 1)/2;
d_i(Fn): the centre offset of the i-th segment p_i of Fn, that is, the displacement between the position of the i-th segment of Fn and the segment centre of Fn: d_i(Fn) = i - centre(Fn);
d_max(Fn): the maximum centre offset of Fn, that is, the maximum of the centre offsets of all segments of Fn: d_max(Fn) = (N_Clas(Fn) - 1)/2;
Len_i(Fn, An): the number of segments in the i-th uncovered segment string. After Fn is segmented, the segments not covered by An form one uncovered segment string where they are adjacent in Fn, and each forms a string of its own where they are not; the number of segments in the i-th uncovered segment string is written Len_i(Fn, An);
freq(Fn, An): the number of occurrences of An extracted from GoogleArchSet(Fn);
ε: an infinitesimally small number;
loca(Fn, Can): the frequency rank of Can in CanSet(Fn), that is, the position of Can after the elements of CanSet(Fn) are sorted in ascending order of freq(Fn, Can);
NoInclude(s1, Set): no Chinese character string in the set Set of Chinese character strings is a substring of the Chinese character string s1;
Interrogative: the set of interrogative words, such as "what" and "how";
concat(s1, s2): the Chinese character string obtained by concatenating the Chinese character strings s1 and s2;
NumIn(s, c): the number of times the Chinese character c occurs in the Chinese character string s.
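The segment-centre quantities in the symbol list are given by explicit formulas, so they can be written down directly. The sketch below takes the number of segments N_Clas(Fn) as a plain integer `n`; the Python function names are hypothetical.

```python
def centre(n: int) -> float:
    """Segment centre position: centre(Fn) = (N_Clas(Fn) + 1) / 2."""
    return (n + 1) / 2


def centre_offset(i: int, n: int) -> float:
    """Centre offset of the i-th segment (1-based): i - centre(Fn)."""
    return i - centre(n)


def max_centre_offset(n: int) -> float:
    """Maximum centre offset: (N_Clas(Fn) - 1) / 2."""
    return (n - 1) / 2
```

For a full name with 6 segments the centre lies at 3.5, between the third and fourth segments, and the offsets run from -2.5 to 2.5, so the largest absolute offset equals the maximum centre offset 2.5.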
The specific meaning of each function in the constraint function set is described below, in 11 parts:
Constraint function 1: the proportion of the characters of Can that come from Fn.
In general, the full name contains all the Chinese characters of the candidate abbreviation. For example, with Can = "Beijing University" and Fn = "Peking University", every Chinese character in Can comes from Fn. Within a candidate abbreviation set, the higher the proportion of a candidate's characters that appear in Fn, the higher the candidate's priority.
The formal definition and calculation of constraint function 1 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)): [formula image]
For example, Fn = "Confucian Temple", Can_1 = "Confucian temple", Can_2 = "Confucian temple". According to constraint function 1,
[formula image]
So the priority of Can_1 is higher than that of Can_2.
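The formula for constraint function 1 is not legible in this text, so the following is only one plausible reading of "the proportion of the characters of Can that come from Fn": the share of character positions in Can whose character also occurs in Fn. The function name `char_ratio` and the exact counting rule are assumptions and may differ from the patented definition.

```python
def char_ratio(fn: str, can: str) -> float:
    """Fraction of the characters of can that also appear in fn.

    Assumed reading of constraint function 1; not the patent's formula.
    """
    if not can:
        return 0.0
    shared = sum(1 for ch in can if ch in fn)
    return shared / len(can)
```

Under this reading a candidate made up entirely of characters from the full name scores 1.0, and each character foreign to the full name lowers the score.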
Constraint function 2: the character order of Fn and Can.
In the abbreviation process, most candidate abbreviations keep the character order of the full name. For example, with Fn = "Olympic Games" and Can = "Olympic Games", the three characters of Can are arranged strictly in the order in which they occur in Fn.
The formal definition and calculation of constraint function 2 are as follows (note: this function is the same as the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
[formula image]
Note: saying that Fn and Can have the same character order implies that all the characters of Can appear in Fn; if Can contains a character that does not appear in Fn, the value of constraint function 2 is 0.
Constraint function 3: the segment coverage of Fn by Can.
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the candidate abbreviation, but the omitted segments generally do not exceed one half of the segments of the full name. The more segments of the full name a candidate abbreviation covers, the more likely it is to be a correct abbreviation.
The formal definition and calculation of constraint function 3 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)): [formula image]
For example, Fn = "Shanghai/traffic/university", Can_1 = "submitting large", Can_2 = "submitting". According to constraint function 3,
[formula image]
So the priority of Can_1 is higher than that of Can_2.
Constraint function 4: the segment coverage centre of gravity of Can on Fn.
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the candidate abbreviation, but the omitted segments should be evenly distributed over the full name rather than all concentrated in its front or rear part. For example, with Can = "your boat group" and Fn = "China/Guizhou/aviation/industry/group/company", the omitted segments "China", "industry", and "company" lie in the front, middle, and rear parts of Fn respectively.
The formal definition and calculation of constraint function 4 are as follows:
[formula image]
where [formula image] corresponds to [formula image].
For example, Fn = "China/Guizhou/aviation/industry/group/company", Can_1 = "your boat group", Can_2 = "your boat". The segments of Fn covered by Can_1, "Guizhou", "aviation", and "group", are evenly distributed over Fn, whereas the segments of Fn covered by Can_2, "Guizhou" and "aviation", all lie in the first half of Fn. According to constraint function 4,
[formula image]
So the priority of Can_1 is higher than that of Can_2.
Constraint function 5: the maximum number of consecutive segments of Fn not covered by Can.
A candidate abbreviation is usually formed from several segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments usually do not occur consecutively in the full name; that is, the probability that consecutive segments of the full name are omitted in the abbreviation is relatively small.
The formal definition and calculation of constraint function 5 are as follows:
[formula image]
where N denotes the number of uncovered segment strings contained in Fn.
For example, Fn = "China/people/republic/common property/doctrine/Communist Youth League", Can_1 = "Chinese Communist Youth League", Can_2 = "Communist Youth League". The only segments of Fn not covered by Can_1 are "people" and "doctrine", whereas the segments of Fn not covered by Can_2, "China", "people", and "republic", are contiguous. According to constraint function 5,
[formula image]
So the priority of Can_1 is higher than that of Can_2.
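The quantities behind constraint function 5, the uncovered segment strings Len_i(Fn, An) and the longest of them, can be sketched from per-segment coverage flags. How a segment counts as "covered" is not spelled out here, so the flags are taken as input; the function names are hypothetical, and the final score formula of constraint function 5 is not reproduced.

```python
from typing import List, Sequence


def uncovered_runs(covered: Sequence[bool]) -> List[int]:
    """Lengths of maximal runs of consecutive uncovered segments.

    covered[k] is True if the (k+1)-th segment of Fn is covered by the
    candidate. The result lists Len_1, Len_2, ... in order of appearance.
    """
    runs: List[int] = []
    current = 0
    for flag in covered:
        if flag:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def longest_uncovered_run(covered: Sequence[bool]) -> int:
    """Largest Len_i, or 0 when every segment is covered."""
    return max(uncovered_runs(covered), default=0)
```

With six segments, the flags `[True, False, True, True, False, True]` give two runs of length 1, whereas `[False, False, False, True, True, True]` give a single run of length 3, the less favourable case described above.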
Constraint function 6: the length relation between Fn and Can.
A standard candidate abbreviation is usually not shortened excessively, so that most people can still grasp its meaning from the name. The full name corresponding to most candidate abbreviations therefore has a length within a certain range, generally 1.5 to 5 times the length of the candidate abbreviation; the probability that the full name length falls outside this range is small.
The formal definition and calculation of constraint function 6 are as follows (note: this function is an improvement on the one in the invention patent "A method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)): [formula image]
For example, Fn = "Inst. of Computing Techn. Academia Sinica", Can_1 = "Computer Department of the Chinese Academy of Science", Can_2 = "calculating institute". According to constraint function 6,
[formula image]
So the priority of Can_1 is higher than that of Can_2.
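The stated range, full name length between 1.5 and 5 times the candidate length, can be checked directly. The sketch below only tests membership in that range; it is not the scoring formula of constraint function 6, which is not legible in this text, and the function name is hypothetical.

```python
def length_in_usual_range(fn: str, can: str, low: float = 1.5, high: float = 5.0) -> bool:
    """True if len(fn) / len(can) lies within [low, high]."""
    if not can:
        return False  # an empty candidate has no meaningful ratio
    ratio = len(fn) / len(can)
    return low <= ratio <= high
```

Python's `len` counts Unicode code points, so for strings of Chinese characters it matches the character count length(str) used in the symbol list.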
Constraint function 7: the frequency with which Can occurs in GoogleArchSet(Fn).
When an abbreviation is looked up on Google from a full name, the more frequently a candidate abbreviation occurs in GoogleArchSet(Fn), the higher its priority.
The formal definition and calculation of constraint function 7 are as follows:
[formula image]
For example, Fn = "lithium ion battery", Can_1 = "lithium battery", Can_2 = "lithium electricity", Freq(Can_1) = 42, Freq(Can_2) = 12. According to constraint function 7, the priority of Can_1 is higher than that of Can_2.
When An is looked up from Fn, several candidate abbreviations are sometimes obtained; they form the candidate abbreviation set CanSet(Fn). For any candidate abbreviation Can_i in CanSet(Fn), the analysis of FA(Fn, Can_i) can take into account the corresponding values of the other candidate abbreviations in CanSet(Fn).
The following 4 constraint functions are defined on the candidate abbreviation set.
Constraint function 8: the relative proportion of the characters of Can that come from Fn.
Compared with constraint function 1, constraint function 8 emphasizes the relative standing of a candidate abbreviation within CanSet(Fn). For example, the abbreviations of some transliterated foreign words share no characters with their full names, and some abbreviations involve synonym substitution when restored to the full name.
The formal definition and calculation of constraint function 8 are as follows:
[formula image]
For example, Fn = "Confucius Temple", Can_1 = "Confucian temple", Can_2 = "Confucian temple". Although [formula image] is only 0.5, [formula image] is also only 0.5, so Can_1 cannot be judged an incorrect candidate abbreviation merely because its value of constraint function 1 is low.
Constraint function 9: the relative coverage of Fn by Can within the candidate abbreviation set of Fn.
Compared with constraint function 3, constraint function 9 emphasizes the relative standing of Can within CanSet(Fn). For example, when the candidates' coverage of the full name is generally not high, a candidate with relatively high coverage has higher priority.
The formal definition and calculation of constraint function 9 are as follows: [formula image]
For example, Fn = "Tsing-Hua University/with side/CD/share/limited/company", Can_1 = "Tsing Hua Tong Fang", Can_2 = "company of Tsing Hua Tong Fang". Although neither Can_1 nor Can_2 has high segment coverage of Fn, the coverage of Can_1 is relatively higher, so Can_1 has higher priority than Can_2.
Constraint function 10: frequency of Can within the candidate abbreviation set
When Can is searched for by Fn, sometimes every candidate in the candidate abbreviation set has a very low frequency, which weakens the constraining effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate: within the candidate abbreviation set, candidates with relatively higher frequency receive higher priority.
The formal definition and calculation of constraint function 10 are as follows:
Figure 803388DEST_PATH_IMAGE045
For example, with Fn = "office of development for poverty relief leading group of autonomous region", Can_1 = "office of poverty alleviation of autonomous region" and Can_2 = "office of poverty alleviation", although the frequencies of Cfn_1 and Cfn_2 are both low according to constraint function 7, their frequencies within the candidate abbreviation set are both relatively high according to constraint function 10.
Constraint function 11: relative position of Can after the elements of the candidate abbreviation set are sorted by ascending frequency
When the candidate abbreviation set has many elements, candidates with lower frequency are relatively less important.
The formal definition and calculation of constraint function 11 are as follows:
Figure 830250DEST_PATH_IMAGE046
The lower the value of constraint function 11, the less important the candidate abbreviation.
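As a rough illustration of constraint function 11, the sketch below sorts candidates by ascending frequency and scales the position of one candidate to the interval (0, 1]. The formula itself survives only as an image reference in this text, so the scaling (position divided by set size), the handling of ties and the name `relative_rank` are assumptions. The only properties taken from the description are that the most frequent candidate receives the top value and that lower values mean lower importance.

```python
from typing import Dict


def relative_rank(freqs: Dict[str, int], can: str) -> float:
    """Position of `can` after an ascending sort by frequency, scaled to (0, 1]."""
    ordered = sorted(freqs, key=lambda c: freqs[c])
    # Assumption: candidates with equal frequency share the later position.
    position = max(i for i, c in enumerate(ordered, start=1)
                   if freqs[c] == freqs[can])
    return position / len(ordered)


# Hypothetical frequencies for three candidates of one full name.
example = {"can_a": 12, "can_b": 3, "can_c": 1}
scores = {c: relative_rank(example, c) for c in example}
```

Under these assumptions `can_a`, the most frequent candidate, scores 1, which agrees with the reading of "f11=1" as "highest frequency in the candidate set" used later in the classification criteria.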
The constraint functions in the constraint function set have been explained above from 11 aspects. They express the constraints between Fn and Can quantitatively, whereas the constraint axioms express them qualitatively. The constraint axioms are described below:
Constraint axiom 1: unequal-length axiom
Formal expression:
Intuitive meaning: in a full-name/abbreviation relation, the number of characters in Fn must be greater than the number of characters in Can.
Constraint axiom 2: declarative-mood axiom
Formal expression:
Figure 584896DEST_PATH_IMAGE048
Intuitive meaning: Fn and Can contain no interrogative words such as "what" or "how".
Constraint axiom 3: non-repeating form axiom
Formal expression:
Figure 967949DEST_PATH_IMAGE049
Intuitive meaning: in a full-name/abbreviation relation, neither Fn nor Can may be a Chinese character string of the form ss, where s is a Chinese character string.
Constraint axiom 4: non-repeating semantics axiom
Formal expression:
Intuitive meaning: every Chinese character that appears in Fn must occur in Fn at least as many times as it occurs in Can.
For example, with Fn = "Wa Huang Shengmumiao" and Can = "mausoleum, Chinese mythology goddess mausoleum", the Chinese character "mausoleum", which appears in Fn, occurs twice in Can but only once in Fn, so Can is incorrect. This happens because no punctuation mark separates Can from the text that follows it in the corpus.
Constraint axiom 5: non-generic axiom
Formal expression:
Figure 611737DEST_PATH_IMAGE051
Intuitive meaning: a candidate abbreviation should correspond to at most 5 full names.
For example, among the 4000 full-name/abbreviation pairs tested, Can = "company" appears in the candidate abbreviation sets of 24 Fn, so it is a generic candidate abbreviation. Such candidates carry no meaning worth acquiring here and are discarded.
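The five axioms can be read as a single filter over (Fn, Can) pairs. The following is a minimal sketch based only on the intuitive meanings above, since the formal expressions are not reproduced in this text. The function name `satisfies_axioms`, the interrogative word list and the Chinese example strings are illustrative assumptions, not part of the patent.

```python
from collections import Counter

# Assumed interrogative words; the original list is not recoverable here.
INTERROGATIVES = ("什么", "怎么", "如何")


def is_doubled(s: str) -> bool:
    """True when s has the form ss for some non-empty string s."""
    half = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s[:half] == s[half:]


def satisfies_axioms(fn: str, can: str, full_names_of_can: int = 1) -> bool:
    # Axiom 1: Fn must be strictly longer than Can.
    if len(fn) <= len(can):
        return False
    # Axiom 2: neither string contains an interrogative word.
    if any(q in fn or q in can for q in INTERROGATIVES):
        return False
    # Axiom 3: neither string is a doubled string ss.
    if is_doubled(fn) or is_doubled(can):
        return False
    # Axiom 4: a character of Fn may not occur more often in Can than in Fn.
    fn_counts, can_counts = Counter(fn), Counter(can)
    if any(can_counts[ch] > fn_counts[ch] for ch in fn_counts):
        return False
    # Axiom 5: Can is a candidate for at most 5 full names.
    return full_names_of_can <= 5
```

The count of full names per candidate is passed in by the caller because it depends on the whole test collection rather than on a single pair.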
With the full-name/abbreviation relation graph and the full-name/abbreviation relation constraints defined in the present invention described in detail above, the embodiments of the method are introduced below.
The method of the present invention for obtaining a Chinese abbreviation from a Chinese full name comprises three major steps: obtaining the candidate abbreviation set, verifying the obtained candidate abbreviation set, and post-processing the verified result. Each is described below.
First, obtaining the candidate abbreviation set. Anchor corpora obtained with different query patterns have different structures, so the specific algorithms for extracting candidate abbreviations differ. Because query pattern 2 is an extension of query pattern 1, candidates are obtained in the same way with query patterns 1 and 2, but in a different way with query pattern 3. The two cases are introduced separately below.
As shown in Figure 2, the specific steps for generating the candidate abbreviation set with query pattern 1 or query pattern 2 are as follows:
Step 1-1: the user inputs a known Chinese full name Fn.
Step 1-2: construct the concrete query term according to query pattern 1, "Fn abbreviation", or query pattern 2, "Fn* abbreviation".
Step 1-3: submit the query term to the Google search engine and save the first N anchor texts as the anchor corpus.
Step 1-4: using regular expressions, obtain from the anchor corpus the full-name/abbreviation sentences that contain the query term, and save them as the full-name/abbreviation corpus.
Step 1-5: use algorithm CAEA1 to extract the tagged candidate abbreviation set from the full-name/abbreviation corpus.
Step 1-6: use the An right-boundary vocabulary to determine again the right boundary of each candidate in the candidate abbreviation set.
In step 1-1 above, a document containing a batch of full names may also be input; in that case, steps 1-2 to 1-6 are repeated for each Fn in the document to obtain its candidate abbreviation set.
In step 1-3 above, if Google returns more than 100 query results, N is 100; otherwise N is the number of query results returned.
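Steps 1-2 and 1-3 amount to string construction plus a cap on the number of anchor texts kept. The sketch below shows one way to express them. The helper names `build_query` and `anchors_to_keep` are hypothetical, and the keywords "简称" and "全称" are assumed to be the Chinese originals of the translated words "abbreviation" and "full name". No search request is issued.

```python
ABBREVIATION_KEYWORD = "简称"  # assumed original of "abbreviation"
FULL_NAME_KEYWORD = "全称"     # assumed original of "full name"


def build_query(fn: str, pattern: int) -> str:
    """Build the quoted query term for query pattern 1, 2 or 3."""
    if pattern == 1:
        return f'"{fn}{ABBREVIATION_KEYWORD}"'
    if pattern == 2:
        # "*" stands for any one word between Fn and the keyword.
        return f'"{fn}*{ABBREVIATION_KEYWORD}"'
    if pattern == 3:
        return f'"{FULL_NAME_KEYWORD}{fn}"'
    raise ValueError("pattern must be 1, 2 or 3")


def anchors_to_keep(result_count: int, cap: int = 100) -> int:
    """N is 100 when more than 100 results come back, else the result count."""
    return min(result_count, cap)


queries = [build_query("最高人民检察院", p) for p in (1, 2, 3)]
```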
In step 1-4 above, analysis of the full-name/abbreviation corpus shows that full-name/abbreviation sentences have definite structures, so they are divided by structure into six types: half-label type, rear-part type, All-in-One type, label-pair type, no-prefix type and prefix type. A candidate abbreviation extracted from a sentence of one of these six types takes the type of that sentence.
Half-label type: a pairing symbol appears on only one side of Can, which indicates that the sentence probably does not contain the complete An. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>the Supreme People's Procuratorate (abbreviated as</em>"height<b>. This error arises because the whole sentence was not obtained completely when the anchor corpus was collected.
Rear-part type: in the full-name/abbreviation sentence, Fn is the rear part of another full name "*Fn", so Can is likewise the rear part of the abbreviation "*Can" corresponding to "*Fn"; because of over-reduction, Can is probably not an abbreviation of Fn. For example, querying Fn = "pleural effusion" with query pattern 1 yields the sentence: suppurative<em>pleural effusion (abbreviated as</em>pyothorax). In this sentence, "pyothorax" is the abbreviation of "suppurative pleural effusion", but because of over-reduction "chest" is not an abbreviation of "pleural effusion". In some cases there is no over-reduction. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: the People's Republic of China (PRC)<em>Supreme People's Procuratorate (abbreviated as</em>Chinese the Supreme People's Procuratorate). In this sentence, "Chinese the Supreme People's Procuratorate" is the abbreviation of "Supreme People's Procuratorate of the People's Republic of China (PRC)", and "the Supreme People's Procuratorate" within it is also the abbreviation of "Supreme People's Procuratorate". How to determine that no over-reduction has occurred therefore requires further study.
All-in-One type: Fn appears together with other full names as one whole, and the abbreviation of the whole is a combined abbreviation of the several full names. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: the Supreme People's Court and<em>the Supreme People's Procuratorate (abbreviated as</em>two height). In this sentence, "Supreme People's Procuratorate" and "Supreme People's Court" form a whole, and "two height" is the abbreviation of the whole. This kind of corpus has an obvious structural feature: Fn is the last part of the whole and is preceded by a conjunction such as "and", "with" or "as well as".
Label-pair type: no Chinese characters precede Fn and Can is enclosed by paired symbols, so no algorithm is needed to determine the boundaries of Can and it is extracted directly. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>the Supreme People's Procuratorate (abbreviated as</em>"the Supreme People's Procuratorate").
No-prefix type: no Chinese characters precede Fn and Can is not enclosed by paired symbols; the left boundary of Can need not be determined, but its right boundary must be. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>the Supreme People's Procuratorate abbreviated as</em>the Supreme People's Procuratorate was founded in 1954.
Prefix type: Chinese characters precede Fn, and both the left and right boundaries of Can must be determined. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: Jia Chunwang was elected as<em>Supreme People's Procuratorate (abbreviated as</em>the Supreme People's Procuratorate) chief procurator.
In step 1-5 above, algorithm CAEA1 is specified as follows:
Candidate abbreviation extraction algorithm 1 (CAEA1)
Input: a full-name/abbreviation sentence Fa_sent
Output: a type-tagged candidate abbreviation Can
Step 1: Decompose Fa_sent into three parts, Before, Fn and Can_sent, where Fn is the known full name, Before is the Chinese character string that precedes Fn in the sentence, and Can_sent is the Chinese character string that follows "abbreviation" in the sentence. Write Can_sent = P_1 P_2 … P_n, where each P_i is one Chinese character. Define the left boundary left = 1 and the right boundary right = n of Can within Can_sent, and the type tag of Can, tag = null.
Step 2: if the left side of Can_sent has a pairing label and the right side has no corresponding pairing label
          then tag ← half-label type
        end if
Step 3: if before = null
          if tag = null
            then tag ← no-prefix type
          end if
          go to Step 6
        end if
        if before != null and tag = null
          then tag ← prefix type
        end if
Step 4: if the last character of Before is one of the conjunctions listed for the All-in-One type
          then for each P_i ∈ {P_1 P_2 … P_n}
            if P_i does not occur in Fn
              then tag ← All-in-One type
                   go to Step 5
            end if
          end for each
        end if
Step 5: for each P_i ∈ {P_1 P_2 … P_n}
          if P_i does not occur in Fn and P_i occurs in Before
            then left ← i+1
          end if
          if P_i occurs in Fn
            break
          end if
        end for each
        if left > 1
          then tag ← rear-part type
        end if
Step 6: if Can_sent is enclosed by a label pair and tag = no-prefix type
          then tag ← label-pair type
        end if
Step 7: for each P_i ∈ {P_left P_left+1 … P_n-1}
          if P_i occurs in the last segmented word of Fn and P_i+1 does not occur in Fn
            then right ← i
                 add the word to the right of P_i to the to-be-verified An right-boundary vocabulary
          end if
        end for each
Step 8: can ← P_left P_left+1 … P_right
        return can
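The steps of CAEA1 can be sketched in Python as follows. This is an illustrative reading of the algorithm rather than the patented implementation. The sketch assumes that Fa_sent has already been split into Before, Fn and Can_sent, that the last segmented word of Fn is supplied by the caller, that the pairing symbols and conjunctions are those listed in the constants, and that the next character stands in for "the word to the right of P_i". The Chinese strings in the usage lines are assumed originals of the translated examples.

```python
from typing import List, Tuple

# Opening symbol -> closing symbol; this set of pairing symbols is an assumption.
PAIRS = {"“": "”", "「": "」", "《": "》"}
# Conjunctions checked in Step 4; assumed originals of the translated examples.
CONJUNCTIONS = ("和", "与", "及")


def caea1(before: str, fn: str, fn_last_word: str,
          can_sent: str) -> Tuple[str, str, List[str]]:
    """Return (candidate, tag, characters proposed for the right-boundary list)."""
    tag = None
    paired = False
    # Step 2: an opening symbol with no matching close marks the half-label type.
    if can_sent and can_sent[0] in PAIRS:
        close_at = can_sent.find(PAIRS[can_sent[0]], 1)
        if close_at == -1:
            tag, can_sent = "half-label", can_sent[1:]
        else:
            paired, can_sent = True, can_sent[1:close_at]
    chars = list(can_sent)
    n = len(chars)
    left, right = 1, n  # 1-based indices, as in the pseudocode
    proposed: List[str] = []

    if not before:
        # Step 3: nothing precedes Fn, so Steps 4 and 5 are skipped.
        if tag is None:
            tag = "no-prefix"
    else:
        if tag is None:
            tag = "prefix"
        # Step 4: Fn closes a coordinated group and Can has characters outside Fn.
        if before[-1] in CONJUNCTIONS and any(c not in fn for c in chars):
            tag = "all-in-one"
        # Step 5: skip leading characters that come from Before rather than Fn.
        for i, c in enumerate(chars, start=1):
            if c not in fn and c in before:
                left = i + 1
            if c in fn:
                break
        if left > 1:
            tag = "rear-part"
    # Step 6: a symbol pair around Can replaces the no-prefix tag.
    if paired and tag == "no-prefix":
        tag = "label-pair"
    # Step 7: cut where a character of Fn's last word precedes a non-Fn character.
    for i in range(left, n):
        if chars[i - 1] in fn_last_word and chars[i] not in fn:
            right = i
            proposed.append(chars[i])  # stand-in for "the word to the right of P_i"
    # Step 8: the candidate is P_left ... P_right.
    return "".join(chars[left - 1:right]), tag, proposed


no_prefix = caea1("", "最高人民检察院", "检察院", "最高检成立于一九五四年")
rear_part = caea1("化脓性", "胸腔积液", "积液", "脓胸")
```

On the rear-part example the sketch returns the over-reduced single character discussed above, which is why candidates of that type still need the later verification stage.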
In step 1-6 above, the An right-boundary vocabulary is generated by manually verifying the to-be-verified An right-boundary vocabulary, to which algorithm CAEA1 adds entries dynamically.
As shown in Figure 3, the specific steps for generating the candidate abbreviation set with query pattern 3 are as follows:
Step 2-1: the user inputs a known Chinese full name Fn.
Step 2-2: construct the concrete query term according to query pattern 3, "full name Fn".
Step 2-3: submit the query term to the Google search engine and save the first 100 anchor texts as the anchor corpus.
Step 2-4: by constructing regular expressions, obtain from the anchor corpus the full-name/abbreviation sentences that contain the query term, and save them as the full-name/abbreviation corpus.
Step 2-5: use algorithm CAEA2 to extract candidate abbreviations from the full-name/abbreviation corpus, forming the candidate abbreviation set.
In step 2-1 above, a document containing a batch of full names may also be input; in that case, steps 2-2 to 2-5 are repeated for each Fn in the document to obtain its candidate abbreviation set.
In step 2-3 above, if Google returns more than 100 query results, N is 100; otherwise N is the number of query results returned.
In step 2-5 above, algorithm CAEA2 is specified as follows:
Candidate abbreviation extraction algorithm 2 (CAEA2)
Input: a full-name/abbreviation sentence Fa_sent
Output: a candidate abbreviation Can
Step 1: Decompose Fa_sent into three parts, Can_sent, Fn and Behind, where Fn is the known full name, Can_sent is the Chinese character string that precedes "full name" in the sentence, and Behind is the Chinese character string that follows Fn in the sentence.
Step 2: Segment Can_sent and Behind into words and tag their parts of speech; the segmentation results are {P_1 P_2 … P_k} and {R_1 R_2 … R_n}. For Can within Can_sent, define the first-level left-boundary index left1 = 1, the second-level left-boundary index left2 = 1, the left-boundary index left = 1 and the right-boundary index right = k. Define the flag flag_v = 0, meaning that verbs may still truncate, and the flag flag_right = 0, meaning that the right boundary may still be truncated by part of speech.
Step 3: for each P_i ∈ {P_1 P_2 … P_k}
          if P_i shares a character with fn
            then flag_v ← 1   // verbs after P_i can no longer serve as the left boundary
          end if
          if P_i shares a character with fn and left2 = 1
            then left2 ← i    // P_i may be the first word of can
          end if
          if the part of speech of P_i is "conjunction", "preposition" or "auxiliary word"
            then left1 ← i+1
          end if
          if the part of speech of P_i is "verb" and flag_v = 0
            then left1 ← i+1
          end if
        end for each
Step 4: for each P_j ∈ {P_k P_k-1 … P_1}
          if P_j shares a character with fn
            then flag_right ← 1   // P_j may be a word of can
          end if
          if the part of speech of P_j is "conjunction", "preposition", "auxiliary word" or "verb"
             and flag_right = 0
            then right ← j-1
          end if
          if P_j shares a character with behind and P_j shares no character with fn
            then right ← j-1
          end if
          if P_j is a punctuation mark
            then right ← j-1
          end if
        end for each
Step 5: if left2 <= right
          then left ← left2
        end if
        if left1 <= right
          then left ← left1
        end if
Step 6: return can ← {P_left … P_right}
The candidate abbreviation set is obtained by the operations above. Verification of the candidates in the set is discussed next; with reference to Figure 4, the specific steps are as follows:
Step 6-1: verify each candidate in the candidate abbreviation set against constraint axioms 1-5 of the constraint axiom set.
Step 6-2: classify the candidates in the candidate abbreviation set on the basis of the constraint function set.
Step 6-3: construct the full-name/abbreviation relation graph and use it to verify each candidate in the candidate abbreviation set.
Step 6-4: generate a decision tree (see Figure 5) from the candidates' tag classes, classification categories and the constraint function set, and use the decision tree to classify the candidates in the candidate abbreviation set; candidates of class "F" are removed and candidates of class "T" are retained.
In step 6-1 above, for each candidate Can in the candidate abbreviation set, check whether Fn and Can satisfy the constraints of axioms 1-4; if they do not, the candidate abbreviation is wrong.
In step 6-2 above, the classification proceeds as follows. According to whether the abbreviation contains different characters or a different order, abbreviations are divided into the ordinary type, the different-character type and the different-order type. The ordinary type is further divided, according to context dependence, into the strong context-independent type, the weak context-independent type and the context-dependent type. The context-independent types are further divided into a high-frequency type and a low-frequency type according to how high the frequency of Fn is, relatively, in the full-name set, and the context-dependent type is divided into the forward type, the centered type and the backward type according to the coverage center of gravity of An over Fn (see Table 1).
Table 1: Types of abbreviation
Figure 894950DEST_PATH_IMAGE052
The specific classification criteria and the conditions that each class of abbreviation must satisfy are given in Table 2.
Table 2: Classification criteria for abbreviations
Category | Conditions to be satisfied
High-frequency strong context-independent | f1=1 f2=1 [image 053] f3=1 [image 053] f11=1
Low-frequency strong context-independent | f1=1 f2=1 [image 053] f3=1 [image 053] f11<1
High-frequency weak context-independent | f1=1 [image 053] f2=1 [image 053] 0.823 [image 054] f3<1 [image 053] f9=1 [image 053] f11=1
Low-frequency weak context-independent | f1=1 [image 053] f2=1 [image 053] 0.823 [image 054] f3<1 [image 053] f9=1 [image 053] f11<1
Forward context-dependent | f1=1 [image 053] f2=1 [image 053] f3 [image 055] 1 [image 053] f4 [image 056] 0.5
Centered context-dependent | f1=1 [image 053] f2=1 [image 057] 0.5 [image 055] f4 [image 055] 0.5 [image 053] (f3 0.823 [image 058] f9 [image 055] 1)
Backward context-dependent | f1=1 [image 053] f2=1 [image 053] f3 [image 055] 1 [image 053] f4 [image 059] 0.5
Different-order type | f1=1 [image 053] f2=0 [image 053] f11=1
Different-character type | f1 [image 055] 1 [image 060] f7 [image 061] f10 [image 062] f7 [image 063] 0.05 [image 053] f9=1 [image 053] f11=1))
Note that, because context is a semantic-level concept, it is difficult for a computer to judge intelligently whether a candidate abbreviation is context-dependent; the present invention therefore uses the constraint functions to make an approximate judgment from the word-formation rules.
In Table 2, the intuitive meaning of high-frequency strong context-independent is: Fn contains all the characters of Can in the same order, every segmented word of Fn has a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of low-frequency strong context-independent is: Fn contains all the characters of Can in the same order, every segmented word of Fn has a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of high-frequency weak context-independent is: Fn contains all the characters of Can in the same order, most segmented words of Fn have a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of low-frequency weak context-independent is: Fn contains all the characters of Can in the same order, most segmented words of Fn have a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of forward context-dependent is: Fn contains all the characters of Can in the same order, and the segmented words omitted from Fn lie mostly in the latter half of Fn.
In Table 2, the intuitive meaning of centered context-dependent is: Fn contains all the characters of Can in the same order, and roughly the same number of segmented words is omitted from the front part and the rear part of Fn.
In Table 2, the intuitive meaning of backward context-dependent is: Fn contains all the characters of Can in the same order, and the segmented words omitted from Fn lie mostly in the first half of Fn.
In Table 2, the intuitive meaning of the different-order type is: Fn contains all the characters of Can but the order is changed, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of the different-character type is: Fn does not contain all the characters of Can, but the frequency of Can is very high or its relative frequency within the candidate abbreviation set is very high.
In step 6-3 above, when the input is a single full name, or the input full-name document contains fewer than 1000 full names, this step is not performed; otherwise, the complete full-name/abbreviation relation graph FAG = (F, A, E, f) is constructed by the construction method for the relation graph introduced above. The specific method of verification with the relation graph is as follows:
Figure 826686DEST_PATH_IMAGE064
If,
Figure 794642DEST_PATH_IMAGE065
Then
Figure 359616DEST_PATH_IMAGE066
If the abbreviation type of v_i is not a context-independent type, then the candidate abbreviation v_i is wrong for the full name v_k.
In step 6-4 above, class "F" means incorrect and class "T" means correct.
After the verification above, the abbreviation set of the known full name is obtained. Ranking of the abbreviations in the set is discussed next.
In the present invention, abbreviations of the same class in the abbreviation set are ranked by the priority synthesis function PRI(Cfn, An).
PRI(Cfn, An) is defined as follows:
Figure 540061DEST_PATH_IMAGE067
where
Figure 405249DEST_PATH_IMAGE068
is the weight assigned to each function in the comprehensive evaluation. The correspondence between F_i and
Figure 596376DEST_PATH_IMAGE069
is given in Table 3, and the magnitude of
Figure 314933DEST_PATH_IMAGE069
is obtained experimentally according to how strongly each function constrains the full-name/abbreviation relation:
Table 3
No. | Function content | Function weight (Figure 777837DEST_PATH_IMAGE069)
F1 | Ratio of the characters of Can that come from Fn | 0.12
F2 | Character order of Fn and Can | 0.08
F3 | Word-coverage rate of Can over Fn | 0.06
F4 | Coverage center of gravity of Can over the segmented words of Fn | 0.08
F5 | Longest run of consecutive segmented words in Fn not covered by Can | 0.04
F6 | Length relation between Fn and Can | 0.06
F7 | Frequency with which Can occurs in GoogleArchSet(Fn) | 0.10
F8 | Relative ratio of the characters of Can that come from Fn | 0.12
F9 | Relative coverage ratio of Can within the candidate abbreviation set | 0.10
F10 | Frequency of Can within the candidate abbreviation set | 0.12
F11 | Relative position of Can after the elements of the candidate abbreviation set are sorted by ascending frequency | 0.14
To illustrate the practical effect of the present invention, a large number of experiments were carried out in which the method was used to find abbreviations for full names from multiple disciplines. We randomly selected 3910 Chinese Fn from multiple disciplines and used the invention to look up their An; the results are shown in Table 4.
Table 4: Experimental results of looking up An by Fn
Number of Fn | Number of Fn for which An was obtained | Percentage of Fn for which An was obtained | Total number of An | Accuracy of the An found (sampled)
3910 | 3288 | 84.09% | 5321 | 94.81%
From the experiment above we randomly selected 2140 abbreviations and verified them with the joint verification method; Table 5 shows the verification results.
Table 5: Results of joint verification
True label | Y | N | Accuracy rate | Recall rate
Y | 1745 | 36 | 95.87% | 97.98%
N | 75 | 284 | 88.75% | 79.11%
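The percentages in the two result tables can be rechecked from the raw counts. The short computation below assumes that the rows of the joint-verification table are true labels and the columns are the labels assigned by the verification method. Under that reading the "accuracy rate" column matches per-class precision. The variable names are illustrative.

```python
# Joint verification counts: rows are true labels, columns are assigned labels.
yy, yn = 1745, 36   # true Y assigned Y, true Y assigned N
ny, nn = 75, 284    # true N assigned Y, true N assigned N

total = yy + yn + ny + nn            # size of the verified sample

precision_y = yy / (yy + ny)         # share of assigned Y that are truly Y
recall_y = yy / (yy + yn)            # share of true Y that were assigned Y
precision_n = nn / (nn + yn)         # share of assigned N that are truly N
recall_n = nn / (nn + ny)            # share of true N that were assigned N

# Share of the 3910 sampled full names for which an abbreviation was obtained.
coverage = 3288 / 3910

for name, value in [("precision Y", precision_y), ("recall Y", recall_y),
                    ("precision N", precision_n), ("recall N", recall_n),
                    ("coverage", coverage)]:
    print(f"{name}: {100 * value:.2f}%")
```

The four counts add up to the 2140 sampled abbreviations. The computed precision for class Y is 95.879...%, so the tabulated 95.87% appears to be truncated rather than rounded.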
The experiments support the following conclusion: the present invention is effective at obtaining Chinese abbreviations, is widely applicable, and remedies the shortcomings of previous methods for obtaining Chinese abbreviations.
The embodiments described above describe preferred implementations of the present invention and do not limit its spirit and scope. Without departing from the design concept of the invention, all modifications and improvements that ordinary engineers and technicians in the field make to the technical solution of the invention shall fall within the scope of protection of the invention. The technical content for which protection is claimed is fully recorded in the claims.

Claims (10)

1. A method for obtaining Chinese abbreviations from Web pages, characterized in that it comprises the following steps:
Step 1: input a given Chinese full name Fn;
Step 2: select a query pattern and construct a query term, submit the query term to the Google search engine, and save the first N anchor texts as the anchor corpus;
Step 3: using regular expressions, obtain from the anchor corpus the sentences that contain the query term and express a full-name/abbreviation relation, and save them as the full-name/abbreviation corpus;
Step 4: use the abbreviation extraction algorithm EAN to extract candidate abbreviations from the full-name/abbreviation corpus, forming the candidate abbreviation set;
Step 5: classify the candidate abbreviation set on the basis of the full-name/abbreviation relation constraints, thereby forming a candidate abbreviation set with class labels;
Step 6: perform joint verification of the candidate abbreviation set on the basis of the full-name/abbreviation relation constraints and the full-name/abbreviation relation graph, thereby forming the abbreviation set;
Step 7: rank abbreviations of the same type in the abbreviation set by priority, thereby forming an ordered abbreviation set with class labels.
2. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in said step 2, if Google returns more than 100 query results, N is 100; otherwise N is the number of query results returned by Google.
3. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in said step 2, the query patterns comprise three kinds: query pattern 1, "Fn abbreviation"; query pattern 2, "Fn* abbreviation"; and query pattern 3, "full name Fn". Query pattern 2 is an extension of query pattern 1 in which a "*" is added between "Fn" and "abbreviation"; in a Google query, "*" matches any one word. Web pages often contain corpus text such as "sinus rhythm", which cannot be retrieved with query pattern 1 but can be retrieved with query pattern 2. The search order is query pattern 1 first, then query pattern 2, and finally query pattern 3.
4. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in said step 4, the abbreviation extraction algorithm EAN comprises two algorithms, CAEA1 and CAEA2; when query pattern 1 or query pattern 2 is selected in step 2, CAEA1 is used to extract An in step 4; when query pattern 3 is selected in step 2, CAEA2 is used to extract An in step 4.
5. A kind of method of obtaining Chinese abbreviation from the Web webpage according to claim 4 is characterized in that: when step 2 was selected query pattern 1 or query pattern 2, step 4 and step 5 were carried out following steps:
Steps A-1, the candidate who utilizes algorithm CAEA1 to extract with tag from full abbreviation language material are called for short collection;
Steps A-2, utilize An right margin vocabulary to determine that again the candidate is called for short the right margin that concentrated candidate is called for short;
In steps A-2, An right margin vocabulary is to be generated through artificial checking by An right margin vocabulary to be verified, in algorithm CAEA1 An right margin vocabulary to be verified is added dynamically;
In step 3 above, the full-name/abbreviation sentences in the corpus are divided into six types: half-label type, rear-part type, all-in-one type, label-pair type, no-prefix type, and prefix type. A candidate abbreviation extracted from a sentence of one of these six types takes the type of that sentence;
Half-label type: a pairing symbol appears on only one side of Can, which indicates that the sentence probably does not contain a complete An;
Rear-part type: in the sentence, Fn is the rear part of another full name "*Fn", so Can is likewise the rear part of the corresponding abbreviation "*Can"; because it has been truncated too far, Can is probably not the abbreviation of Fn;
All-in-one type: Fn appears together with other full names as a component of a whole, and the abbreviation of the whole is a combined abbreviation of several full names; corpus text of this kind has an obvious structural feature: a conjunction appears before Fn;
Label-pair type: no Chinese character precedes Fn, and Can is marked by paired symbols; Can can be extracted directly, without using the algorithm to determine its boundaries;
No-prefix type: no Chinese character precedes Fn, and Can is not marked by paired symbols; the left boundary of Can need not be determined, but its right boundary must be;
Prefix type: Chinese characters precede Fn; both the left boundary and the right boundary of Can must be determined;
In step A-1, algorithm CAEA1 is specified as follows:
Candidate abbreviation extraction algorithm 1 (candidate abbreviation extract algorithm, CAEA1)
Input: full-name/abbreviation sentence Fa_sent
Output: type-tagged candidate abbreviation Can
Decompose Fa_sent into three parts, before, Fn and can_sent, where Fn is the known full name, before is the Chinese character string that precedes Fn in the sentence, and can_sent is the Chinese character string that follows the word "abbreviation" in the sentence;
Write can_sent character by character as can_sent = P_1 P_2 … P_n, where P_i denotes one Chinese character;
Define the left boundary left = 1 and the right boundary right = n of Can within can_sent, and define the type tag of Can as tag = null;
if the left end of can_sent is a pairing label and the right end is not the corresponding pairing label
    then tag ← half-label type
end if
if before = null
    if tag = null
        then tag ← no-prefix type
    end if
    go to step 6
end if
if before != null and tag = null
    then tag ← prefix type
end if
if the last character of before is one of the conjunction characters meaning "and" or "with"
    then for each P_i ∈ {P_1, P_2, …, P_n}
        if P_i does not occur in Fn
            then tag ← all-in-one type
            go to step 5
        end if
    end for each
end if
for each P_i ∈ {P_1, P_2, …, P_n}
    if P_i does not occur in Fn and P_i occurs in before
        then left ← i+1
    end if
    if P_i occurs in Fn
        break;
    end if
end for each
if left > 1
    then tag ← rear-part type
end if
if can_sent is marked by a label pair and tag = no-prefix type
    then tag ← label-pair type
end if
for each P_i ∈ {P_left, P_left+1, …, P_n-1}
    if P_i occurs in the last word segment of Fn and P_i+1 does not occur in Fn
        then right ← i
        add the word to the right of P_i to the to-be-verified An right-boundary vocabulary
    end if
end for each
can ← P_left P_left+1 … P_right
return can.
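The two boundary scans of CAEA1 can be sketched in Python as follows. This is a partial illustration only: the type tagging and the jumps between steps are left out, the function name `caea1_boundaries` is hypothetical, the last word segment of Fn is passed in by the caller, and the single character after the cut stands in for "the word to the right of P_i" (an assumption, since the claim does not say how that word is delimited).

```python
def caea1_boundaries(before, fn, can_sent, fn_last_word):
    """Return (candidate, pending) for one full-name/abbreviation sentence.

    pending collects characters proposed for the to-be-verified
    right-boundary vocabulary (a simplification of the claim text).
    """
    n = len(can_sent)
    left = 0      # 0-based index of the first kept character
    right = n     # exclusive end of the kept slice

    # Left boundary: skip leading characters that repeat text from
    # `before`, stopping at the first character that belongs to Fn.
    for i, ch in enumerate(can_sent):
        if ch not in fn and ch in before:
            left = i + 1
        if ch in fn:
            break

    # Right boundary: cut after a character from Fn's last word segment
    # when the next character is not in Fn. As in the claim text there is
    # no early exit, so the last matching position wins.
    pending = []
    for i in range(left, n - 1):
        if can_sent[i] in fn_last_word and can_sent[i + 1] not in fn:
            right = i + 1
            pending.append(can_sent[i + 1])

    return can_sent[left:right], pending
```

With `before=""`, `fn="北京大学"` and `can_sent="北大是中国最好的大学"`, the right-boundary scan cuts after "大" because "是" is not in Fn, and "是" is proposed as a boundary word for manual verification.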
6. The method for acquiring Chinese abbreviations from Web pages according to claim 4, characterized in that: when query pattern 3 is selected in step 2, steps 4 and 5 carry out the following steps:
Step B-1: use algorithm CAEA2 to extract the candidate abbreviation set from the full-name/abbreviation corpus;
Algorithm CAEA2 is specified as follows:
Candidate abbreviation extraction algorithm 2 (candidate abbreviation extract algorithm, CAEA2)
Input: full-name/abbreviation sentence Fa_sent
Output: candidate abbreviation Can
Decompose Fa_sent into three parts, can_sent, Fn and behind, where Fn is the known full name, can_sent is the Chinese character string that precedes the word "full name" in the sentence, and behind is the Chinese character string that follows Fn in the sentence;
Segment can_sent and behind separately and tag parts of speech; the segmentation results are {P_1, P_2, …, P_k} and {R_1, R_2, …, R_n}. Define, for Can within can_sent, the first-level left boundary index left1 = 1, the second-level left boundary index left2 = 1, the left boundary index left = 1, and the right boundary index right = k;
Define the flag flag_v = 0, which records whether a verb may still cut the left boundary, and the flag flag_right = 0, which records whether the right boundary may still be cut by part of speech;
for each P_i ∈ {P_1, P_2, …, P_k}
    if P_i shares a character with fn
        then flag_v ← 1; // verbs after P_i can no longer serve as the left boundary
    end if
    if P_i shares a character with fn and left2 = 1
        then left2 ← i; // P_i may be the first word segment of can
    end if
    if the part of speech of P_i is "conjunction" or "preposition" or "auxiliary word"
        then left1 ← i+1;
    end if
    if the part of speech of P_i is "verb" and flag_v = 0
        then left1 ← i+1;
    end if
end for each
for each P_j ∈ {P_k, P_k-1, …, P_1}
    if P_j shares a character with fn
        then flag_right ← 1; // P_j may be a word segment of can
    end if
    if the part of speech of P_j is "conjunction" or "preposition" or "auxiliary word" or "verb"
       and flag_right = 0
        then right ← j-1;
    end if
    if P_j shares a character with behind and P_j shares no character with fn
        then right ← j-1;
    end if
    if P_j is a punctuation mark
        then right ← j-1;
    end if
end for each
if left2 <= right
    then left ← left2
end if
if left1 <= right
    then left ← left1
end if
return can ← {P_left, …, P_right}.
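A minimal Python sketch of CAEA2 follows, assuming the sentence has already been segmented and tagged by some external tool. The function names and the part-of-speech labels ("conj", "prep", "aux", "verb", "punct") are hypothetical; the control flow mirrors the claim text, including its 1-based indices.

```python
CUT_POS = {"conj", "prep", "aux"}  # assumed labels for conjunction, preposition, auxiliary word

def shares_char(word, text):
    """True if word and text have at least one character in common."""
    return any(ch in text for ch in word)

def caea2(segments, fn, behind):
    """segments: list of (word, pos) pairs for the text before "full name"."""
    k = len(segments)
    left1 = left2 = left = 1   # 1-based boundary indices, as in the claim
    right = k
    flag_v = False             # set once a segment shares a character with fn

    for i, (word, pos) in enumerate(segments, start=1):
        if shares_char(word, fn):
            flag_v = True
            if left2 == 1:
                left2 = i
        if pos in CUT_POS or (pos == "verb" and not flag_v):
            left1 = i + 1

    flag_right = False
    for j in range(k, 0, -1):  # scan from the last segment backwards
        word, pos = segments[j - 1]
        in_fn = shares_char(word, fn)
        if in_fn:
            flag_right = True
        if (pos in CUT_POS or pos == "verb") and not flag_right:
            right = j - 1
        if shares_char(word, behind) and not in_fn:
            right = j - 1
        if pos == "punct":
            right = j - 1

    if left2 <= right:
        left = left2
    if left1 <= right:
        left = left1
    return [word for word, _ in segments[left - 1:right]]
```

For the segmented text 他 / 毕业 / 于 / 北大 / ， preceding "full name 北京大学", the verb and the preposition push the left boundary to the fourth segment, the trailing punctuation pulls the right boundary back to it, and the sketch returns the single segment 北大.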
7. The method for acquiring Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 6 above, if the abbreviation set is empty and a query pattern is still available in step 2, steps 2 to 7 are executed again; if the abbreviation set is empty and no alternative query pattern remains in step 2, the method exits, indicating that no abbreviation of the given full name can be found on the Web.
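The retry behaviour of this claim amounts to a simple fallback loop over the query patterns. The sketch below assumes a caller-supplied function `run_pipeline(fn, pattern)` (a hypothetical name) that executes steps 2 to 7 for one pattern and returns the resulting abbreviation set as a list.

```python
def acquire_abbreviations(fn, run_pipeline, patterns=(1, 2, 3)):
    """Try each query pattern in order until one yields abbreviations.

    Returns (pattern, abbreviations), or (None, []) when every pattern
    has been tried and the abbreviation set is still empty.
    """
    for pattern in patterns:
        abbreviations = run_pipeline(fn, pattern)
        if abbreviations:
            return pattern, abbreviations
    return None, []
```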
8. The method for acquiring Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 6 above, the full-name/abbreviation relation constraint is a four-tuple R = (Fn, An, F, A), where Fn is a full name, An is an abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraint between Fn and An quantitatively, and the constraint axiom set expresses it qualitatively;
The full-name/abbreviation relation graph FAG (Fullname and Abbreviation Graph) is a four-tuple FAG = (F, A, E, f), where F is the full-name set, A is the abbreviation set, $F \cup A$ is the vertex set, E is the undirected edge set, and f is a mapping from E to $F \times A$: for every edge $e \in E$ there are always vertices $v_i \in F$ and $v_j \in A$ such that $f(e) = (v_i, v_j)$ holds, that is, e is the undirected edge connecting $v_i$ and $v_j$.
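As a minimal sketch, the relation graph can be held as two vertex sets and a set of (full name, abbreviation) pairs. Because every edge joins one full name to one abbreviation, an ordered pair is enough to represent the undirected edge. The function names are hypothetical and the example data are invented for illustration.

```python
def build_fag(pairs):
    """Build (F, A, E) from an iterable of (full_name, abbreviation) tuples."""
    pairs = list(pairs)
    full_names = {fn for fn, _ in pairs}        # vertex subset F
    abbreviations = {an for _, an in pairs}     # vertex subset A
    edges = set(pairs)                          # each edge joins F to A
    return full_names, abbreviations, edges

def full_names_linked_to(edges, abbreviation):
    """All full names connected to the given abbreviation vertex."""
    return {fn for fn, an in edges if an == abbreviation}
```

An abbreviation vertex linked to several full names, as "北大" would be in the test data, is the situation the graph-based verification step has to examine.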
9. The method for acquiring Chinese abbreviations from Web pages according to claim 8, characterized in that step 6 is implemented as follows:
Step 6-1: use constraint axioms 1-5 of the constraint axiom set to verify each candidate abbreviation in the candidate abbreviation set;
Step 6-2: classify the candidate abbreviations in the candidate abbreviation set on the basis of the constraint function set;
Step 6-3: construct the full-name/abbreviation relation graph and use it to verify each candidate abbreviation in the candidate abbreviation set;
Step 6-4: generate a decision tree from the candidate abbreviations' tag types, their classification categories, and the constraint function set; use the decision tree to classify the candidate abbreviations in the candidate abbreviation set; remove the candidates classified as "F" and keep the candidates classified as "T". Class "F" means incorrect, and class "T" means correct;
In step 6-1 above, for each candidate abbreviation Can in the candidate abbreviation set, verify whether Fn and Can satisfy the constraints of axioms 1-4; if they do not, the candidate abbreviation is incorrect;
In step 6-2 above, the classification proceeds as follows: according to whether the abbreviation contains different characters or a different character order, abbreviations are divided into the ordinary type, the different-character type, and the different-order type. The ordinary type is further divided, according to context dependence, into the strongly context-independent type, the weakly context-independent type, and the context-dependent type. The context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of Fn in the full-name set, and the context-dependent type is divided into the forward type, the centered type, and the backward type according to the center of gravity of An's coverage of Fn;
The specific classification criteria, and the conditions that each type of abbreviation must satisfy, are:
High-frequency strongly context-independent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, every word segment of Fn has a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set;
Low-frequency strongly context-independent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, every word segment of Fn has a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set;
High-frequency weakly context-independent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, most word segments of Fn have a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set;
Low-frequency weakly context-independent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, most word segments of Fn have a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set;
Forward context-dependent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, and the word segments omitted from Fn lie mostly in the second half of Fn;
Centered context-dependent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, and the numbers of word segments omitted from the front part and the rear part of Fn are similar;
Backward context-dependent type, intuitive meaning: Fn contains all the characters of Can with their order unchanged, and the word segments omitted from Fn lie mostly in the first half of Fn;
Different-order type, intuitive meaning: Fn contains all the characters of Can but the character order is changed, and Can has the highest frequency in the candidate abbreviation set;
Different-character type, intuitive meaning: Fn does not contain all the characters of Can, but the frequency of Can is very high or its relative frequency in the candidate abbreviation set is very high;
In step 6-3 above, when the input is a single full name, or when the number of full names in the input full-name document is less than 1000, this step is not executed; otherwise, the full-name/abbreviation relation graph FAG = (F, A, E, f) is constructed according to the construction method for the full-name/abbreviation relation graph;
The verification using the full-name/abbreviation relation graph is as follows:
[formula images: the conditions on the vertices v_i and v_k]
if the abbreviation type of v_i is not a context-independent type, then candidate abbreviation v_i is incorrect for full name v_k;
The verification above yields the abbreviation set of the known full name; the abbreviations in the set are then sorted;
Abbreviations of the same class in the abbreviation set are sorted according to the priority synthesis function PRI(Fn, Can);
PRI(Fn, Can) is defined as follows:
[formula image]
where the coefficients in the formula are the weights assigned to each function in the comprehensive evaluation.
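Since the formula itself is not reproduced in this text, the following is only a sketch under a stated assumption: that PRI combines the constraint function values as a weighted sum, using one weight per function. The names `pri` and `rank_candidates`, the weights, and the example values are all hypothetical.

```python
def pri(values, weights):
    """Weighted sum of constraint function values (assumed form of PRI)."""
    if len(values) != len(weights):
        raise ValueError("one weight per constraint function value is required")
    return sum(w * v for w, v in zip(weights, values))

def rank_candidates(scored, weights):
    """Sort candidates of one class by descending priority.

    scored maps each candidate abbreviation to its list of
    constraint function values.
    """
    return sorted(scored, key=lambda can: pri(scored[can], weights), reverse=True)
```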
10. The method for acquiring Chinese abbreviations from Web pages according to claim 8 or 9, characterized in that the constraint function set is specified as follows:
Constraint function 1: the ratio of characters in Can that come from Fn
Each Chinese character in Can should come from Fn; within the candidate abbreviation set, a candidate abbreviation with a higher ratio of characters that appear in Fn has higher priority;
The formal definition and calculation of constraint function 1 are as follows:
[formula image]
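The quantity described by constraint function 1 can be sketched as a simple character ratio. The exact formula is not reproduced in this text, so the function below (with the hypothetical name `char_ratio_from_fn`) is an assumed reading of the prose: the share of Can's characters that also occur in Fn.

```python
def char_ratio_from_fn(fn, can):
    """Fraction of the characters of can that also appear in fn."""
    if not can:
        return 0.0
    return sum(ch in fn for ch in can) / len(can)
```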
Constraint function 2: the character order of Fn and Can
The characters in Can are arranged strictly in the order in which they occur in Fn;
The formal definition and calculation of constraint function 2 are as follows:
[formula image]
That Fn and Can have the same character order implies that all the characters in Can appear in Fn; if Can contains a character that does not appear in Fn, the value of constraint function 2 is 0;
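One way to check the order condition is a subsequence test. The sketch below assumes a binary outcome (1 when Can's characters occur in Fn in the same order, 0 otherwise); the original formula may grade the result differently, and the function name `same_order` is hypothetical.

```python
def same_order(fn, can):
    """Return 1 if can is a subsequence of fn, else 0."""
    remaining = iter(fn)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters must be found further to the right in fn.
    return int(all(ch in remaining for ch in can))
```

A character of Can that never appears in Fn exhausts the iterator without a match, so the result is 0, in line with the last sentence of the constraint.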
Constraint function 3: the word-segment coverage rate of Can over Fn
The more word segments of the full name a candidate abbreviation covers, the more likely it is to be a correct abbreviation;
The formal definition and calculation of constraint function 3 are as follows:
[formula image]
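A minimal sketch of the coverage rate, assuming that a word segment of Fn counts as covered when it shares at least one character with Can. That reading of "cover", the function name, and the pre-segmented input are assumptions, since the formula is not reproduced in this text.

```python
def segment_coverage(fn_segments, can):
    """Fraction of Fn's word segments that share a character with can."""
    if not fn_segments:
        return 0.0
    covered = sum(any(ch in can for ch in seg) for seg in fn_segments)
    return covered / len(fn_segments)
```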
Constraint function 4: the center of gravity of Can's word-segment coverage of Fn
A full name usually consists of several word segments, and in some cases one or more of them may be omitted in a candidate abbreviation. The omitted segments should, however, be evenly distributed over the full name rather than all concentrated in its front part or its rear part; that is, the segments omitted from Fn should be spread over the front, middle, and rear parts of Fn;
The formal definition and calculation of constraint function 4 are as follows:
[formula image]
where [formula image] corresponds to [formula image].
If the word segments of Fn covered by Can_1 are evenly distributed over Fn, while the word segments of Fn covered by Can_2 all lie in the first half of Fn, then according to constraint function 4, [formula image comparing the two function values], so the priority of Can_1 is higher than the priority of Can_2;
Constraint function 5: the longest run of consecutive word segments of Fn not covered by Can
A candidate abbreviation is usually composed of several word segments, and in some cases one or more word segments of the full name may be omitted in the abbreviation. The omitted segments, however, usually do not occur consecutively in the full name; that is, consecutive word segments of the full name are unlikely to be omitted from the abbreviation;
The formal definition and calculation of constraint function 5 are as follows:
[formula image]
where N denotes the number of uncovered word-segment strings contained in Fn
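The raw quantity named in constraint function 5 can be computed with a single pass over Fn's word segments. The sketch returns only the length of the longest uncovered run, not the value of the constraint function itself, whose formula is not reproduced here; a segment is again assumed to be covered when it shares a character with Can.

```python
def longest_uncovered_run(fn_segments, can):
    """Length of the longest run of consecutive segments not covered by can."""
    longest = current = 0
    for seg in fn_segments:
        if any(ch in can for ch in seg):
            current = 0                      # a covered segment ends the run
        else:
            current += 1
            longest = max(longest, current)
    return longest
```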
Constraint function 6: the length relation between Fn and Can
The length of the full name corresponding to a candidate abbreviation is usually 1.5 to 5 times the length of the candidate abbreviation; a full-name length outside this range is less likely;
The formal definition and calculation of constraint function 6 are as follows:
[formula image]
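The stated range can be checked directly. The sketch below only tests whether the length ratio falls within 1.5 to 5; treating both bounds as inclusive and measuring length in characters are assumptions, and the original formula may map the ratio to a graded value instead.

```python
def length_ratio_in_range(fn, can, low=1.5, high=5.0):
    """True if len(fn) / len(can) lies within [low, high]."""
    if not can:
        return False
    return low <= len(fn) / len(can) <= high
```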
Constraint function 7: the frequency with which Can occurs in GoogleArchSet(Fn)
When abbreviations are searched for on Google by full name, a candidate abbreviation that occurs more frequently in GoogleArchSet(Fn) has higher priority;
The formal definition and calculation of constraint function 7 are as follows:
[formula image]
When An is searched for by Fn, several candidate abbreviations are sometimes obtained; they form the candidate abbreviation set CanSet(Fn). For any candidate abbreviation Can_i in CanSet(Fn), the analysis of FA(Fn, Can_i) can refer by analogy to the corresponding values of the other candidate abbreviations in CanSet(Fn);
The following four constraint functions are defined on the candidate abbreviation set;
Constraint function 8: the relative ratio of characters in Can that come from Fn
Compared with constraint function 1, constraint function 8 emphasizes the relativity of the candidate abbreviation within CanSet(Fn);
The formal definition and calculation of constraint function 8 are as follows:
[formula image]
Constraint function 9: the relative coverage rate over Fn within the candidate abbreviation set of Fn
Compared with constraint function 3, constraint function 9 emphasizes the relativity of Can within CanSet(Fn): among the candidate abbreviations, one whose coverage rate of the full name is relatively high has higher priority;
The formal definition and calculation of constraint function 9 are as follows:
[formula image]
Constraint function 10: the frequency of Can within the candidate abbreviation set
When Can is searched for by Fn, the frequencies of all the candidate abbreviations in the candidate abbreviation set are sometimes very low, which weakens the effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate abbreviation: within the candidate abbreviation set, a candidate abbreviation with a relatively higher frequency has higher priority;
The formal definition and calculation of constraint function 10 are as follows:
[formula image]
Constraint function 11: the relative position of Can after the elements of the candidate abbreviation set are sorted in ascending order of frequency. When the candidate abbreviation set has many elements, candidates with lower frequency are relatively less important;
The formal definition and calculation of constraint function 11 are as follows:
[formula image]
A candidate abbreviation with a lower value of constraint function 11 is less important;
The constraint axiom set is specified as follows:
Constraint axiom 1: character-length inequality axiom
Formal representation:
[formula image]
Intuitive meaning: in a full-name/abbreviation relation, the number of characters in Fn must be greater than the number of characters in Can;
Constraint axiom 2: declarative mood axiom
Formal representation:
[formula image]
Intuitive meaning: Fn and Can do not contain interrogative words;
Constraint axiom 3: form non-repetition axiom
Formal representation:
[formula image]
Intuitive meaning: in a full-name/abbreviation relation, neither Fn nor Can may be a Chinese character string of the form ss, where s is a Chinese character string;
Constraint axiom 4: semantic non-repetition axiom
Formal representation:
[formula image]
Intuitive meaning: every Chinese character that appears in Fn must occur in Fn at least as many times as it occurs in Can;
Constraint axiom 5: non-generality axiom
Formal representation:
[formula image]
Intuitive meaning: the number of full names corresponding to a candidate abbreviation should be less than or equal to 5.
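Four of the five axioms can be checked mechanically from their intuitive meanings. The sketch below is written from those prose descriptions, not from the formal representations, which are not reproduced in this text; the function names are hypothetical. Axiom 2 is omitted because it would need a list of interrogative words that the claim does not give.

```python
from collections import Counter

def axiom1_length(fn, can):
    """Axiom 1: Fn must have more characters than Can."""
    return len(fn) > len(can)

def is_ss_form(s):
    """True if s is some string repeated exactly twice."""
    half = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s[:half] == s[half:]

def axiom3_no_repeat_form(fn, can):
    """Axiom 3: neither Fn nor Can may have the form ss."""
    return not is_ss_form(fn) and not is_ss_form(can)

def axiom4_char_counts(fn, can):
    """Axiom 4: each character of Fn occurs in Fn at least as often as in Can."""
    fn_counts, can_counts = Counter(fn), Counter(can)
    return all(fn_counts[ch] >= can_counts[ch] for ch in fn_counts)

def axiom5_not_generic(full_names_of_can, limit=5):
    """Axiom 5: a candidate may correspond to at most `limit` full names."""
    return len(full_names_of_can) <= limit
```

A candidate that fails any of these checks would be rejected in step 6-1 before classification.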
CN2011102531213A 2011-08-31 2011-08-31 Method for acquiring shortened form in Chinese from Web page Pending CN102955819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102531213A CN102955819A (en) 2011-08-31 2011-08-31 Method for acquiring shortened form in Chinese from Web page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102531213A CN102955819A (en) 2011-08-31 2011-08-31 Method for acquiring shortened form in Chinese from Web page

Publications (1)

Publication Number Publication Date
CN102955819A true CN102955819A (en) 2013-03-06

Family

ID=47764630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102531213A Pending CN102955819A (en) 2011-08-31 2011-08-31 Method for acquiring shortened form in Chinese from Web page

Country Status (1)

Country Link
CN (1) CN102955819A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956192A (en) * 2016-06-15 2016-09-21 中国互联网络信息中心 Method and system for acquiring shortened form of organization name based on website homepage information
CN107577655A (en) * 2016-07-05 2018-01-12 北京国双科技有限公司 Name acquiring method and apparatus
CN110502685A (en) * 2019-08-02 2019-11-26 阿里巴巴集团控股有限公司 A kind of data optimization methods based on search engine, device and equipment
CN113220863A (en) * 2021-07-07 2021-08-06 企查查科技有限公司 Extraction method, device and storage medium for company effective abbreviation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANG JIANG: ""A General Approach to Extracting Full Names and Abbreviations for Chinese Entities from the Web"", 《 INTELLIGENT IFIP ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGY》 *
GUOGANG TIAN: ""MFC: A Method of Co-referent Relation Acquisition from Large-Scale Chinese Corpora"", 《LECTURE NOTES IN COMPUTER SCIENCE》 *

Similar Documents

Publication Publication Date Title
CN109271626B (en) Text semantic analysis method
CN108763333B (en) Social media-based event map construction method
CN105426539B (en) A kind of lucene Chinese word cutting method based on dictionary
Alzahrani et al. Fuzzy semantic-based string similarity for extrinsic plagiarism detection
Zhang et al. An empirical study of TextRank for keyword extraction
CN105868313A (en) Mapping knowledge domain questioning and answering system and method based on template matching technique
CN104239286A (en) Method and device for mining synonymous phrases and method and device for searching related contents
CN102254014A (en) Adaptive information extraction method for webpage characteristics
CN101093478A (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN109614620B (en) HowNet-based graph model word sense disambiguation method and system
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN104598441B (en) A kind of method that computer splits Chinese sentence
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN102955819A (en) Method for acquiring shortened form in Chinese from Web page
CN104346382B (en) Use the text analysis system and method for language inquiry
CN111221976A (en) Knowledge graph construction method based on bert algorithm model
Celebi et al. Segmenting hashtags using automatically created training data
Rondon et al. Never-ending multiword expressions learning
CN103544167A (en) Backward word segmentation method and device based on Chinese retrieval
CN102955818A (en) Method for acquiring full names in Chinese from Web page
Zamin et al. A statistical dictionary-based word alignment algorithm: An unsupervised approach
Tissot et al. Fast phonetic similarity search over large repositories
Sinha et al. Hindi-English language identification, named entity recognition and back transliteration: shared task system description
Saleh et al. Semantic kernels for semantic parsing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130306