CN102955819A - Method for acquiring shortened form in Chinese from Web page - Google Patents
- Publication number
- CN102955819A CN2011102531213A CN201110253121A
- Authority
- CN
- China
- Prior art keywords
- short
- called
- candidate
- abbreviation
- constraint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to a method for acquiring Chinese abbreviations from Web pages. The method comprises: inputting a known full name; selecting a query pattern and constructing a query term; submitting the query term to Google to obtain anchor texts; extracting from the anchor texts a corpus of sentences containing full name-abbreviation relations; extracting candidate abbreviations with an extraction algorithm; and ranking the candidate abbreviations with a combined priority function. Three query patterns are used, with two corresponding extraction algorithms. The invention also defines full name-abbreviation relation constraints, comprising a set of constraint axioms, which express the constraints between a full name and an abbreviation qualitatively, and a set of constraint functions, which express them quantitatively; based on these constraints, a classification method for full name-abbreviation pairs is provided. The invention further defines a full name-abbreviation relation graph and provides a joint verification method based on that graph and the relation constraints.
Description
Technical field
The present invention relates to abbreviation acquisition technology in the fields of Chinese information processing and information retrieval, and in particular to a method for acquiring Chinese abbreviations from Web pages that is multidisciplinary, large-scale, and highly accurate.
Background
Natural language processing is a major topic in computer science and artificial intelligence. It studies theories and methods that enable efficient communication between people and computers in natural language. With the widespread use of computers and the Internet, the amount of natural language text available to computers has grown at an unprecedented rate, and demand is growing rapidly for applications such as text mining over massive information, information extraction, cross-language information processing, and human-computer interaction. The object of natural language processing has accordingly shifted from small-scale restricted language to large-scale real text, and this research will have a far-reaching influence on people's lives.
Chinese information processing studies how to use computers to process Chinese information automatically. Chinese is a paratactic language: compared with Western languages it lacks explicit markers, and its grammar, semantics, and pragmatics are more flexible, which makes it harder for computers to understand and process. Many difficulties remain to be overcome before computers can process Chinese information. At present, Chinese information processing has achieved some results in fields such as speech recognition, word segmentation, and machine translation. Greater automation of Chinese information processing will bring considerable benefits to China's science and technology, culture, economy, and security.
Information retrieval studies techniques for obtaining needed information quickly and accurately from large volumes of complex information. After many years of development, information retrieval technology is now quite mature, and new retrieval techniques are developing toward intelligence, mobility, diversification, and personalization.
A full name (Full Name, Fn) is the complete form of a name. An abbreviation (Abbreviation, An) is the form obtained by simplifying and compressing the full name for the sake of brevity. If Fn and An stand in a full name-abbreviation relation, Fn is called the full name of An and An is called the abbreviation of Fn, written FA(Fn, An). Going from a full name to an abbreviation can be regarded as compressing information, and going from an abbreviation to a full name as decompressing it. For example, compressing c1 = "Inst. of Computing Techn. Academia Sinica" yields c2 = "institute is calculated by the Chinese Academy of Sciences", and compressing c2 again yields c3 = "Computer Department of the Chinese Academy of Science"; decompressing c3 yields c2, and decompressing c2 yields c1. Full name and abbreviation are relative concepts: in the example above, c2 is an abbreviation relative to c1 but a full name relative to c3, so it is meaningless to say in isolation that c2 is a full name or an abbreviation.
Acquiring full name-abbreviation relations is a basic and key problem in applications such as knowledge acquisition from text (Knowledge Acquisition from Text, KAT) and information retrieval. Acquisition methods fall into two broad classes. The first is pattern-based: it mainly uses linguistics and natural language processing techniques, extracts relation patterns through lexical and syntactic analysis, and then obtains full name-abbreviation relations by pattern matching; its accuracy depends on linguistic knowledge and the pattern base. The second is statistics-based: it relies mainly on corpora and statistical language models and obtains relations by computing the degree of association between concepts; its accuracy and efficiency rarely reach a practically satisfactory level. The acquisition problem can also be viewed from two angles. One is mining: obtaining full name-abbreviation pairs without any external input. The other is lookup: finding the abbreviation for a known full name, or the full name for a known abbreviation.
Unless otherwise specified, "full name" and "abbreviation" in the present invention refer to Chinese full names and Chinese abbreviations.
Summary of the invention
To address the limitations and low accuracy of existing techniques for acquiring full name-abbreviation relations, the invention provides a highly accurate method for acquiring Chinese abbreviations from Web pages that is applicable across many disciplines and at very large scale.
To solve the above problem, the invention provides a method for acquiring Chinese abbreviations from Web pages, comprising the steps of:
Step 1: input a given Chinese full name Fn;
Step 2: select a query pattern and construct a query term, submit the query term to the Google search engine, and save the first N anchor texts as the anchor corpus;
Step 3: using regular expressions, extract from the anchor corpus the sentences that contain a full name-abbreviation relation for the query term, and save them as the full name-abbreviation corpus;
Step 4: use the abbreviation extraction algorithm EAN to extract candidate abbreviations from the full name-abbreviation corpus, forming the candidate abbreviation set;
Step 5: classify the candidate abbreviation set based on the full name-abbreviation relation constraints, forming a candidate abbreviation set with type labels;
Step 6: jointly verify the candidate abbreviation set based on the full name-abbreviation relation constraints and the full name-abbreviation relation graph, forming the abbreviation set;
Step 7: rank abbreviations of the same type within the abbreviation set by priority, forming an ordered abbreviation set with type labels.
In the above technical solution, step 2 uses three query patterns: query pattern 1, "Fn abbreviation"; query pattern 2, "Fn* abbreviation"; and query pattern 3, "full name Fn". Query pattern 2 extends query pattern 1 by adding a "*" between "Fn" and "abbreviation"; in a Google query, "*" can match any single word. Web pages often contain text such as "sinus rhythm (hereinafter to be referred as hole rule)", which query pattern 1 cannot retrieve but query pattern 2 can. In an experiment with 4000 Chinese full names, An could be obtained for 64.65% of them with query pattern 1, 61.18% with query pattern 2, 21.02% with query pattern 3, 82.51% with query pattern 1 or query pattern 2, and 84.10% with query patterns 1, 2, and 3 together. To improve query efficiency, query pattern 1 is therefore tried first, then query pattern 2, and finally query pattern 3.
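As a minimal sketch of how the three query patterns might be assembled in their stated priority order, the Python below builds the query strings for a given Fn. The function name, the keyword strings `abbr_kw` and `full_kw`, and the example full name are illustrative assumptions; the text specifies only the shape of each pattern and the order in which they are tried.

```python
def build_queries(fn, abbr_kw="简称", full_kw="全称"):
    """Return (pattern_id, query) pairs in the priority order 1, 2, 3.

    abbr_kw and full_kw are assumed Chinese keywords for "abbreviation"
    and "full name"; the exact keywords are not given in the text.
    """
    return [
        (1, f"{fn}{abbr_kw}"),   # query pattern 1: "Fn abbreviation"
        (2, f"{fn}*{abbr_kw}"),  # query pattern 2: "*" matches any one word
        (3, f"{full_kw}{fn}"),   # query pattern 3: "full name Fn"
    ]


# Illustrative full name, not taken from the patent's experiments.
queries = build_queries("北京大学")
```

Keeping the patterns in one ordered list makes the "pattern 1 first, then 2, then 3" preference explicit and easy to iterate over.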
In the above technical solution, the abbreviation extraction algorithm (EAN) of step 4 comprises two algorithms, CAEA1 and CAEA2. When query pattern 1 or query pattern 2 is selected in step 2, CAEA1 is used to extract An in step 4; when query pattern 3 is selected in step 2, CAEA2 is used.
In the above technical solution, if the abbreviation set obtained in step 6 is empty and a query pattern remains untried in step 2, steps 2 to 7 are executed again. If the abbreviation set is empty and no query pattern remains, the method exits, indicating that no abbreviation of the given full name could be found on the Web.
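The fallback logic described above can be sketched as a loop over the query patterns. This is only an outline under stated assumptions: `search`, `verify`, and the entries of `extractors` are hypothetical callables standing in for steps 2-3, steps 5-6, and CAEA1/CAEA2 respectively, and the stub implementations exist purely to make the example runnable.

```python
def acquire_abbreviations(fn, search, extractors, verify):
    """Try query patterns 1, 2, 3 in order until verification yields a result.

    search(fn, pattern) -> corpus            (hypothetical, steps 2-3)
    extractors[pattern](fn, corpus) -> list  (hypothetical, step 4)
    verify(fn, candidates) -> list           (hypothetical, steps 5-6)
    """
    for pattern in (1, 2, 3):
        corpus = search(fn, pattern)
        candidates = extractors[pattern](fn, corpus)
        verified = verify(fn, candidates)
        if verified:
            return pattern, verified  # step 7 would rank these
    return None, []                   # no abbreviation found on the Web


# Stub callables for illustration only.
caea1 = lambda fn, corpus: [c for c in corpus if c != fn]
caea2 = lambda fn, corpus: list(corpus)
extractors = {1: caea1, 2: caea1, 3: caea2}  # CAEA1 for patterns 1-2, CAEA2 for 3
fake_search = lambda fn, pattern: ["AB"] if pattern == 2 else []
result = acquire_abbreviations("ABCD", fake_search, extractors, lambda fn, cs: cs)
```

Passing the extractors as a mapping keyed by pattern number keeps the CAEA1/CAEA2 selection rule in one place.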
In the above technical solution, the full name-abbreviation relation constraint of step 6 is a four-tuple R = (Fn, An, F, A), where Fn is the full name, An is the abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Both kinds of constraint are explained further below.
In the above technical solution, the full name-abbreviation relation graph FAG (Fullname and Abbreviation Graph) of step 6 is a four-tuple FAG = (F, A, E, f), where $F$ is the full name set, $A$ is the abbreviation set, $F \cup A$ is the vertex set, $E$ is the undirected edge set, and $f$ is a mapping from $E$ to $F \times A$: for every $e \in E$ there exist vertices $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$, that is, $e$ is the undirected edge connecting $fn$ and $an$.
Beneficial effects: the invention obtains the abbreviation corresponding to a known full name from the Web, that is, it acquires full name-abbreviation relations from the lookup angle. It uses a pattern-based method to obtain candidate abbreviations from Google and a statistics-based method to verify them. The method is multidisciplinary, large-scale, and highly accurate, and it also explores the computer-implemented classification of abbreviations, providing effective support for the intelligent acquisition of large-scale knowledge.
Description of drawings
Fig. 1 is an example of a full name-abbreviation relation graph;
Fig. 2 is a flowchart of acquiring abbreviations with query pattern 1 or query pattern 2;
Fig. 3 is a flowchart of acquiring abbreviations with query pattern 3;
Fig. 4 is a flowchart of jointly verifying the candidate abbreviation set;
Fig. 5 is the verification decision tree generated from the full name-abbreviation types and the constraint function set.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments:
Before the method of the invention is described, the formation rules and word-formation patterns of abbreviations in full name-abbreviation relations are first summarized. In such a relation, going from the full name to the abbreviation can be regarded as compressing information, and this compression sometimes involves semantically equivalent substitution and reordering of words. Full name-abbreviation relations are therefore divided into the plain type, the different-character type, and the different-order type.
Plain type: every character in the abbreviation appears in the full name and keeps its order in the full name, for example, Fn = "People's Republic of China (PRC)", An = "China";
Different-character type: some characters in the abbreviation do not appear in the full name, that is, going from the full name to the abbreviation involves not only compression of information but also semantically equivalent substitution, for example, Fn = "Wa Huang Shengmumiao", An = "Chinese mythology goddess mausoleum";
Different-order type: the order of the characters in the abbreviation differs from the order of their corresponding parts in the full name, for example, Fn = "Harbin the 6th pharmaceutical factory", An = "breathes out medicine six factories".
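The three types can be approximated with simple character-level tests, as in the sketch below. It assumes that "keeps its order" means the abbreviation is an ordered subsequence of the full name; the function names and the example strings are illustrative and not part of the patent.

```python
def is_subsequence(short, long):
    """True if the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def classify(fn, an):
    """Classify a full name-abbreviation pair by character-level tests."""
    if any(ch not in fn for ch in an):
        return "different-character"  # some character of An is absent from Fn
    if is_subsequence(an, fn):
        return "plain"                # all characters present, order kept
    return "different-order"          # all characters present, order changed


# Illustrative inputs; the Latin strings are placeholders for Chinese text.
kind_plain = classify("北京大学", "北大")
kind_char = classify("ABCD", "AX")
kind_order = classify("ABCD", "CA")
```

Checking for foreign characters first matters, because an abbreviation with a character missing from Fn can never be an ordered subsequence of it.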
The definitions relating to the full name-abbreviation relation graph and the full name-abbreviation relation constraints are introduced in detail below.
A batch of full name-abbreviation pairs forms a bipartite graph, constructed as follows: all full names form the full name set $F$, all abbreviations form the abbreviation set $A$, and $F$ and $A$ together form the vertex set $F \cup A$ of the graph; for $fn \in F$ and $an \in A$, if fn and an form a full name-abbreviation pair, an undirected edge connecting fn and an is constructed.
The invention defines the full name-abbreviation relation graph to express the connection between Fn and An. The full name-abbreviation relation graph FAG (Fullname and Abbreviation Graph) is a four-tuple FAG = (F, A, E, f), where $F$ is the full name set, $A$ is the abbreviation set, $F \cup A$ is the vertex set, $E$ is the undirected edge set, and $f$ is a mapping from $E$ to $F \times A$: for every $e \in E$ there exist vertices $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$, that is, $e$ is the undirected edge connecting $fn$ and $an$.
Fig. 1 shows a full name-abbreviation relation graph together with its full name set and abbreviation set.
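Given a full name-abbreviation relation graph FAG = (F, A, E, f), for every $e \in E$ there exist $fn \in F$ and $an \in A$ such that $f(e) = (fn, an)$; the vertices $fn$ and $an$ are said to be incident with the edge $e$, and the vertices $fn$ and $an$ are said to be adjacent.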
Given a full name-abbreviation relation graph FAG = (F, A, E, f), for any vertex $v \in F \cup A$, all vertices adjacent to $v$ form the adjacent point set of $v$, denoted Adj($v$), and the number of vertices adjacent to $v$ is called the degree of $v$.
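A minimal sketch of the graph as a data structure follows, assuming adjacency sets are an acceptable representation. The class name, method names, and the "F"/"A" side tags are illustrative choices; the side tags are there because the same string can be a full name in one pair and an abbreviation in another.

```python
from collections import defaultdict


class FAG:
    """Bipartite full name-abbreviation graph with undirected edges."""

    def __init__(self):
        self.full_names = set()       # F
        self.abbreviations = set()    # A
        self.edges = set()            # E, stored as (fn, an) pairs
        self._adj = defaultdict(set)  # adjacency sets keyed by (side, vertex)

    def add_pair(self, fn, an):
        """Add the undirected edge connecting fn in F and an in A."""
        self.full_names.add(fn)
        self.abbreviations.add(an)
        self.edges.add((fn, an))
        self._adj[("F", fn)].add(an)
        self._adj[("A", an)].add(fn)

    def adj(self, side, vertex):
        """Adjacent point set of a vertex; side is "F" or "A"."""
        return set(self._adj.get((side, vertex), set()))

    def degree(self, side, vertex):
        """Number of vertices adjacent to the given vertex."""
        return len(self.adj(side, vertex))


# Placeholder vertex names for illustration.
g = FAG()
g.add_pair("fn1", "a1")
g.add_pair("fn2", "a1")
g.add_pair("fn1", "a2")
```

With this layout, the degree of an abbreviation vertex is the number of full names it is attached to, which is the quantity a check for overly generic abbreviations would need.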
The invention defines the full name-abbreviation relation constraint to express the constraints between Fn and An. It is a four-tuple R = (Fn, An, F, A), where Fn is the full name, An is the abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy. The constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively. Before the two sets are described in detail, the basic symbols used below are listed:
Fn denotes the full name;
An denotes the abbreviation of Fn;
Can denotes a candidate abbreviation of Fn;
GoogleArchSet(Fn) denotes the Google anchor text set of Fn, that is, the set of the first 100 anchor texts returned when Google is searched for the abbreviation corresponding to Fn; if the total number N of anchor texts returned is less than 100, GoogleArchSet(Fn) contains only those N anchor texts;
CanSet(Fn) denotes the candidate abbreviation set of Fn, that is, the set of candidate abbreviations of Fn extracted from GoogleArchSet(Fn);
N_CanSet(Fn) denotes the number of candidate abbreviations in CanSet(Fn);
FnSet(Can) denotes the full name set corresponding to the candidate abbreviation Can, that is, the candidate abbreviation set of every Fn in FnSet(Can) contains Can;
N_FnSet(Can) denotes the number of full names in FnSet(Can);
FA(Fn, An) denotes that Fn and An stand in a full name-abbreviation relation;
length(str) denotes the number of Chinese characters in the Chinese character string str;
N_word(Fn, An) denotes the number of Chinese characters that appear in both Fn and An;
N_Clas(Fn) denotes the number of segments (words) obtained after word segmentation of Fn;
N_Cover(Fn, An) denotes the number of segments of Fn covered by An;
CoverSet(Fn, An) denotes the set of segments of Fn covered by An;
$p_i$ denotes the i-th segment of the full name;
$p_1/p_2/\dots/p_m$ denotes the segmentation sequence formed by the segments $p_1, p_2, \dots, p_m$, where "/" is the separator between segments;
centre(Fn) denotes the position of the segment center point of Fn, that is, after Fn is segmented, the position of the middle segment or the mean position of the two middle segments: centre(Fn) = (N_Clas(Fn) + 1)/2;
$d_i$(Fn) denotes the center offset of the i-th segment $p_i$ of Fn, that is, the displacement between the position of the i-th segment of Fn and the segment center point of Fn: $d_i$(Fn) = i − centre(Fn);
The maximum center offset of Fn is the maximum of the center offsets of all segments of Fn and equals (N_Clas(Fn) − 1)/2;
$Len_i$(Fn, An) denotes the number of segments in the i-th uncovered segment string. After Fn is segmented, the segments not covered by An form uncovered segment strings: uncovered segments that are adjacent in Fn are joined into one string, and an uncovered segment with no uncovered neighbor forms a string on its own. The number of segments in the i-th uncovered segment string is denoted $Len_i$(Fn, An);
freq(Fn, An) denotes the number of occurrences of An extracted from GoogleArchSet(Fn);
loca(Fn, Can) denotes the frequency rank of Can in CanSet(Fn), that is, the position of Can after the elements of CanSet(Fn) are sorted in ascending order of freq(Fn, Can);
NoInclude(s1, Set) denotes that no Chinese character string in the set Set of Chinese character strings is a substring of the Chinese character string s1;
Interrogative denotes the set of interrogative words, including interrogatives such as "what" and "how";
concat(s1, s2) denotes the Chinese character string obtained by concatenating the Chinese character strings s1 and s2;
NumIn(s, c) denotes the number of times the Chinese character c occurs in the Chinese character string s.
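The positional symbols can be checked with a short worked example. The sketch below implements centre(Fn), the center offsets $d_i$(Fn), and NumIn exactly as stated above, taking the number of segments N_Clas(Fn) as input; the function names are illustrative, and the six-segment string is the segmentation example used later in the text.

```python
def centre(n_clas):
    """centre(Fn) = (N_Clas(Fn) + 1) / 2, with segments numbered from 1."""
    return (n_clas + 1) / 2


def centre_offsets(n_clas):
    """d_i(Fn) = i - centre(Fn) for i = 1 .. N_Clas(Fn)."""
    c = centre(n_clas)
    return [i - c for i in range(1, n_clas + 1)]


def num_in(s, c):
    """NumIn(s, c): number of times character c occurs in string s."""
    return s.count(c)


segments = "China/Guizhou/aviation/industry/group/company".split("/")
offsets = centre_offsets(len(segments))
```

For six segments the center lies between the third and fourth segments, so the offsets are symmetric around zero and the largest one equals (N_Clas(Fn) − 1)/2.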
The constraint functions in the constraint function set are described below from 11 aspects:
Constraint function 1: the proportion of characters in Can that come from Fn.
In general, the full name contains all the Chinese characters of the candidate abbreviation. For example, with Can = "Beijing University" and Fn = "Peking University", every Chinese character in Can comes from Fn. Within the candidate abbreviation set, a candidate with a higher proportion of characters appearing in Fn has higher priority.
Constraint function 1 is formally defined and calculated as follows (note: this function is an improvement on the one in the invention patent "Method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
For example, Fn = "Confucian Temple", $Can_1$ = "Confucian temple", $Can_2$ = "Confucian temple". According to constraint function 1, $Can_1$ has a higher priority than $Can_2$.
Constraint function 2: the character order of Fn and Can.
During abbreviation, most candidate abbreviations preserve the character order of the full name. For example, with Fn = "Olympic Games" and Can = "Olympic Games", the three characters of Can appear strictly in the order in which they occur in Fn.
Constraint function 2 is formally defined and calculated as follows (note: this function is the same as the one in the invention patent "Method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
Note: Fn and Can having the same character order implies that every character of Can appears in Fn; if Can contains a character that does not appear in Fn, the value of constraint function 2 is 0.
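Since the formulas themselves are not reproduced here, the following Python is only a sketch of what constraint functions 1 and 2 plausibly compute from their verbal descriptions. It assumes function 1 is the share of Can's characters that occur in Fn, and that function 2 takes the value 1 when Can is an ordered subsequence of Fn and 0 otherwise; both forms, the function names, and the example strings are assumptions.

```python
def char_ratio(fn, can):
    """Assumed form of constraint function 1:
    share of Can's characters that are found in Fn."""
    if not can:
        return 0.0
    return sum(1 for ch in can if ch in fn) / len(can)


def order_preserved(fn, can):
    """Assumed form of constraint function 2: 1 if Can is an ordered
    subsequence of Fn, else 0 (including when Can has a character
    that does not appear in Fn)."""
    remaining = iter(fn)
    return 1 if all(ch in remaining for ch in can) else 0


# Illustrative inputs; the Latin strings are placeholders for Chinese text.
ratio_full = char_ratio("北京大学", "北大")
ratio_half = char_ratio("ABCD", "AX")
```

The subsequence test automatically returns 0 when Can contains a character absent from Fn, which matches the note on constraint function 2.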
Constraint function 3: the segment coverage rate of Can over Fn
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the candidate abbreviation, but the omitted segments generally do not exceed half of the segments of the full name. The more segments of the full name a candidate abbreviation covers, the more likely it is to be the correct abbreviation.
Constraint function 3 is formally defined and calculated as follows (note: this function is an improvement on the one in the invention patent "Method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
For example, Fn = "Shanghai/traffic/university", $Can_1$ = "submitting large", $Can_2$ = "submitting". According to constraint function 3, $Can_1$ has a higher priority than $Can_2$.
Constraint function 4: the center of gravity of the segments of Fn covered by Can
A full name usually consists of several segments. In some cases one or more segments of the full name are omitted in the candidate abbreviation, but the omitted segments should be evenly distributed across the full name rather than all concentrated in its front part or rear part. For example, with Can = "your boat group" and Fn = "China/Guizhou/aviation/industry/group/company", the omitted segments "China", "industry", and "company" lie in the front, middle, and rear parts of Fn respectively.
Constraint function 4 is formally defined and calculated as follows:
For example, Fn = "China/Guizhou/aviation/industry/group/company", $Can_1$ = "your boat group", $Can_2$ = "your boat". The segments of Fn covered by $Can_1$, "Guizhou", "aviation", and "group", are evenly distributed in Fn, whereas the segments of Fn covered by $Can_2$, "Guizhou" and "aviation", all lie in the first half of Fn. According to constraint function 4, $Can_1$ has a higher priority than $Can_2$.
Constraint function 5: the longest run of consecutive segments of Fn not covered by Can
A candidate abbreviation is usually formed from several segments. In some cases one or more segments of the full name are omitted in the abbreviation, but the omitted segments usually do not occur consecutively in the full name, that is, the probability that consecutive segments of the full name are all omitted in the abbreviation is small.
Constraint function 5 is formally defined and calculated as follows:
where N denotes the number of uncovered segment strings contained in Fn.
For example, Fn = "China/people/republic/common property/doctrine/Communist Youth League", $Can_1$ = "Chinese Communist Youth League", $Can_2$ = "Communist Youth League". The only segments of Fn not covered by $Can_1$ are "people" and "doctrine", whereas the segments of Fn not covered by $Can_2$, "China", "people", and "republic", are contiguous. According to constraint function 5, $Can_1$ has a higher priority than $Can_2$.
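The quantity that constraint function 5 is built on, the lengths $Len_i$ of the uncovered segment strings, can be computed with a single pass over the segments. The sketch below takes a list of booleans marking which segments of Fn are covered by Can, so it makes no assumption about how coverage itself is decided; the function names are illustrative, and how the resulting values enter the constraint function is not shown. The two coverage lists follow the "China/Guizhou/aviation/industry/group/company" example of constraint function 4.

```python
def uncovered_runs(covered):
    """Lengths Len_i of the maximal runs of uncovered segments.

    covered[i] is True when segment i of Fn is covered by Can.
    """
    runs, current = [], 0
    for flag in covered:
        if flag:
            if current:
                runs.append(current)  # a run of uncovered segments just ended
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)          # run reaching the end of Fn
    return runs


def longest_uncovered(covered):
    """Length of the longest uncovered segment string (0 if none)."""
    return max(uncovered_runs(covered), default=0)


# Can_1 covers "Guizhou", "aviation", "group"; Can_2 covers "Guizhou", "aviation".
covered_1 = [False, True, True, False, True, False]
covered_2 = [False, True, True, False, False, False]
runs_1 = uncovered_runs(covered_1)
runs_2 = uncovered_runs(covered_2)
```

Here the first candidate leaves only isolated gaps while the second leaves a gap of three consecutive segments, which is the pattern the constraint penalizes.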
Constraint function 6: the length relation between Fn and Can
A standard candidate abbreviation is usually not shortened excessively, so that most people can understand its meaning from the name. The full name corresponding to most candidate abbreviations therefore has a length within a certain range, generally 1.5 to 5 times the length of the candidate abbreviation; the probability that the full name length falls outside this range is small.
Constraint function 6 is formally defined and calculated as follows (note: this function is an improvement on the one in the invention patent "Method and system for identifying the full name of an entity from its Chinese abbreviation" (patent No. ZL200710119513.4)):
For example, Fn = "Inst. of Computing Techn. Academia Sinica", $Can_1$ = "Computer Department of the Chinese Academy of Science", $Can_2$ = "calculating institute". According to constraint function 6, $Can_1$ has a higher priority than $Can_2$.
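The length relation can be illustrated with a simple ratio check. The sketch below computes length(Fn)/length(Can) and tests it against the 1.5 to 5 range quoted above; treating the range as inclusive, the function names, and the example strings are assumptions, and the actual value assigned by constraint function 6 is not reproduced.

```python
def length_ratio(fn, can):
    """length(Fn) / length(Can), counting characters."""
    return len(fn) / len(can)


def in_typical_range(fn, can, low=1.5, high=5.0):
    """True when the full name is 1.5 to 5 times as long as the candidate
    (bounds assumed inclusive)."""
    return low <= length_ratio(fn, can) <= high


# Illustrative inputs; the Latin strings are placeholders for Chinese text.
ratio = length_ratio("北京大学", "北大")
typical = in_typical_range("北京大学", "北大")
too_short = in_typical_range("ABCDEFGHIJKL", "A")
```

Python's `len` counts Unicode code points, so it matches length(str) for strings made of ordinary Chinese characters.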
Constraint function 7: the frequency with which Can occurs in GoogleArchSet(Fn)
When Google is searched for the abbreviations of a full name, candidate abbreviations that occur more frequently in GoogleArchSet(Fn) have higher priority.
Constraint function 7 is formally defined and calculated as follows:
For example, Fn = "lithium ion battery", $Can_1$ = "lithium battery", $Can_2$ = "lithium electricity", freq(Fn, $Can_1$) = 42, and freq(Fn, $Can_2$) = 12. According to constraint function 7, $Can_1$ has a higher priority than $Can_2$.
When An is looked up from Fn, several candidate abbreviations are sometimes obtained; they form the candidate abbreviation set CanSet(Fn). For any candidate $Can_i$ in CanSet(Fn), the analysis of FA(Fn, $Can_i$) can take into account, by comparison, the index values of the other candidates in CanSet(Fn).
The following four constraint functions are defined on the candidate abbreviation set.
Constraint function 8: the relative proportion of characters in Can that come from Fn
In contrast to constraint function 1, constraint function 8 emphasizes the relative standing of a candidate abbreviation within CanSet(Fn). For instance, the abbreviations of some transliterated foreign words share no characters with the full name, and some abbreviations undergo synonym substitution when expanded into the full name.
Constraint function 8 is formally defined and calculated as follows:
For example, Fn = "Confucius Temple", $Can_1$ = "Confucian temple", $Can_2$ = "Confucian temple". Although the value of constraint function 1 for $Can_1$ is only 0.5, the value for $Can_2$ is also only 0.5, so $Can_1$ cannot be judged an incorrect candidate abbreviation merely because its value of function 1 is low.
Constraint function 9: the relative coverage rate of Can over Fn within the candidate abbreviation set of Fn
In contrast to constraint function 3, constraint function 9 emphasizes the relative standing of Can within CanSet(Fn). For instance, when none of the candidate abbreviations covers the full name well, the candidates with relatively higher coverage have higher priority.
Constraint function 9 is formally defined and calculated as follows:
For example, Fn = "Tsing-Hua University/with side/CD/share/limited/company", $Can_1$ = "Tsing Hua Tong Fang", $Can_2$ = "company of Tsing Hua Tong Fang". Although neither $Can_1$ nor $Can_2$ has a high segment coverage rate over Fn, the coverage rate of $Can_1$ is relatively higher, so $Can_1$ has a higher priority than $Can_2$.
Constraint function 10: the frequency of Can within the candidate abbreviation set
When Can is looked up from Fn, the frequencies of all candidates in the candidate abbreviation set are sometimes very low, which weakens the effect of constraint function 7. Constraint function 10 therefore considers the relative frequency of each candidate: within the candidate abbreviation set, candidates with relatively higher frequency have higher priority.
Constraint function 10 is formally defined and calculated as follows:
For example, Fn = "office of development for poverty relief leading group of autonomous region", $Can_1$ = "office of poverty alleviation of autonomous region", $Can_2$ = "office of poverty alleviation". Although the frequencies of $Can_1$ and $Can_2$ are both low according to constraint function 7, their frequencies within the candidate abbreviation set are both relatively high according to constraint function 10.
Constraint function 11: the relative position of Can after the elements of the candidate abbreviation set are sorted in ascending order of frequency
When the candidate abbreviation set contains many elements, candidates with lower frequency are relatively less important.
Constraint function 11 is formally defined and calculated as follows:
A candidate abbreviation with a lower value of constraint function 11 is less important.
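The frequency-based quantities that constraint functions 7, 10, and 11 rely on can be sketched as follows. The code counts freq(Fn, Can) from a list of extracted occurrences and derives the ascending-frequency rank loca(Fn, Can); the function names, the tie-breaking rule, and the Chinese example strings (stand-ins for the lithium battery example, with its counts of 42 and 12) are illustrative assumptions, and the normalizations used by the constraint functions themselves are not reproduced.

```python
from collections import Counter


def frequencies(extracted):
    """freq(Fn, Can) for each candidate, from the list of extracted occurrences."""
    return Counter(extracted)


def loca(freq):
    """Rank of each candidate after sorting by ascending frequency (1 = rarest).

    Ties are broken by the candidate string itself, an arbitrary choice.
    """
    ordered = sorted(freq, key=lambda can: (freq[can], can))
    return {can: rank for rank, can in enumerate(ordered, start=1)}


# Illustrative stand-ins for the two candidates in the example.
occurrences = ["锂电池"] * 42 + ["锂电"] * 12
freq = frequencies(occurrences)
ranks = loca(freq)
```

Because the sort is ascending, a larger rank means a more frequent candidate, which is consistent with lower values indicating lower importance.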
More than the concrete meaning of the constraint function constraint function concentrated from 11 aspects be illustrated, they have represented the constraint between Fn and the Can quantitatively, axiom of constraint then represents the constraint between Fn and the Can qualitatively, and the below is specifically described axiom of constraint:
Constraint axiom 1: unequal-length axiom
Formal expression:
Intuitive meaning: in a full name-abbreviation relation, the number of characters in Fn must be greater than the number of characters in Can.
Constraint axiom 2: declarative-mood axiom
Formal expression:
Intuitive meaning: neither Fn nor Can contains interrogative words such as "what" or "how".
Constraint axiom 3: form non-repetition axiom
Intuitive meaning: in a full name-abbreviation relation, neither Fn nor Can may be a Chinese character string of the form ss, where s is a Chinese character string.
Constraint axiom 4: semantic non-repetition axiom
Formal expression:
Intuitive meaning: every Chinese character that appears in Fn must occur in Fn at least as many times as it occurs in Can.
For example, with Fn = "Wa Huang Shengmumiao" and Can = "mausoleum, Chinese mythology goddess mausoleum", the character "mausoleum", which appears in Fn, occurs twice in Can but only once in Fn, so Can is incorrect. This happens because, in the corpus, Can is separated from the following text only by a punctuation mark that is useless as a boundary.
Constraint axiom 5: non-generality axiom
Intuitive meaning: the number of full names to which a candidate abbreviation corresponds should be less than or equal to 5.
For example, among the 4000 full name-abbreviation pairs tested, Can = "company" appears in the candidate abbreviation sets of 24 different Fn. This candidate is therefore a generic abbreviation that carries no useful meaning, and candidates of this kind are discarded.
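The five axioms can be read as simple predicates over the strings Fn and Can. The following Python sketch is only an illustration under stated assumptions: the function names, the interrogative word list and the `full_names_by_candidate` mapping are hypothetical and do not come from the source, which gives only the intuitive meanings.

```python
from collections import Counter

# Hypothetical interrogative list; the source only names a few examples.
INTERROGATIVES = ("什么", "怎么", "如何")

def axiom1_length(fn, can):
    """Axiom 1: Fn must have more characters than Can."""
    return len(fn) > len(can)

def axiom2_declarative(fn, can):
    """Axiom 2: neither string contains an interrogative word."""
    return not any(w in fn or w in can for w in INTERROGATIVES)

def is_ss(s):
    """True if s is some non-empty string repeated exactly twice."""
    half, rem = divmod(len(s), 2)
    return len(s) > 0 and rem == 0 and s[:half] == s[half:]

def axiom3_form(fn, can):
    """Axiom 3: neither Fn nor Can has the form ss."""
    return not is_ss(fn) and not is_ss(can)

def axiom4_semantic(fn, can):
    """Axiom 4: each character of Fn occurs in Fn at least as often as in Can."""
    fn_counts, can_counts = Counter(fn), Counter(can)
    return all(fn_counts[ch] >= can_counts[ch] for ch in fn_counts)

def axiom5_not_generic(can, full_names_by_candidate, limit=5):
    """Axiom 5: Can may correspond to at most `limit` full names."""
    return len(full_names_by_candidate.get(can, ())) <= limit
```

A candidate would be kept only if every predicate returns True; in the "company" example above, a mapping with 24 full names for that candidate makes `axiom5_not_generic` return False.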
Having described in detail the full name-abbreviation relation graph and the full name-abbreviation relation constraints defined in the present invention, we now introduce embodiments of the method.
The method of the present invention for obtaining Chinese abbreviations from a Chinese full name comprises three major steps: obtaining the candidate abbreviation set, verifying the candidate abbreviation set, and post-processing the verified result. Each is described below.
We first discuss obtaining the candidate abbreviation set. Different query patterns yield anchor corpora with different structures, so the algorithms for extracting candidate abbreviations differ. Because query pattern 2 is an extension of query pattern 1, the two share the same extraction method, which differs from the method used with query pattern 3. The two methods are introduced separately below.
As shown in Figure 2, the specific steps for producing the candidate abbreviation set with query pattern 1 or query pattern 2 are as follows:
Step 1-1: the user inputs a known Chinese full name Fn.
Step 1-2: a concrete query term is constructed according to query pattern 1, "Fn abbreviation", or query pattern 2, "Fn* abbreviation".
Step 1-3: the query term is submitted to the Google search engine, and the first N anchor texts are saved as the anchor corpus.
Step 1-4: using regular expressions, the full name-abbreviation sentences that contain the query term are obtained from the anchor corpus and saved as the full name-abbreviation corpus.
Step 1-5: algorithm CAEA1 is used to extract the tagged candidate abbreviation set from the full name-abbreviation corpus.
Step 1-6: the An right-boundary vocabulary is used to re-determine the right boundary of each candidate in the candidate abbreviation set.
In step 1-1 above, a document containing a batch of full names may also be input; in that case steps 1-2 to 1-6 are repeated for each Fn in the document to obtain its candidate abbreviation set.
In step 1-3 above, if Google returns more than 100 results, N is 100; otherwise N is the number of results Google returns.
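Steps 1-2 and 1-3 can be sketched as two small helpers. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the literal keyword 简称 ("abbreviation") is an assumption about the Chinese original, since the source shows the patterns only in translation. Whether the query is wrapped in quotation marks is left to the caller.

```python
def build_query(fn, pattern, abbr_keyword="简称"):
    """Build the query term for query pattern 1 or 2.

    abbr_keyword is an assumed Chinese keyword for "abbreviation".
    """
    if pattern == 1:
        return f"{fn}{abbr_keyword}"       # "Fn abbreviation"
    if pattern == 2:
        return f"{fn}*{abbr_keyword}"      # "Fn* abbreviation"
    raise ValueError("this sketch covers query patterns 1 and 2 only")

def anchors_to_keep(result_count, cap=100):
    """N is 100 if more than 100 results are returned, else the result count."""
    return min(result_count, cap)
```

For instance, `anchors_to_keep(37)` gives 37 and `anchors_to_keep(5000)` gives 100, matching the rule for N stated above.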
In step 1-4 above, analysis of the full name-abbreviation corpus shows that full name-abbreviation sentences have certain regular structures. According to structure, they are divided into six types: half-label type, rear-part type, all-in-one type, label-pair type, no-prefix type and prefix type. A candidate abbreviation extracted from a sentence of one of these six types takes the type of that sentence.
Half-label type: a pairing symbol appears on only one side of Can, which indicates that the sentence probably does not contain the complete An. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>Supreme People's Procuratorate (abbreviated as</em> "height<b>. This error arises because the whole sentence was not captured when the anchor corpus was obtained.
Rear-part type: in the sentence, Fn is the rear part of another full name "*Fn", so Can is likewise the rear part of the abbreviation "*Can" corresponding to "*Fn"; because of excessive reduction, Can is probably not an abbreviation of Fn. For example, querying Fn = "pleural effusion" with query pattern 1 yields the sentence: suppurative <em>pleural effusion (abbreviated as</em> pyothorax). In this sentence "pyothorax" is the abbreviation of "suppurative pleural effusion", but because of excessive reduction "chest" is not an abbreviation of "pleural effusion". In some cases excessive reduction is not a problem. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: the People's Republic of China <em>Supreme People's Procuratorate (abbreviated as</em> Chinese the Supreme People's Procuratorate). In this sentence "Chinese the Supreme People's Procuratorate" is the abbreviation of "Supreme People's Procuratorate of the People's Republic of China", yet the "the Supreme People's Procuratorate" inside it is also an abbreviation of "Supreme People's Procuratorate". How to decide that no excessive reduction has occurred therefore needs further study.
All-in-one type: Fn appears together with other full names as a component of a larger whole, and the abbreviation of the whole is a combined abbreviation of several full names. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: the Supreme People's Court and <em>Supreme People's Procuratorate (abbreviated as</em> two height). In this sentence "Supreme People's Procuratorate" and "Supreme People's Court" form a whole, and "two height" is the abbreviation of that whole. Sentences of this kind have an obvious structural feature: Fn is the last part of the whole, and Fn is preceded by a conjunction such as "and", "with" or "as well as".
Label-pair type: no Chinese characters precede Fn and Can is enclosed in paired symbols, so Can can be extracted directly without running an algorithm to determine its boundaries. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>Supreme People's Procuratorate (abbreviated as</em> "the Supreme People's Procuratorate").
No-prefix type: no Chinese characters precede Fn and Can is not enclosed in paired symbols, so the left boundary of Can need not be determined, but the right boundary must be. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: <em>Supreme People's Procuratorate abbreviated as</em> the Supreme People's Procuratorate was founded in 1954.
Prefix type: Chinese characters precede Fn, and both the left and right boundaries of Can must be determined. For example, querying Fn = "Supreme People's Procuratorate" with query pattern 1 yields the sentence: Jia Chunwang was elected <em>Supreme People's Procuratorate (abbreviated as</em> the Supreme People's Procuratorate) Procurator-General.
In step 1-5 above, algorithm CAEA1 is as follows:
Candidate abbreviation extraction algorithm 1 (candidate abbreviation extract algorithm, CAEA1)
Input: full name-abbreviation sentence Fa_sent
Output: type-tagged candidate abbreviation Can
Step1: Decompose Fa_sent into three parts, Before, Fn and Can_sent, where Fn is the known full name, Before is the Chinese character string that precedes Fn in the sentence, and Can_sent is the Chinese character string that follows the keyword "abbreviation" in the sentence. Write Can_sent character by character as Can_sent = P1 P2 … Pn, where Pi denotes one Chinese character. Define the left boundary of Can in Can_sent as left = 1 and the right boundary as right = n, and define the type tag of Can as tag = null.
Step2: if the left side of Can_sent is a pairing label and the right side is not the corresponding pairing label
    then tag ← half-label type
  end if
Step3: if before = null
    if tag = null
      then tag ← no-prefix type
    end if
    go to Step6
  end if
  if before != null and tag = null
    then tag ← prefix type
  end if
Step4: if the last character of Before is "and" or "with" or "as well as"
    then for each Pi ∈ {P1 P2 … Pn}
      if Pi does not appear in Fn
        then tag ← all-in-one type
        go to Step5
      end if
    end for each
  end if
Step5: for each Pi ∈ {P1 P2 … Pn}
    if Pi does not appear in Fn and Pi appears in Before
      then left ← i+1
    end if
    if Pi appears in Fn
      break;
    end if
  end for each
  if left > 1
    then tag ← rear-part type
  end if
Step6: if Can_sent is enclosed by a label pair and tag = no-prefix type
    then tag ← label-pair type
  end if
Step7: for each Pi ∈ {Pleft Pleft+1 … Pn-1}
    if Pi appears in the last segmented word of Fn and Pi+1 does not appear in Fn
      then right ← i
      add the word to the right of Pi to the to-be-verified An right-boundary vocabulary
    end if
  end for each
Step8: can ← Pleft Pleft+1 … Pright
  return can
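Steps 1 to 6 of CAEA1 (type tagging and the left boundary) can be sketched in Python as follows. This is one possible reading, not the authors' code. The assumptions are: Before, Fn and Can_sent have already been split out of the sentence; the pairing labels and the Chinese conjunctions 和, 与, 及 are guesses at the originals; a label pair counts as present when the opening symbol starts Can_sent and its closing symbol occurs later; and Step 4 ignores non-Chinese characters. Step 7 is omitted because it needs word segmentation of Fn and the right-boundary vocabulary.

```python
# Hypothetical pairing labels and conjunctions; the source does not list them.
PAIRS = {"“": "”", "\"": "\"", "「": "」", "《": "》"}
CONJUNCTIONS = ("和", "与", "及")

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def caea1_tag(before, fn, can_sent):
    """Return (tag, left) following Steps 2-6; left is a 1-based index."""
    tag, left = None, 1
    opener = can_sent[:1]
    has_pair = opener in PAIRS and PAIRS[opener] in can_sent[1:]

    # Step 2: an opening label without its matching close.
    if opener in PAIRS and not has_pair:
        tag = "half-label"

    if before == "":
        # Step 3, empty Before: skip Steps 4 and 5.
        if tag is None:
            tag = "no-prefix"
    else:
        if tag is None:
            tag = "prefix"
        # Step 4: Before ends with a conjunction and Can_sent has a
        # Chinese character that does not occur in Fn.
        if before[-1] in CONJUNCTIONS and any(
            is_cjk(ch) and ch not in fn for ch in can_sent
        ):
            tag = "all-in-one"
        # Step 5: skip leading characters that belong to Before, not Fn.
        for i, ch in enumerate(can_sent, start=1):
            if ch not in fn and ch in before:
                left = i + 1
            if ch in fn:
                break
        if left > 1:
            tag = "rear-part"

    # Step 6: a no-prefix candidate enclosed by a label pair.
    if has_pair and tag == "no-prefix":
        tag = "label-pair"
    return tag, left
```

With the back-translated guess Before = "化脓性", Fn = "胸腔积液" and Can_sent = "脓胸）" for the pleural effusion example, this reading yields the rear-part tag with left = 2, which matches the excessive-reduction case described for that type.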
In step 1-6 above, the An right-boundary vocabulary is produced by manually verifying the to-be-verified An right-boundary vocabulary; entries are added to the to-be-verified vocabulary dynamically by algorithm CAEA1.
As shown in Figure 3, the specific steps for producing the candidate abbreviation set with query pattern 3 are as follows:
Step 2-1: the user inputs a known Chinese full name Fn.
Step 2-2: a concrete query term is constructed according to query pattern 3, "full name Fn".
Step 2-3: the query term is submitted to the Google search engine, and the first 100 anchor texts are saved as the anchor corpus.
Step 2-4: by constructing regular expressions, the full name-abbreviation sentences that contain the query term are obtained from the anchor corpus and saved as the full name-abbreviation corpus.
Step 2-5: algorithm CAEA2 is used to extract candidate abbreviations from the full name-abbreviation corpus, forming the candidate abbreviation set.
In step 2-1 above, a document containing a batch of full names may also be input; in that case steps 2-2 to 2-5 are repeated for each Fn in the document to obtain its candidate abbreviation set.
In step 2-3 above, if Google returns more than 100 results, N is 100; otherwise N is the number of results Google returns.
In step 2-5 above, algorithm CAEA2 is as follows:
Candidate abbreviation extraction algorithm 2 (candidate abbreviation extract algorithm, CAEA2)
Input: full name-abbreviation sentence Fa_sent
Output: candidate abbreviation Can
Step1: Decompose Fa_sent into three parts, Can_sent, Fn and Behind, where Fn is the known full name, Can_sent is the Chinese character string that precedes the keyword "full name" in the sentence, and Behind is the Chinese character string that follows Fn in the sentence.
Step2: Segment Can_sent and Behind into words and tag their parts of speech; the segmentation results are {P1 P2 … Pk} and {R1 R2 … Rn} respectively. Define, for Can in Can_sent, the first-level left boundary index left1 = 1, the second-level left boundary index left2 = 1, the left boundary index left = 1 and the right boundary index right = k. Define the verb truncation flag flag_v = 0 and the part-of-speech right-boundary truncation flag flag_right = 0.
Step3: for each Pi ∈ {P1 P2 … Pk}
    if Pi and fn have a character in common
      then flag_v ← 1; // verbs after Pi cannot serve as the left boundary
    end if
    if Pi and fn have a character in common and left2 = 1
      then left2 ← i; // Pi may be the first segmented word of can
    end if
    if the part of speech of Pi is "conjunction" or "preposition" or "auxiliary word"
      then left1 ← i+1;
    end if
    if the part of speech of Pi is "verb" and flag_v = 0
      then left1 ← i+1;
    end if
  end for each
Step4: for each Pj ∈ {Pk Pk-1 … P1}
    if Pj and fn have a character in common
      then flag_right ← 1; // Pj may be a segmented word of can
    end if
    if the part of speech of Pj is "conjunction" or "preposition" or "auxiliary word" or "verb" and flag_right = 0
      then right ← j-1;
    end if
    if Pj and behind have a character in common and Pj and fn have no character in common
      then right ← j-1;
    end if
    if Pj is a punctuation mark
      then right ← j-1;
    end if
  end for each
Step5: if left2 <= right
    then left ← left2
  end if
  if left1 <= right
    then left ← left1
  end if
Step6: return can ← {Pleft … Pright}
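Steps 3 to 6 of CAEA2 can be traced with the following Python sketch. It is a direct transcription of the pseudocode under assumptions that are not in the source: Can_sent arrives already segmented and tagged as (word, tag) pairs by some external tool, and the single-letter tags c (conjunction), p (preposition), u (auxiliary word), v (verb) and w (punctuation) are an illustrative convention only.

```python
def caea2_extract(tokens, fn, behind_words):
    """tokens: list of (word, pos) for Can_sent; returns the candidate string."""
    k = len(tokens)
    left1 = left2 = left = 1          # 1-based indices, as in the pseudocode
    right = k
    flag_v = flag_right = 0
    fn_chars = set(fn)
    behind_chars = set("".join(behind_words))

    def shares(word, chars):
        return any(ch in chars for ch in word)

    # Step 3: scan left to right to find the left boundary candidates.
    for i, (word, pos) in enumerate(tokens, start=1):
        in_fn = shares(word, fn_chars)
        if in_fn:
            flag_v = 1                # later verbs cannot move the boundary
        if in_fn and left2 == 1:
            left2 = i
        if pos in ("c", "p", "u"):
            left1 = i + 1
        if pos == "v" and flag_v == 0:
            left1 = i + 1

    # Step 4: scan right to left to find the right boundary.
    for j in range(k, 0, -1):
        word, pos = tokens[j - 1]
        in_fn = shares(word, fn_chars)
        if in_fn:
            flag_right = 1
        if pos in ("c", "p", "u", "v") and flag_right == 0:
            right = j - 1
        if shares(word, behind_chars) and not in_fn:
            right = j - 1
        if pos == "w":
            right = j - 1

    # Step 5: prefer left1 over left2 when both fit inside the right boundary.
    if left2 <= right:
        left = left2
    if left1 <= right:
        left = left1

    # Step 6: join the words between the two boundaries.
    return "".join(word for word, _ in tokens[left - 1:right])
```

As a hypothetical usage, tokens `[("据悉", "v"), ("最高检", "n"), ("（", "w")]` with fn = "最高人民检察院" trim the leading verb and the trailing punctuation and leave "最高检".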
The operations above produce the candidate abbreviation set. We now discuss verifying the candidates in that set. With reference to Figure 4, the specific steps are as follows:
Step 6-1: each candidate in the candidate abbreviation set is verified against constraint axioms 1-5 of the constraint axiom set.
Step 6-2: the candidates in the candidate abbreviation set are classified on the basis of the constraint function set.
Step 6-3: the full name-abbreviation relation graph is constructed and used to verify each candidate in the candidate abbreviation set.
Step 6-4: a decision tree (see Figure 5) is generated from the tag category of each candidate, its classification category and the constraint function set. The decision tree is used to classify the candidates in the candidate abbreviation set; candidates of category "F" are removed and candidates of category "T" are retained.
In step 6-1 above, for each candidate Can in the candidate abbreviation set, Fn and Can are checked against the constraints of axioms 1-4; if they do not satisfy them, the candidate is wrong.
In step 6-2 above, classification proceeds as follows. According to whether the abbreviation contains different characters or a different character order, abbreviations are divided into the ordinary type, the different-character type and the different-order type. The ordinary type is further divided, according to whether it depends on context, into the strong context-independent type, the weak context-independent type and the context-dependent type. The context-independent types are further divided into high-frequency and low-frequency types according to the relative frequency of Fn in the full-name set, and the context-dependent type is divided into the forward type, the centered type and the backward type according to the center of gravity of An's coverage of Fn (see Table 1).
Table 1: Types of abbreviations
The specific classification criteria and the conditions that each type of abbreviation must satisfy are given in Table 2.
Table 2: Classification criteria for abbreviations
Category | Conditions to be satisfied |
High-frequency strong context-independent | f1=1 f2=1 f3=1 f11=1 |
Low-frequency strong context-independent | f1=1 f2=1 f3=1 f11<1 |
High-frequency weak context-independent | f1=1 f2=1 0.823 f3<1 f9=1 f11=1 |
Low-frequency weak context-independent | f1=1 f2=1 0.823 f3<1 f9=1 f11<1 |
Forward context-dependent | f1=1 f2=1 f3 1 f4 0.5 |
Centered context-dependent | f1=1 f2=1 0.5 f4 0.5 (f3 0.823 f9 1) |
Backward context-dependent | f1=1 f2=1 f3 1 f4 0.5 |
Different-order type | f1=1 f2=0 f11=1 |
Different-character type | f1 1 f7 f10 f7 0.05 f9=1 f11=1)) |
Note that context is a semantic-level concept, so it is difficult for a computer to decide intelligently whether a candidate abbreviation depends on context. The present invention approximates this decision with the constraint functions, working at the level of word-formation rules.
In Table 2, the intuitive meaning of high-frequency strong context-independent is: Fn contains all characters of Can with the character order unchanged, every segmented word of Fn has a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of low-frequency strong context-independent is: Fn contains all characters of Can with the character order unchanged, every segmented word of Fn has a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of high-frequency weak context-independent is: Fn contains all characters of Can with the character order unchanged, most segmented words of Fn have a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of low-frequency weak context-independent is: Fn contains all characters of Can with the character order unchanged, most segmented words of Fn have a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of forward context-dependent is: Fn contains all characters of Can with the character order unchanged, and the segmented words omitted from Fn lie mostly in the latter half of Fn.
In Table 2, the intuitive meaning of centered context-dependent is: Fn contains all characters of Can with the character order unchanged, and roughly equal numbers of segmented words are omitted from the front and rear parts of Fn.
In Table 2, the intuitive meaning of backward context-dependent is: Fn contains all characters of Can with the character order unchanged, and the segmented words omitted from Fn lie mostly in the first half of Fn.
In Table 2, the intuitive meaning of the different-order type is: Fn contains all characters of Can but the character order has changed, and Can has the highest frequency in the candidate abbreviation set.
In Table 2, the intuitive meaning of the different-character type is: Fn does not contain all characters of Can, but the frequency of Can is very high, or its relative frequency in the candidate abbreviation set is very high.
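The rule-based classification of Table 2 can be illustrated for the three rows whose conditions are fully legible in the table: the two strong context-independent rows and the different-order row. The Python sketch below is a partial illustration only. The function name and the dictionary layout are hypothetical, the remaining rows are deliberately left out because their comparison operators are not recoverable here, and exact float comparison is used purely for brevity.

```python
def classify_legible_rows(f):
    """f maps a constraint function number (1..11) to its value.

    Returns a category name for the rows of Table 2 that can be read
    unambiguously, or None when none of those rows applies.
    """
    if f[1] == 1 and f[2] == 1 and f[3] == 1:
        if f[11] == 1:
            return "high-frequency strong context-independent"
        if f[11] < 1:
            return "low-frequency strong context-independent"
    if f[1] == 1 and f[2] == 0 and f[11] == 1:
        return "different-order"
    return None
```

A candidate whose values fall outside these three rows returns None and would need the remaining conditions of Table 2 to be classified.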
In step 6-3 above, this step is skipped when the input is a single full name or when the input document contains fewer than 1000 full names. Otherwise, the full name-abbreviation relation graph FAG = (F, A, E, f) is constructed according to the construction method introduced above. The graph is used for verification as follows:
If,
Then
If , the abbreviation type of v_i is not the context-independent type, then for the full name v_k the candidate abbreviation v_i is wrong.
In step 6-4 above, category "F" means incorrect and category "T" means correct.
The verification above yields the abbreviation set of the known full name. We now discuss sorting the abbreviations in that set.
In the present invention, abbreviations of the same category in the abbreviation set are sorted according to the priority synthesis function PRI(Cfn, An).
PRI(Cfn, An) is defined as follows:
Here each function F_i is given a weight in the comprehensive evaluation. The correspondence between the functions F_i and their weights is shown in Table 3; the weight values were obtained experimentally according to how strongly each function constrains the full name-abbreviation relation:
Table 3
No. | Function content | Function weight |
F1 | Proportion of the characters of Can that come from Fn | 0.12 |
F2 | Character order of Fn and Can | 0.08 |
F3 | Character coverage of Fn by Can | 0.06 |
F4 | Center of gravity of Can's coverage of the segmented words of Fn | 0.08 |
F5 | Longest run of consecutive segmented words of Fn not covered by Can | 0.04 |
F6 | Length relation between Fn and Can | 0.06 |
F7 | Frequency of Can in GoogleArchSet(Fn) | 0.10 |
F8 | Relative proportion of the characters of Can that come from Fn | 0.12 |
F9 | Relative coverage ratio of Can within the candidate abbreviation set | 0.10 |
F10 | Frequency of Can within the candidate abbreviation set | 0.12 |
F11 | Relative position of Can after the elements of the candidate abbreviation set are sorted in ascending order of frequency | 0.14 |
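The weights in the table above suggest a weighted combination of the eleven function values. The following Python sketch assumes that PRI is the plain weighted sum of the function values and that a larger PRI means a higher priority; both are assumptions, since the defining formula is not reproduced in the text. The names `pri` and `rank_abbreviations` are hypothetical.

```python
# Weights copied from the table above, keyed by function number.
WEIGHTS = {
    1: 0.12, 2: 0.08, 3: 0.06, 4: 0.08, 5: 0.04, 6: 0.06,
    7: 0.10, 8: 0.12, 9: 0.10, 10: 0.12, 11: 0.14,
}

def pri(values):
    """Assumed form of PRI: weighted sum of the eleven function values."""
    return sum(WEIGHTS[i] * values[i] for i in WEIGHTS)

def rank_abbreviations(candidates):
    """candidates: list of (abbreviation, values); highest PRI first."""
    return sorted(candidates, key=lambda item: pri(item[1]), reverse=True)
```

Sorting is applied only within one category of abbreviation, so a caller would group the abbreviation set by category before calling `rank_abbreviations` on each group.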
To demonstrate the practical effect of the present invention, a large number of experiments were carried out in which the method was used to find abbreviations for full names from multiple disciplines. We randomly selected 3910 Chinese Fn from multiple disciplines and used the invention to search for their An; the results are shown in Table 4.
Table 4: Experimental results of searching for An from Fn
Number of Fn | Number of Fn for which An was obtained | Percentage of Fn for which An was obtained | Total number of An | Accuracy of the An found (sampled) |
3910 | 3288 | 84.09% | 5321 | 94.81% |
We randomly selected 2140 abbreviations from the experiment above and verified them with the joint verification method; Table 5 shows the verification results.
Table 5: Results of joint verification
True label | Y | N | Precision | Recall |
Y | 1745 | 36 | 95.87% | 97.98% |
N | 75 | 284 | 88.75% | 79.11% |
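The precision and recall figures in the joint verification table can be recomputed from its four counts. The Python sketch below assumes that rows are the true labels and columns are the labels assigned by joint verification; that reading is an assumption, although it reproduces the reported percentages.

```python
# Counts from the joint verification table: CONFUSION[true][assigned].
CONFUSION = {
    "Y": {"Y": 1745, "N": 36},
    "N": {"Y": 75, "N": 284},
}

def precision(label):
    """Correctly assigned `label` over everything assigned `label`."""
    assigned = sum(row[label] for row in CONFUSION.values())
    return CONFUSION[label][label] / assigned

def recall(label):
    """Correctly assigned `label` over everything truly `label`."""
    return CONFUSION[label][label] / sum(CONFUSION[label].values())

for label in ("Y", "N"):
    print(label, f"{precision(label):.4f}", f"{recall(label):.4f}")
```

Under this reading the four counts sum to the 2140 sampled abbreviations. For label Y the precision is 1745/1820, about 0.9588, which the table gives as 95.87%, and the recall is 1745/1781, about 0.9798; for label N the values are 284/320 = 0.8875 and 284/359, about 0.7911.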
The experiments lead to the following conclusion: the present invention is effective at obtaining Chinese abbreviations, is widely applicable, and remedies the shortcomings of previous methods for obtaining Chinese abbreviations.
The embodiments described above illustrate preferred implementations of the present invention and do not limit its spirit and scope. Without departing from the design concept of the invention, all modifications and improvements made to its technical solution by persons of ordinary skill in the art shall fall within the scope of protection of the invention. The technical content for which protection is sought is set out in full in the claims.
Claims (10)
1. A method for obtaining Chinese abbreviations from Web pages, characterized by comprising the following steps:
Step 1: inputting a given Chinese full name Fn;
Step 2: selecting a query pattern and constructing a query term, submitting the query term to the Google search engine, and saving the first N anchor texts as the anchor corpus;
Step 3: using regular expressions, obtaining from the anchor corpus the sentences that contain the query term and express a full name-abbreviation relation, and saving them as the full name-abbreviation corpus;
Step 4: using the abbreviation extraction algorithm EAN to extract candidate abbreviations from the full name-abbreviation corpus, forming the candidate abbreviation set;
Step 5: classifying the candidate abbreviation set on the basis of the full name-abbreviation relation constraints, thereby forming a candidate abbreviation set with category marks;
Step 6: jointly verifying the candidate abbreviation set on the basis of the full name-abbreviation relation constraints and the full name-abbreviation relation graph, thereby forming the abbreviation set;
Step 7: prioritizing the abbreviations of the same type in the abbreviation set, thereby forming an ordered abbreviation set with category marks.
2. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 2, if Google returns more than 100 results, N is 100; otherwise N is the number of results Google returns.
3. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 2, there are three query patterns: query pattern 1, "Fn abbreviation"; query pattern 2, "Fn* abbreviation"; and query pattern 3, "full name Fn"; query pattern 2 is an extension of query pattern 1 in which a "*" is added between "Fn" and "abbreviation", and in a Google query "*" matches any single word; text of the kind "sinus rhythm" often appears in Web pages, and such text cannot be retrieved with query pattern 1 but can be retrieved with query pattern 2; the search order is query pattern 1 first, then query pattern 2, and finally query pattern 3.
4. The method for obtaining Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 4, the abbreviation extraction algorithm EAN comprises two algorithms, CAEA1 and CAEA2; when query pattern 1 or query pattern 2 is selected in step 2, CAEA1 is used to extract An in step 4; when query pattern 3 is selected in step 2, CAEA2 is used to extract An in step 4.
5. The method for obtaining Chinese abbreviations from Web pages according to claim 4, characterized in that: when query pattern 1 or query pattern 2 is selected in step 2, steps 4 and 5 carry out the following steps:
Step A-1: algorithm CAEA1 is used to extract the tagged candidate abbreviation set from the full name-abbreviation corpus;
Step A-2: the An right-boundary vocabulary is used to re-determine the right boundary of each candidate in the candidate abbreviation set;
In step A-2, the An right-boundary vocabulary is produced by manually verifying the to-be-verified An right-boundary vocabulary, and entries are added to the to-be-verified vocabulary dynamically by algorithm CAEA1;
In step 3, the full name-abbreviation sentences in the full name-abbreviation corpus are divided into six types: half-label type, rear-part type, all-in-one type, label-pair type, no-prefix type and prefix type; a candidate abbreviation extracted from a sentence of one of these six types takes the type of that sentence;
Half-label type: a pairing symbol appears on only one side of Can, which indicates that the sentence probably does not contain the complete An;
Rear-part type: in the sentence, Fn is the rear part of another full name "*Fn", so Can is likewise the rear part of the abbreviation "*Can" corresponding to "*Fn"; because of excessive reduction, Can is probably not an abbreviation of Fn;
All-in-one type: Fn appears together with other full names as a component of a larger whole, and the abbreviation of the whole is a combined abbreviation of several full names; sentences of this kind have an obvious structural feature: Fn is the last part of the whole and is preceded by a conjunction;
Label-pair type: no Chinese characters precede Fn and Can is enclosed in paired symbols, so Can can be extracted directly without running an algorithm to determine its boundaries;
No-prefix type: no Chinese characters precede Fn and Can is not enclosed in paired symbols, so the left boundary of Can need not be determined, but the right boundary must be;
Prefix type: Chinese characters precede Fn, and both the left and right boundaries of Can must be determined;
In step A-1, algorithm CAEA1 is as follows:
Candidate abbreviation extraction algorithm 1 (candidate abbreviation extract algorithm, CAEA1)
Input: full name-abbreviation sentence Fa_sent
Output: type-tagged candidate abbreviation Can
Step1: Decompose Fa_sent into three parts, Before, Fn and Can_sent, where Fn is the known full name, Before is the Chinese character string that precedes Fn in the sentence, and Can_sent is the Chinese character string that follows the keyword "abbreviation" in the sentence; write Can_sent character by character as Can_sent = P1 P2 … Pn, where Pi denotes one Chinese character; define the left boundary of Can in Can_sent as left = 1 and the right boundary as right = n, and define the type tag of Can as tag = null
Step2: if the left side of Can_sent is a pairing label and the right side is not the corresponding pairing label
    then tag ← half-label type
  end if
Step3: if before = null
    if tag = null
      then tag ← no-prefix type
    end if
    go to Step6
  end if
  if before != null and tag = null
    then tag ← prefix type
  end if
Step4: if the last character of Before is "and" or "with" or "as well as"
    then for each Pi ∈ {P1 P2 … Pn}
      if Pi does not appear in Fn
        then tag ← all-in-one type
        go to Step5
      end if
    end for each
  end if
Step5: for each Pi ∈ {P1 P2 … Pn}
    if Pi does not appear in Fn and Pi appears in Before
      then left ← i+1
    end if
    if Pi appears in Fn
      break;
    end if
  end for each
  if left > 1
    then tag ← rear-part type
  end if
Step6: if Can_sent is enclosed by a label pair and tag = no-prefix type
    then tag ← label-pair type
  end if
Step7: for each Pi ∈ {Pleft Pleft+1 … Pn-1}
    if Pi appears in the last segmented word of Fn and Pi+1 does not appear in Fn
      then right ← i
      add the word to the right of Pi to the to-be-verified An right-boundary vocabulary
    end if
  end for each
Step8: can ← Pleft Pleft+1 … Pright
  return can.
6. a kind of method of obtaining Chinese abbreviation from the Web webpage according to claim 4 is characterized in that: when step 2 was selected query pattern 3, step 4 and step 5 were carried out following steps:
Step B-1, utilize algorithm CAEA2 from full abbreviation language material, to extract the candidate to be called for short collection;
The particular content of described algorithm CAEA2 is as follows:
The candidate is called for short extraction algorithm 2:(candidate abbreviation extract algorithm CAEA2)
Input: entirely be called for short sentence
Fa_sent
Output: the candidate is called for short
Can
Will
Fa_sentResolve into
Can_sent,
FnWith
BehindThree parts, wherein
FnKnown full name,
Can_sentAt the full Chinese character string that is positioned at " full name " front in the sentence that is called for short,
BehindTo be positioned in full the abbreviation in the sentence
FnThe Chinese character string of back;
Right
Can_sentWith
BehindDifference participle and mark part of speech, word segmentation result is: { P
1P
2P
kAnd { R
1R
2R
n, definition
Can Can_sentIn one-level left margin subscript
Left1=1, secondary left margin subscript
Left2=1, the left margin subscript
Left=1With the right margin subscript
Right=k
The definition verb can intercept sign flag_v=0, and right margin can intercept sign flag_right=0 according to part of speech;
for each P_i ∈ {P_1 P_2 … P_k}
    if P_i shares a character with Fn
    then flag_v ← 1; // verbs after P_i can no longer serve as the left boundary
    end if
    if P_i shares a character with Fn and left2 = 1
    then left2 ← i; // P_i may be the first segmented word of Can
    end if
    if the part of speech of P_i is "conjunction", "preposition" or "auxiliary word"
    then left1 ← i+1;
    end if
    if the part of speech of P_i is "verb" and flag_v = 0
    then left1 ← i+1;
    end if
end for each
for each P_j ∈ {P_k P_k-1 … P_1}
    if P_j shares a character with Fn
    then flag_right ← 1; // P_j may be a segmented word of Can
    end if
    if the part of speech of P_j is "conjunction", "preposition", "auxiliary word" or "verb"
       and flag_right = 0
    then right ← j-1;
    end if
    if P_j shares a character with Behind and P_j shares no character with Fn
    then right ← j-1;
    end if
    if P_j is a punctuation mark
    then right ← j-1;
    end if
end for each
if left2 <= right
then left ← left2
end if
if left1 <= right
then left ← left1
end if
return can ← {P_left … P_right}.
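The boundary scan of CAEA2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's reference implementation: word segmentation and part-of-speech tagging are assumed to have been done already, the tag names ("conjunction", "preposition", "auxiliary", "verb", "punctuation") and the helper names `shares_char` and `caea2` are hypothetical, and the example sentence is invented for illustration.

```python
from typing import List, Tuple

# Assumed tag names; a real tagger will use its own tag set.
CUT_POS = {"conjunction", "preposition", "auxiliary"}


def shares_char(a: str, b: str) -> bool:
    """True if strings a and b have at least one character in common."""
    return any(ch in b for ch in a)


def caea2(tokens: List[Tuple[str, str]], fn: str, behind: str) -> str:
    """tokens: (word, pos) pairs of Can_sent; fn: known full name;
    behind: the text that follows fn in the sentence."""
    k = len(tokens)
    left1 = left2 = left = 1          # 1-based indices, as in the pseudocode
    right = k
    flag_v = flag_right = 0

    # Forward scan: locate the left boundary.
    for i, (word, pos) in enumerate(tokens, start=1):
        if shares_char(word, fn):
            flag_v = 1                # later verbs no longer cut the left side
            if left2 == 1:
                left2 = i             # possible first word of the candidate
        if pos in CUT_POS:
            left1 = i + 1
        if pos == "verb" and flag_v == 0:
            left1 = i + 1

    # Backward scan: locate the right boundary.
    for j in range(k, 0, -1):
        word, pos = tokens[j - 1]
        if shares_char(word, fn):
            flag_right = 1            # this word may belong to the candidate
        if (pos in CUT_POS or pos == "verb") and flag_right == 0:
            right = j - 1
        if shares_char(word, behind) and not shares_char(word, fn):
            right = j - 1
        if pos == "punctuation":
            right = j - 1

    if left2 <= right:
        left = left2
    if left1 <= right:
        left = left1
    return "".join(word for word, _ in tokens[left - 1:right])


# Can_sent = "他在北大即", Fn = "北京大学", nothing after Fn.
example = [("他", "pronoun"), ("在", "preposition"), ("北大", "noun"), ("即", "verb")]
print(caea2(example, "北京大学", ""))
```

In this example the preposition moves the left boundary to the third word and the trailing verb moves the right boundary back to it, so the sketch extracts "北大".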
7. The method for acquiring Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 6 above, if the abbreviation set is empty and another query pattern is still available in step 2, steps 2 to 7 are re-executed; if the abbreviation set is empty and no alternative query pattern remains in step 2, the method exits, indicating that no abbreviation of the given full name can be found on the Web.
8. The method for acquiring Chinese abbreviations from Web pages according to claim 1, characterized in that: in step 6 above, the full-name/abbreviation relation constraint is a quadruple R = (Fn, An, F, A), where Fn is a full name, An is an abbreviation of Fn, F is the set of constraint functions between Fn and An, and A is the set of constraint axioms that Fn and An must satisfy; the constraint function set expresses the constraints between Fn and An quantitatively, and the constraint axiom set expresses them qualitatively;
The full-name/abbreviation relation graph FAG (Fullname and Abbreviation Graph) is a quadruple FAG = (F, A, E, f), where F is the set of full names, A is the set of abbreviations, F ∪ A is the vertex set, E is the set of undirected edges, and f is a mapping from E to F × A; that is, for every edge e ∈ E there are always a vertex u ∈ F and a vertex v ∈ A such that f(e) = (u, v) holds, which is to say that e is the undirected edge connecting u and v.
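As a data structure, the relation graph is a bipartite graph between full names and abbreviations. A minimal sketch, assuming each edge is stored directly as its (full name, abbreviation) pair so that the mapping f is implicit; the class layout, the method names and the sample pairs are illustrative assumptions, not taken from the patent:

```python
from typing import Iterable, Set, Tuple


class FAG:
    """Bipartite full-name/abbreviation graph FAG = (F, A, E, f)."""

    def __init__(self, pairs: Iterable[Tuple[str, str]] = ()):
        self.F: Set[str] = set()              # full names
        self.A: Set[str] = set()              # abbreviations
        self.E: Set[Tuple[str, str]] = set()  # each edge is its own image under f
        for fn, an in pairs:
            self.add_edge(fn, an)

    def add_edge(self, fn: str, an: str) -> None:
        """Connect full name fn and abbreviation an with an undirected edge."""
        self.F.add(fn)
        self.A.add(an)
        self.E.add((fn, an))

    def vertices(self) -> Set[str]:
        """Vertex set F ∪ A."""
        return self.F | self.A

    def full_names_of(self, an: str) -> Set[str]:
        """All full names adjacent to the abbreviation an."""
        return {fn for fn, a in self.E if a == an}


# Invented sample pairs: one abbreviation shared by two full names.
graph = FAG([("北京大学", "北大"), ("北方大学", "北大"), ("清华大学", "清华")])
print(sorted(graph.full_names_of("北大")))
```

Looking up the neighbours of an abbreviation is the kind of query a graph-based check needs when one abbreviation is attached to several full names.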
9. The method for acquiring Chinese abbreviations from Web pages according to claim 8, characterized in that: step 6 is implemented as follows:
Step 6-1: verify each candidate abbreviation in the candidate abbreviation set against constraint axioms 1-5 of the constraint axiom set;
Step 6-2: classify the candidate abbreviations in the candidate abbreviation set on the basis of the constraint function set;
Step 6-3: construct the full-name/abbreviation relation graph and use it to verify each candidate abbreviation in the candidate abbreviation set;
Step 6-4: generate a decision tree from the candidate abbreviation tag classes, the classification categories and the constraint function set, and use the decision tree to classify the candidate abbreviations in the candidate abbreviation set; candidates classified as "F" are removed and candidates classified as "T" are retained, where category "F" means incorrect and category "T" means correct;
In step 6-1 above, for each candidate abbreviation Can in the candidate abbreviation set, verify whether Fn and Can satisfy the requirements of axioms 1-4; if they do not, the candidate abbreviation is wrong;
In step 6-2 above, the classification proceeds as follows: according to whether the abbreviation contains different characters or a different character order, abbreviations are divided into the ordinary type, the different-character type and the different-order type; the ordinary type is further divided, according to whether it depends on context, into the strongly context-independent type, the weakly context-independent type and the context-dependent type; the context-independent types are further divided into a high-frequency type and a low-frequency type according to the relative frequency of Fn within the full-name set; and the context-dependent type is divided into the forward type, the centered type and the backward type according to the center of gravity of An's coverage of Fn;
The specific classification criteria and the conditions that each type of abbreviation must satisfy are:
High-frequency strongly context-independent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, every segmented word of Fn has a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set;
Low-frequency strongly context-independent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, every segmented word of Fn has a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set;
High-frequency weakly context-independent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, most segmented words of Fn have a counterpart in Can, and Can has the highest frequency in the candidate abbreviation set;
Low-frequency weakly context-independent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, most segmented words of Fn have a counterpart in Can, and Can does not have the highest frequency in the candidate abbreviation set;
Forward context-dependent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, and the segmented words omitted from Fn lie mostly in the latter half of Fn;
Centered context-dependent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, and the numbers of segmented words omitted from the front and rear parts of Fn are similar;
Backward context-dependent type, intuitive meaning: Fn contains all the characters of Can in unchanged order, and the segmented words omitted from Fn lie mostly in the first half of Fn;
Different-order type, intuitive meaning: Fn contains all the characters of Can but the character order is changed, and Can has the highest frequency in the candidate abbreviation set;
Different-character type, intuitive meaning: Fn does not contain all the characters of Can, but the frequency of Can is very high or its relative frequency in the candidate abbreviation set is very high;
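The first split of this classification (ordinary, different-character, different-order) depends only on the characters of Fn and Can, so it can be sketched directly. The function name and the English labels below are illustrative assumptions, "unchanged order" is read here as "Can is a subsequence of Fn", and the frequency conditions attached to the different-order and different-character types are deliberately left out:

```python
def classify_top_level(fn: str, can: str) -> str:
    """Return the top-level abbreviation type of can with respect to fn."""
    # Any character of Can missing from Fn makes it a different-character type.
    if any(ch not in fn for ch in can):
        return "different-character"
    # Subsequence test: each membership check consumes the iterator up to the
    # matched character, so the characters must be found in Fn's own order.
    remaining = iter(fn)
    if all(ch in remaining for ch in can):
        return "ordinary"
    return "different-order"


for candidate in ("北大", "大北", "京师"):
    print(candidate, classify_top_level("北京大学", candidate))
```

The finer subtypes (strong or weak context independence, forward, centered or backward) additionally need the word segmentation of Fn and the candidate frequencies, which this sketch does not model.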
In step 6-3 above, when the input is a single full name, or the number of full names in the input full-name document is less than 1000, this step is not carried out; otherwise, the full-name/abbreviation relation graph FAG = (F, A, E, f) is built according to the construction method of the full-name/abbreviation relation graph;
The specific method of verification using the full-name/abbreviation relation graph is as follows:
If …, then, if … and the abbreviation type of v_i is not the context-independent type, the candidate abbreviation v_i is wrong for the full name v_k;
After the above verification, the abbreviation set of the known full name is obtained; the abbreviations in the set are then ranked;
Abbreviations of the same class in the abbreviation set are ranked according to the priority synthesis function PRI(Fn, Can);
PRI(Fn, Can) is defined as follows:
10. The method for acquiring Chinese abbreviations from Web pages according to claim 8 or 9, characterized in that: the constraint function set is specifically as follows:
Constraint function 1: the proportion of the characters of Can that come from Fn
Each Chinese character of Can should come from Fn; within the candidate abbreviation set, a candidate abbreviation with a higher proportion of characters that appear in Fn has a higher priority;
Constraint function 1 is formally defined and computed as follows:
Constraint function 2: the character order of Fn and Can
The characters of Can are arranged strictly in the order in which they occur in Fn;
Fn and Can have the same character order, and all the characters contained in Can appear in Fn; if Can contains a character that does not appear in Fn, the value of constraint function 2 is 0;
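The quantities behind constraint functions 1 and 2 can be illustrated with a short sketch. The exact formulas are not reproduced in this text, so both functions below are assumptions: `char_ratio_from_fn` returns the plain fraction of Can's characters found in Fn, and `order_value` returns 1 when Can follows Fn's character order and 0 otherwise (only the 0 case for a character missing from Fn is stated in the text). Both function names are hypothetical.

```python
def char_ratio_from_fn(fn: str, can: str) -> float:
    """Fraction of the characters of can that also occur in fn."""
    if not can:
        return 0.0
    return sum(ch in fn for ch in can) / len(can)


def order_value(fn: str, can: str) -> int:
    """1 if the characters of can occur in fn in the same order, else 0."""
    pos = 0
    for ch in can:
        idx = fn.find(ch, pos)   # search only to the right of the last match
        if idx < 0:              # character missing from fn, or out of order
            return 0
        pos = idx + 1
    return 1


print(char_ratio_from_fn("北京大学", "北师"), order_value("北京大学", "北大"))
```

For the full name "北京大学", the candidate "北大" gets ratio 1.0 and order value 1, while "北师" gets ratio 0.5 and order value 0.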
Constraint function 3: the word coverage rate of Can over Fn
The more segmented words of the full name a candidate abbreviation covers, the more likely it is to be a correct abbreviation;
Constraint function 3 is formally defined and computed as follows:
Constraint function 4: the center of gravity of Can's coverage of the segmented words of Fn
A full name usually consists of several segmented words, and in some cases one or more of them may be omitted in a candidate abbreviation; the omitted words, however, should be evenly distributed over the full name rather than all concentrated in its front part or rear part, that is, the words omitted from Fn should be spread over the front, middle and rear parts of Fn;
Constraint function 4 is formally defined and computed as follows:
Wherein,
Corresponding
If the segmented words of Fn covered by Can_1 are evenly distributed over Fn, whereas the segmented words of Fn covered by Can_2 all lie in the first half of Fn, then, according to constraint function 4, the priority of Can_1 is higher than that of Can_2;
Constraint function 5: the length of the longest run of consecutive segmented words of Fn not covered by Can
A candidate abbreviation usually consists of several segmented words, and in some cases one or more segmented words of the full name may be omitted in the abbreviation; the omitted words, however, usually do not occur consecutively in the full name, that is, consecutive segmented words of the full name are unlikely to be omitted together in the abbreviation;
Constraint function 5 is formally defined and computed as follows:
where N denotes the number of uncovered segmented-word strings contained in Fn
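The raw quantities that constraint functions 3 and 5 build on can be sketched as follows. Two assumptions are made here: a segmented word of Fn counts as covered when at least one of its characters occurs in Can, and the functions return the plain coverage rate and run length rather than the patent's formulas, which are not reproduced in this text. All function names are hypothetical.

```python
from typing import List


def covered_flags(fn_words: List[str], can: str) -> List[bool]:
    """For each segmented word of Fn, whether Can contains one of its characters."""
    return [any(ch in can for ch in word) for word in fn_words]


def coverage_rate(fn_words: List[str], can: str) -> float:
    """Share of Fn's segmented words that are covered by Can."""
    flags = covered_flags(fn_words, can)
    return sum(flags) / len(flags) if flags else 0.0


def longest_uncovered_run(fn_words: List[str], can: str) -> int:
    """Length of the longest run of consecutive uncovered words in Fn."""
    longest = current = 0
    for covered in covered_flags(fn_words, can):
        current = 0 if covered else current + 1
        longest = max(longest, current)
    return longest


words = ["中国", "科学", "技术", "大学"]
print(coverage_rate(words, "中科大"), longest_uncovered_run(words, "中科大"))
```

With this segmentation, "中科大" covers three of the four words and leaves a single uncovered word, whereas "中大" covers two and leaves an uncovered run of length two.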
Constraint function 6: the length relation between Fn and Can
The length of the full name corresponding to a candidate abbreviation is usually 1.5 to 5 times the length of the candidate abbreviation; a full name whose length falls outside this range is less likely;
Constraint function 6 is formally defined and computed as follows:
Constraint function 7: the frequency with which Can occurs in GoogleArchSet(Fn)
When abbreviations are searched for on Google by full name, a candidate abbreviation that occurs more frequently in GoogleArchSet(Fn) has a higher priority;
Constraint function 7 is formally defined and computed as follows:
When An is searched for by Fn, several candidate abbreviations are sometimes obtained; together they form the candidate abbreviation set CanSet(Fn). For any candidate abbreviation Can_i in CanSet(Fn), the analysis of FA(Fn, Can_i) can be compared against the corresponding values of the other candidate abbreviations in CanSet(Fn);
The following 4 constraint functions are defined on the basis of the candidate abbreviation set;
Constraint function 8: the relative proportion of the characters of Can that come from Fn
Compared with constraint function 1, constraint function 8 emphasizes the relative standing of a candidate abbreviation within CanSet(Fn);
Constraint function 8 is formally defined and computed as follows:
Constraint function 9: the relative coverage ratio of Fn within the candidate abbreviation set of Fn
Compared with constraint function 3, constraint function 9 emphasizes the relative standing of Can within CanSet(Fn): when the candidate abbreviations' coverage of the full name is high, a candidate abbreviation with relatively higher coverage has a higher priority;
Constraint function 9 is formally defined and computed as follows:
Constraint function 10: the frequency of Can within the candidate abbreviation set
When Can is searched for by Fn, the frequencies of all candidate abbreviations in the candidate abbreviation set are sometimes very low, which weakens the effect of constraint function 7; constraint function 10 therefore considers the relative frequency of each candidate abbreviation: within the candidate abbreviation set, a candidate abbreviation with a relatively higher frequency has a higher priority;
Constraint function 10 is formally defined and computed as follows:
Constraint function 11: the relative position of Can after the elements of the candidate abbreviation set are sorted in ascending order of frequency; when the candidate abbreviation set has many elements, a candidate with a lower frequency is relatively less important;
Constraint function 11 is formally defined and computed as follows:
A candidate abbreviation with a lower value of constraint function 11 is less important;
The constraint axiom set is specifically as follows:
Constraint axiom 1: unequal-length axiom
Intuitive meaning: in a full-name/abbreviation relation, the number of characters of Fn must be greater than the number of characters of Can;
Constraint axiom 2: declarative-mood axiom
Formal representation:
Intuitive meaning: neither Fn nor Can contains an interrogative word;
Constraint axiom 3: no-formal-repetition axiom
Intuitive meaning: in a full-name/abbreviation relation, neither Fn nor Can may be a Chinese character string of the form ss, where s is a Chinese character string;
Constraint axiom 4: no-semantic-repetition axiom
Formal representation:
Intuitive meaning: every Chinese character that appears in Fn must occur in Fn at least as many times as it occurs in Can;
Constraint axiom 5: non-generic axiom
Intuitive meaning: the number of full names to which a candidate abbreviation corresponds should be less than or equal to 5.
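Axioms 1, 3 and 4 are stated purely in terms of the characters of Fn and Can, so they can be checked mechanically. The sketch below follows the intuitive meanings given above; the function names are hypothetical, and axioms 2 and 5 are omitted because they need an interrogative-word list and the set of full names per abbreviation, neither of which is specified here.

```python
from collections import Counter


def axiom_unequal_length(fn: str, can: str) -> bool:
    """Axiom 1: Fn must have more characters than Can."""
    return len(fn) > len(can)


def is_doubled(s: str) -> bool:
    """True if s has the form ss for some non-empty string."""
    half, odd = divmod(len(s), 2)
    return len(s) > 0 and odd == 0 and s[:half] == s[half:]


def axiom_no_formal_repetition(fn: str, can: str) -> bool:
    """Axiom 3: neither Fn nor Can may be of the form ss."""
    return not is_doubled(fn) and not is_doubled(can)


def axiom_no_semantic_repetition(fn: str, can: str) -> bool:
    """Axiom 4: each character of Fn occurs in Fn at least as often as in Can."""
    fn_counts, can_counts = Counter(fn), Counter(can)
    return all(fn_counts[ch] >= can_counts[ch] for ch in fn_counts)


def passes_character_axioms(fn: str, can: str) -> bool:
    """Conjunction of axioms 1, 3 and 4."""
    return (axiom_unequal_length(fn, can)
            and axiom_no_formal_repetition(fn, can)
            and axiom_no_semantic_repetition(fn, can))


print(passes_character_axioms("北京大学", "北大"))
print(passes_character_axioms("北京大学", "大大"))
```

The candidate "北大" passes all three checks for "北京大学", while "大大" fails both the repetition check and the character-count check.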
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011102531213A CN102955819A (en) | 2011-08-31 | 2011-08-31 | Method for acquiring shortened form in Chinese from Web page |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102955819A true CN102955819A (en) | 2013-03-06 |
Family
ID=47764630
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011102531213A Pending CN102955819A (en) | 2011-08-31 | 2011-08-31 | Method for acquiring shortened form in Chinese from Web page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102955819A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956192A (en) * | 2016-06-15 | 2016-09-21 | 中国互联网络信息中心 | Method and system for acquiring shortened form of organization name based on website homepage information |
CN107577655A (en) * | 2016-07-05 | 2018-01-12 | 北京国双科技有限公司 | Name acquiring method and apparatus |
CN110502685A (en) * | 2019-08-02 | 2019-11-26 | 阿里巴巴集团控股有限公司 | A kind of data optimization methods based on search engine, device and equipment |
CN113220863A (en) * | 2021-07-07 | 2021-08-06 | 企查查科技有限公司 | Extraction method, device and storage medium for company effective abbreviation |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079028A (en) * | 2007-05-29 | 2007-11-28 | 中国科学院计算技术研究所 | On-line translation model selection method of statistic machine translation |
Non-Patent Citations (2)
Title |
---|
GUANG JIANG: ""A General Approach to Extracting Full Names and Abbreviations for Chinese Entities from the Web"", 《 INTELLIGENT IFIP ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGY》 * |
GUOGANG TIAN: ""MFC: A Method of Co-referent Relation Acquisition from Large-Scale Chinese Corpora"", 《LECTURE NOTES IN COMPUTER SCIENCE》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20130306 |