CN103116573B - Automatic domain lexicon expansion method based on vocabulary annotations - Google Patents

Automatic domain lexicon expansion method based on vocabulary annotations

Info

Publication number
CN103116573B
Authority
CN
China
Prior art keywords
node
vocabulary
domain
corpus
domain lexicon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310046647.3A
Other languages
Chinese (zh)
Other versions
CN103116573A (en)
Inventor
黄河燕 (Huang Heyan)
史树敏 (Shi Shumin)
朱朝勇 (Zhu Chaoyong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310046647.3A priority Critical patent/CN103116573B/en
Publication of CN103116573A publication Critical patent/CN103116573A/en
Application granted granted Critical
Publication of CN103116573B publication Critical patent/CN103116573B/en


Abstract

The present invention relates to an automatic domain lexicon expansion method based on vocabulary annotations, and belongs to the technical field of natural language processing. The steps are: 1. generate a domain classification tree by analyzing the degree of correlation between the fields to which the domain lexicons belong; 2. obtain a training set for each domain lexicon to be expanded; 3. preprocess the training sets to obtain corpus feature sets; 4. for the corpus feature set of each node, count the number of times each word occurs in that feature set, and count how many of the child nodes' corpus feature sets contain the word; 5. compute the confidence of each word in each corpus feature set; 6. add new words to the domain lexicons to be expanded. The proposed method does not require manual collection of domain corpora, and therefore avoids the limitations imposed by the quality and scale of domain corpora and by corpus imbalance.

Description

Automatic domain lexicon expansion method based on vocabulary annotations
Technical field
The present invention relates to a method for automatically expanding a domain lexicon, and in particular to an automatic domain lexicon expansion method based on vocabulary annotations, belonging to the technical field of natural language processing.
Background technology
A domain lexicon (domain dictionary) is the collection of terms or expressions specific to a particular field. Domain lexicons are a basic resource for natural language processing: domain knowledge is widely used in components such as word sense disambiguation and syntactic analysis within tasks including machine translation, information retrieval, data mining and text classification, and the scale and quality of a domain lexicon directly affect the performance of these applications.
Methods for constructing and expanding domain lexicons fall into three classes according to their degree of automation: manual construction and expansion based on expert knowledge, semi-automatic generation and expansion, and fully automatic generation and expansion. Manual construction and expansion achieve high accuracy, but require the long-term participation of many domain experts, so labor and time costs are excessive and the lexicons are not kept up to date. Fully automatic generation and expansion determine the domain attributes of words by analyzing differences in their statistical properties across domain corpora; such methods need no domain experts and save a great deal of labor cost, but the accuracy of the entries they admit is not high. Semi-automatic generation and expansion lie between manual compilation and automatic generation: a domain expert specifies a small amount of domain knowledge, and the domain lexicon is then expanded automatically. Most existing semi-automatic and fully automatic methods depend on a domain corpus: the quality of the generated lexicon depends on the quality of the corpus used, the completeness of the lexicon is limited by the corpus scale, and, because corpora are unbalanced, the domain labels assigned to words are easily biased toward the domains that dominate the corpus. In addition, both kinds of methods fail to make effective use of existing dictionary resources and do not consider the correlations between fields.
Summary of the invention
The object of the present invention is to overcome the deficiencies of existing automatic domain lexicon expansion methods by providing an automatic domain lexicon expansion method based on vocabulary annotations.
The object of the invention is achieved through the following technical solution.
The automatic domain lexicon expansion method based on vocabulary annotations comprises the following concrete operation steps:
Step one: generate a domain classification tree by analyzing the degree of correlation between the fields to which the domain lexicons belong. Specifically:
Step 1.1: denote the set of pending nodes by the symbol D, and initialize the pending node set to the empty set;
Step 1.2: put each domain lexicon to be expanded into the pending node set as one node. The node name is the name of the domain lexicon, and the node content is all the entries in the domain lexicon; each entry consists of a word and the annotation of that word.
Step 1.3: for every pair of nodes in the pending node set, compute the degree of correlation between the fields to which the two domain lexicons belong according to formula (1), denoted R(d_1, d_2).
$$R(d_1, d_2) = \frac{|d_1 \cap d_2|}{\min(|d_1|, |d_2|)} \qquad (1)$$
where R(d_1, d_2) is the degree of correlation between the field d_1 to which one domain lexicon D_1 in the pending node set belongs and the field d_2 to which another domain lexicon D_2 belongs; |d_1 ∩ d_2| is the number of identical words contained in both D_1 and D_2; and min(|d_1|, |d_2|) is the number of words in the smaller of the two domain lexicons.
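As an illustration (not part of the patent text), formula (1) can be evaluated directly from the headword sets of two lexicons. The sketch below is a minimal Python reading of the formula, assuming each lexicon is represented as a set of its headwords; the function name and the toy vocabularies are hypothetical.

```python
def correlation(d1: set, d2: set) -> float:
    """Degree of correlation R(d1, d2) of formula (1): the number of shared
    words divided by the vocabulary size of the smaller lexicon."""
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / min(len(d1), len(d2))

# Toy example (hypothetical words, not taken from the patent's lexicons):
aviation = {"altimeter", "fuselage", "turbine", "valve"}
machinery = {"turbine", "valve", "lathe", "bearing"}
print(correlation(aviation, machinery))  # 2 shared words / min(4, 4) = 0.5
```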
Step 1.4: among the degrees of correlation R(d_1, d_2) computed in step 1.3 for all pairs of nodes in the pending node set, find the maximum value, denoted R_max; denote the two domain lexicons corresponding to R_max by D_1' and D_2', the fields to which they belong by d_1' and d_2', and their contents by c_1 and c_2, respectively.
Step 1.5: merge the entries of D_1' and D_2' and give the merged dictionary a new name, denoted D_new; the content of the merged dictionary D_new is denoted c_new, where c_new = c_1 ∪ c_2. Then create a new node whose name is D_new and whose content is c_new; the nodes D_1' and D_2' become the child nodes of D_new.
Step 1.6: add the new node D_new to the pending node set, and delete the nodes D_1' and D_2' from the pending node set.
Step 1.7: count the number of nodes in the pending node set, denoted N. If N ≥ 2, return to step 1.3; otherwise, end the operation.
Through the above steps, a domain classification tree is obtained.
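The agglomerative construction of steps 1.1 to 1.7 can be sketched in a few lines of Python. This is an illustrative reading, not the patent's reference implementation: the Node class, the function names and the dictionary-of-sets input format are assumptions.

```python
from itertools import combinations

class Node:
    def __init__(self, name, content, children=()):
        self.name = name            # node name = domain lexicon name
        self.content = content      # node content = set of headwords (entries)
        self.children = list(children)

def correlation(c1, c2):
    # degree of correlation, formula (1)
    return len(c1 & c2) / min(len(c1), len(c2))

def build_domain_tree(lexicons):
    """lexicons: dict mapping field name -> set of headwords."""
    # steps 1.1-1.2: one pending node per domain lexicon to be expanded
    pending = [Node(name, set(vocab)) for name, vocab in lexicons.items()]
    while len(pending) >= 2:                                   # step 1.7
        # steps 1.3-1.4: the pair of pending nodes with the highest correlation
        a, b = max(combinations(pending, 2),
                   key=lambda pair: correlation(pair[0].content, pair[1].content))
        # step 1.5: merge their entries into a new parent node
        merged = Node(a.name + " & " + b.name, a.content | b.content, children=[a, b])
        # step 1.6: the merged node replaces the two merged nodes
        pending = [n for n in pending if n not in (a, b)] + [merged]
    return pending[0]   # the single remaining node is the root of the tree
```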
Step two: obtain a training set for each domain lexicon to be expanded.
This step can be carried out in parallel with step one: select a general-purpose electronic dictionary with annotations; then, for each domain lexicon to be expanded, look up each of its words in the annotated general-purpose dictionary in turn, and put the annotation corresponding to each word into the training set of that field as one training instance. This yields the training set of that field.
Through the operation of step two, one training set is obtained for the field of each domain lexicon to be expanded.
Step three: preprocess the training sets to obtain corpus feature sets.
On the basis of step two, preprocess the corpus in the training set of each domain lexicon to be expanded in turn to obtain the corpus feature set corresponding to that field's training set. Specifically: perform preprocessing such as word segmentation, phrase extraction, lemmatization and stop-word removal on every training instance in the field's training set, obtaining a group of words for each training instance, called a corpus feature subset. The collection of the corpus feature subsets of all training instances in the field's training set is called the corpus feature set of that domain lexicon.
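Steps two and three in miniature: the hedged Python sketch below assumes the annotated general-purpose dictionary is a dict from headword to annotation text, and replaces the full preprocessing chain (word segmentation, phrase extraction, lemmatization, stop-word removal) with a simple whitespace tokenizer and an illustrative stop-word list; a real implementation would plug proper segmentation and lemmatization tools in here.

```python
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "or", "in"}   # illustrative only

def preprocess(annotation: str) -> set:
    """Stand-in for the preprocessing of step three: turn one training instance
    (an annotation) into a corpus feature subset, i.e. a group of words."""
    return {tok for tok in annotation.lower().split() if tok not in STOP_WORDS}

def build_training_set(field_vocab: set, general_dict: dict) -> list:
    """Step two: the annotations of the field's words that are found in the
    annotated general-purpose dictionary, one training instance per word."""
    return [general_dict[w] for w in field_vocab if w in general_dict]

def build_feature_set(training_set: list) -> list:
    """Step three: preprocess every training instance; the list of the resulting
    corpus feature subsets is the field's corpus feature set."""
    return [preprocess(instance) for instance in training_set]
```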
Step four: on the basis of steps one and three, for each leaf node of the domain classification tree obtained in step one, count the number of times each word occurs in the corpus feature set corresponding to that leaf node. For each non-leaf node, first merge the corpus feature sets of its child nodes and take the merged result as the corpus feature set of that non-leaf node, and then count the following: (1) the number of times each word in the corpus feature set of the non-leaf node occurs in that corpus feature set; (2) for each word in the corpus feature set of the non-leaf node, the number of the child nodes' corpus feature sets that contain the word.
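The statistics of step four can be gathered in one bottom-up pass over the tree. The sketch below assumes the Node class and the feature sets from the previous sketches (a corpus feature set is a list of word sets) and stores the counts on the nodes; the attribute names are illustrative.

```python
from collections import Counter

def collect_statistics(node, feature_sets_by_field):
    """Step four: attach to every node of the domain classification tree
      node.feature_set - the node's corpus feature set (list of feature subsets)
      node.term_count  - occurrences of each word in the node's feature set
      node.child_df    - for non-leaf nodes, in how many of the child nodes'
                         corpus feature sets each word appears."""
    if not node.children:                       # leaf node
        node.feature_set = feature_sets_by_field[node.name]
    else:                                       # non-leaf node: merge the children
        node.feature_set = []
        node.child_df = Counter()
        for child in node.children:
            collect_statistics(child, feature_sets_by_field)
            node.feature_set.extend(child.feature_set)
            child_vocab = set().union(*child.feature_set) if child.feature_set else set()
            for word in child_vocab:
                node.child_df[word] += 1
    node.term_count = Counter(w for subset in node.feature_set for w in subset)
```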
Step five: on the basis of step four, compute the confidence of each word in each corpus feature set according to formula (2).
$$wdc = \frac{wd}{\sum wd} \times \log\!\left(\frac{wd}{dt} + 1\right) \qquad (2)$$
where wdc is the confidence of a word w in the corpus feature set corresponding to a field d; wd is the number of times the word w occurs in field d; Σwd is the total number of times w occurs in the corpus feature set corresponding to the parent of the node whose corpus feature set contains w; and dt is the number of corpus feature sets that contain w among those corresponding to the sibling nodes of the node whose corpus feature set contains w.
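A hedged Python reading of formula (2), building on the statistics collected in the previous sketch. Because the published formula is typeset ambiguously, the exact placement of the +1 smoothing and the treatment of the case dt = 0 (no sibling contains the word) are assumptions, made explicit in the code comments.

```python
import math

def word_confidence(node, parent, word):
    """Confidence wdc of `word` in the field of `node`, formula (2) as read here:
      wd  - occurrences of the word in this node's corpus feature set
      Swd - occurrences of the word in the parent node's corpus feature set
      dt  - number of sibling nodes (other children of the parent) whose
            corpus feature sets contain the word."""
    wd = node.term_count[word]
    swd = parent.term_count[word]
    siblings = [c for c in parent.children if c is not node]
    dt = sum(1 for s in siblings if s.term_count[word] > 0)
    if wd == 0 or swd == 0:
        return 0.0
    # max(dt, 1) guards the division when no sibling contains the word;
    # the exact smoothing in the patent's formula is not fully legible.
    return (wd / swd) * math.log(wd / max(dt, 1) + 1)
```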
Step six: add new words to the domain lexicons to be expanded.
On the basis of step five, take the words newly included in the annotated general-purpose electronic dictionary described in step two as new words and add them to the domain lexicons to be expanded. The concrete operation steps are:
Step 6.1: perform preprocessing such as word segmentation, phrase extraction, lemmatization and stop-word removal on the annotation of the new word, obtaining the group of words corresponding to the annotation of this word; denote the number of words in this group by n.
Step 6.2: take the root node of the domain classification tree as the current node.
Step 6.3: according to formula (3), compute in turn the degree of membership between the word set of the new word's annotation and the field corresponding to each child node of the current node in the domain classification tree, and find the maximum value among them, denoted sdc_max.
$$sdc_k = m_k \times \prod_{j=1}^{n} wdc_{jk} \qquad (3)$$
where sdc_k is the degree of membership between the word set of the new word's annotation and the field k corresponding to one of the child nodes of the current node in the domain classification tree; wdc_{jk} is the confidence of the j-th word of the new word's annotation in field k; and m_k is the number of words, among the n words of the new word's annotation, whose confidence is highest in field k.
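A minimal sketch of formula (3), assuming a confidence(child, word) callable such as the word_confidence() sketch given after formula (2); counting every field that ties for the highest confidence toward m_k is one possible reading of the definition above.

```python
def membership_scores(annotation_words, current_node, confidence):
    """Degree of membership sdc_k, formula (3), between the new word's annotation
    words and the field of each child of the current node."""
    children = current_node.children
    conf = {c: [confidence(c, w) for w in annotation_words] for c in children}
    scores = {}
    for c in children:
        # m_k: number of annotation words whose confidence is highest in this field
        m_k = sum(1 for j, v in enumerate(conf[c])
                  if v == max(conf[other][j] for other in children))
        prod = 1.0
        for v in conf[c]:                  # product over the n annotation words
            prod *= v
        scores[c] = m_k * prod             # sdc_k = m_k * prod_j wdc_jk
    return scores
```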
Step 6.4: if the maximum degree of membership sdc_max obtained in step 6.3 is greater than a pre-specified threshold, further judge whether the node corresponding to sdc_max is a leaf node: if it is a leaf node, add the new word to the domain lexicon corresponding to that node; if it is not a leaf node, take the node corresponding to sdc_max as the current node and return to step 6.3. If the maximum degree of membership sdc_max obtained in step 6.3 is not greater than the pre-specified threshold, treat the new word as a common word, do not add it to any of the domain lexicons to be expanded, and end the operation.
Through the above steps, automatic expansion of the domain lexicons is realized.
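Steps 6.2 to 6.4 amount to a top-down walk of the domain classification tree. The sketch below assumes the membership_scores() helper from the previous sketch; the 0.7 default threshold mirrors the value used in the embodiment, and the function name is illustrative.

```python
def assign_new_word(annotation_words, root, confidence, threshold=0.7):
    """Decide which domain lexicon (if any) should receive a new word, given the
    preprocessed words of its annotation. Returns the name of the leaf field,
    or None if the word is judged to be a common word."""
    current = root                                                         # step 6.2
    while current.children:
        scores = membership_scores(annotation_words, current, confidence)  # step 6.3
        best_child, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score <= threshold:      # step 6.4: below threshold -> common word
            return None
        if not best_child.children:      # leaf node -> add to its domain lexicon
            return best_child.name
        current = best_child             # non-leaf node -> descend and repeat
    return None
```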
Beneficial effect
Compared with existing automatic domain lexicon expansion methods, the automatic domain lexicon expansion method based on vocabulary annotations proposed by the present invention does not require manual collection of domain corpora, and therefore avoids the limitations imposed by the quality and scale of domain corpora and by corpus imbalance.
Brief description of the drawings
Fig. 1 is the domain classification tree in the specific embodiment of the invention.
Embodiment
The present invention is described in further detail below with reference to the drawing and a specific embodiment.
Table 1 shows the vocabulary counts of the communication, aviation, machinery and computer domain lexicons from the Huajian machine dictionaries, together with the sizes of their pairwise intersections. The communication, aviation, machinery and computer domain lexicons contain 12626, 7592, 19250 and 5156 words, respectively. The intersection of the communication and aviation lexicons contains 4432 words; communication and machinery, 6210; communication and computer, 2705; aviation and machinery, 4908; aviation and computer, 2064; machinery and computer, 2383.
Table 1. Vocabulary counts of the four domain lexicons and their pairwise intersections
                Communication   Aviation   Machinery   Computer
Communication       12626         4432       6210        2705
Aviation             4432         7592       4908        2064
Machinery            6210         4908      19250        2383
Computer             2705         2064       2383        5156
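Applying formula (1) to the counts in Table 1 gives R(communication, aviation) = 4432/7592 ≈ 0.584, R(communication, machinery) = 6210/12626 ≈ 0.492, R(communication, computer) = 2705/5156 ≈ 0.525, R(aviation, machinery) = 4908/7592 ≈ 0.646, R(aviation, computer) = 2064/5156 ≈ 0.400 and R(machinery, computer) = 2383/5156 ≈ 0.462; the aviation-machinery pair therefore has the highest degree of correlation, which is why these two lexicons are merged first in step 1.4 of the embodiment below.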
The automatic domain lexicon expansion method based on vocabulary annotations proposed by the present invention is used to automatically expand the communication, aviation, machinery and computer domain lexicons from the Huajian machine dictionaries. The concrete operation steps are:
Step one: generate a domain classification tree by analyzing the degree of correlation between the fields to which the domain lexicons belong. Specifically:
Step 1.1: initialize the pending node set D to the empty set;
Step 1.2: put the four domain lexicons "communication", "aviation", "machinery" and "computer" into the pending node set, each as one node. The node name is the name of the domain lexicon, and the node content is all the entries in the domain lexicon; each entry consists of a word and the annotation of that word.
Step 1.3: for every pair of nodes in the pending node set, compute the degree of correlation R(d_1, d_2) between the fields to which the two domain lexicons belong according to formula (1).
Step 1.4: the computation shows that the two most correlated fields are aviation and machinery.
Step 1.5: merge aviation and machinery into one node, and compute the degrees of correlation of the new node "aviation & machinery" with the computer field and with the communication field, respectively.
Step 1.6: add the new node "aviation & machinery" to the pending node set, and delete "aviation" and "machinery" from the pending node set.
Step 1.7: the pending node set now contains 3 nodes, so steps 1.3 to 1.7 are repeated until only one node remains in the pending node set, which yields the domain classification tree shown in Fig. 1. The root node Root of the domain classification tree has two child nodes, "aviation & machinery" and "communication & computer"; the node "aviation & machinery" has two child nodes, "aviation" and "machinery"; the node "communication & computer" has two child nodes, "communication" and "computer".
Step two: obtain a training set for each domain lexicon to be expanded.
This step can be carried out in parallel with step one: select a general-purpose electronic dictionary with annotations; then, for each domain lexicon to be expanded, look up each of its words in the annotated general-purpose dictionary in turn, and put the annotation corresponding to each word into the training set of that field as one training instance. This yields the training set of that field.
Through the operation of step two, one training set is obtained for the field of each domain lexicon to be expanded.
Step three: preprocess the training sets to obtain corpus feature sets.
On the basis of step two, preprocess the corpus in the training set of each domain lexicon to be expanded in turn to obtain the corpus feature set corresponding to that field's training set. Specifically: perform preprocessing such as word segmentation, phrase extraction, lemmatization and stop-word removal on every training instance in the field's training set, obtaining a group of words for each training instance, called a corpus feature subset. The collection of the corpus feature subsets of all training instances in the field's training set is called the corpus feature set of that domain lexicon.
Step four: on the basis of steps one and three, for each leaf node of the domain classification tree obtained in step one, count the number of times each word occurs in the corpus feature set corresponding to that leaf node. For each non-leaf node, first merge the corpus feature sets of its child nodes and take the merged result as the corpus feature set of that non-leaf node, and then count the following: (1) the number of times each word in the corpus feature set of the non-leaf node occurs in that corpus feature set; (2) for each word in the corpus feature set of the non-leaf node, the number of the child nodes' corpus feature sets that contain the word.
Step five: on the basis of step four, compute the confidence of each word in each corpus feature set according to formula (2).
Step six: add new words to the domain lexicons to be expanded.
On the basis of step five, take the words newly included in the annotated general-purpose electronic dictionary described in step two as new words and add them to the domain lexicons to be expanded. The concrete operation steps are:
Step 6.1: perform preprocessing such as word segmentation, phrase extraction, lemmatization and stop-word removal on the annotation of the new word, obtaining the group of words corresponding to the annotation of this word; denote the number of words in this group by n.
Step 6.2: take the root node of the domain classification tree as the current node.
Step 6.3: according to formula (3), compute in turn the degree of membership between the word set of the new word's annotation and the field corresponding to each child node of the current node in the domain classification tree, and find the maximum value sdc_max among them.
Step 6.4: if the maximum degree of membership sdc_max obtained in step 6.3 is greater than the pre-specified threshold 0.7, further judge whether the node corresponding to sdc_max is a leaf node: if it is a leaf node, add the new word to the domain lexicon corresponding to that node; if it is not a leaf node, take the node corresponding to sdc_max as the current node and return to step 6.3. If the maximum degree of membership sdc_max obtained in step 6.3 is not greater than the pre-specified threshold, treat the new word as a common word, do not add it to any of the domain lexicons to be expanded, and end the operation.
Through the above steps, automatic expansion of the domain lexicons is realized.

Claims (1)

1. An automatic domain lexicon expansion method based on vocabulary annotations, characterized in that its concrete operation steps are:
Step one: generate a domain classification tree by analyzing the degree of correlation between the fields to which the domain lexicons belong; specifically:
Step 1.1: denote the set of pending nodes by the symbol D, and initialize the pending node set to the empty set;
Step 1.2: put each domain lexicon to be expanded into the pending node set as one node; the node name is the name of the domain lexicon, and the node content is all the entries in the domain lexicon; each entry consists of a word and the annotation of that word;
Step 1.3: for every pair of nodes in the pending node set, compute the degree of correlation between the fields to which the two domain lexicons belong according to formula (1);
$$R(d_1, d_2) = \frac{|d_1 \cap d_2|}{\min(|d_1|, |d_2|)} \qquad (1)$$
where R(d_1, d_2) is the degree of correlation between the field d_1 to which one domain lexicon D_1 in the pending node set belongs and the field d_2 to which another domain lexicon D_2 belongs; |d_1 ∩ d_2| is the number of identical words contained in both D_1 and D_2; and min(|d_1|, |d_2|) is the number of words in the smaller of the two domain lexicons;
Step 1.4: among the degrees of correlation R(d_1, d_2) computed in step 1.3 for all pairs of nodes in the pending node set, find the maximum value, denoted R_max; denote the two domain lexicons corresponding to R_max by D_1' and D_2', the fields to which they belong by d_1' and d_2', and their contents by c_1 and c_2, respectively;
Step 1.5: merge the entries of D_1' and D_2' and give the merged dictionary a new name, denoted D_new; the content of the merged dictionary D_new is denoted c_new, where c_new = c_1 ∪ c_2; then create a new node whose name is D_new and whose content is c_new; the nodes D_1' and D_2' become the child nodes of D_new;
Step 1.6: add the new node D_new to the pending node set, and delete the nodes D_1' and D_2' from the pending node set;
Step 1.7: count the number of nodes in the pending node set, denoted N; if N ≥ 2, return to step 1.3; otherwise, end the operation;
through the above steps, a domain classification tree is obtained;
Step two: obtain a training set for each domain lexicon to be expanded;
this step is carried out in parallel with step one: select a general-purpose electronic dictionary with annotations; then, for each domain lexicon to be expanded, look up each of its words in the annotated general-purpose dictionary in turn, and put the annotation of each word into the training set of that field as one training instance, which yields the training set of that field;
through the operation of step two, one training set is obtained for the field of each domain lexicon to be expanded;
Step three: preprocess the training sets to obtain corpus feature sets;
on the basis of step two, preprocess the corpus in the training set of each domain lexicon to be expanded in turn to obtain the corpus feature set corresponding to that field's training set; specifically: preprocess every training instance in the field's training set, obtaining a group of words for each training instance, called a corpus feature subset; the collection of the corpus feature subsets of all training instances in the field's training set is called the corpus feature set of that domain lexicon;
the preprocessing comprises word segmentation, phrase extraction, lemmatization and stop-word removal;
Step four: on the basis of steps one and three, for each leaf node of the domain classification tree obtained in step one, count the number of times each word occurs in the corpus feature set corresponding to that leaf node; for each non-leaf node, first merge the corpus feature sets of its child nodes and take the merged result as the corpus feature set of that non-leaf node, and then count the following: (1) the number of times each word in the corpus feature set of the non-leaf node occurs in that corpus feature set; (2) for each word in the corpus feature set of the non-leaf node, the number of the child nodes' corpus feature sets that contain the word;
Step five: on the basis of step four, compute the confidence of each word in each corpus feature set according to formula (2);
$$wdc = \frac{wd}{\sum wd} \times \lg\!\left(\frac{wd}{dt} + 1\right) \qquad (2)$$
where wdc is the confidence of a word w in the corpus feature set corresponding to a field d; wd is the number of times the word w occurs in field d; Σwd is the total number of times w occurs in the corpus feature set corresponding to the parent of the node whose corpus feature set contains w; and dt is the number of corpus feature sets that contain w among those corresponding to the sibling nodes of the node whose corpus feature set contains w;
Step six: add new words to the domain lexicons to be expanded;
on the basis of step five, take the words newly included in the annotated general-purpose electronic dictionary described in step two as new words and add them to the domain lexicons to be expanded; the concrete operation steps are:
Step 6.1: preprocess the annotation of the new word, obtaining the group of words corresponding to the annotation of this word; denote the number of words in this group by n;
the preprocessing comprises word segmentation, phrase extraction, lemmatization and stop-word removal;
Step 6.2: take the root node of the domain classification tree as the current node;
Step 6.3: according to formula (3), compute in turn the degree of membership between the word set of the new word's annotation and the field corresponding to each child node of the current node in the domain classification tree, and find the maximum value among them, denoted sdc_max;
$$sdc_k = m_k \times \prod_{j=1}^{n} wdc_{jk} \qquad (3)$$
where sdc_k is the degree of membership between the word set of the new word's annotation and the field k corresponding to one of the child nodes of the current node in the domain classification tree; wdc_{jk} is the confidence of the j-th word of the new word's annotation in field k; and m_k is the number of words, among the n words of the new word's annotation, whose confidence is highest in field k;
Step 6.4: if the maximum degree of membership sdc_max obtained in step 6.3 is greater than a pre-specified threshold, further judge whether the node corresponding to sdc_max is a leaf node: if it is a leaf node, add the new word to the domain lexicon corresponding to that node; if it is not a leaf node, take the node corresponding to sdc_max as the current node and return to step 6.3; if the maximum degree of membership sdc_max obtained in step 6.3 is not greater than the pre-specified threshold, treat the new word as a common word, do not add it to any of the domain lexicons to be expanded, and end the operation;
through the above steps, automatic expansion of the domain lexicons is realized.
CN201310046647.3A 2013-02-06 2013-02-06 Automatic domain lexicon expansion method based on vocabulary annotations Active CN103116573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310046647.3A CN103116573B (en) 2013-02-06 2013-02-06 Automatic domain lexicon expansion method based on vocabulary annotations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310046647.3A CN103116573B (en) 2013-02-06 2013-02-06 Automatic domain lexicon expansion method based on vocabulary annotations

Publications (2)

Publication Number Publication Date
CN103116573A CN103116573A (en) 2013-05-22
CN103116573B true CN103116573B (en) 2015-10-28

Family

ID=48414950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310046647.3A Active CN103116573B (en) 2013-02-06 2013-02-06 Automatic domain lexicon expansion method based on vocabulary annotations

Country Status (1)

Country Link
CN (1) CN103116573B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324692B (en) * 2013-06-04 2016-05-18 北京大学 Classificating knowledge acquisition methods and device
CN104268160B (en) * 2014-09-05 2017-06-06 北京理工大学 A kind of OpinionTargetsExtraction Identification method based on domain lexicon and semantic role
CN105955958A (en) * 2016-05-06 2016-09-21 长沙市麓智信息科技有限公司 English patent application document write auxiliary system and write auxiliary method thereof
CN106681986A (en) * 2016-12-13 2017-05-17 成都数联铭品科技有限公司 Multi-dimensional sentiment analysis system
CN109299453B (en) * 2017-07-24 2021-02-09 华为技术有限公司 Method and device for constructing dictionary and computer-readable storage medium
CN108197243A (en) * 2017-12-29 2018-06-22 北京奇虎科技有限公司 Method and device is recommended in a kind of input association based on user identity
CN109325224B (en) * 2018-08-06 2022-03-11 中国地质大学(武汉) Word vector representation learning method and system based on semantic primitive language

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
EP2515242A2 (en) * 2011-04-21 2012-10-24 Palo Alto Research Center Incorporated Incorporating lexicon knowledge to improve sentiment classification

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2515242A2 (en) * 2011-04-21 2012-10-24 Palo Alto Research Center Incorporated Incorporating lexicon knowledge to improve sentiment classification
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Gloss-based Word Domain Assignment; Chaoyong Zhu et al.; Natural Language Processing and Knowledge Engineering (NLP-KE), 2011 7th International Conference on; 2011-11-29; pp. 150-155 *
Hierarchical Domain Assignment Based on Word-Gloss; Zhu Chaoyong et al.; China Communications; March 2012, No. 3; pp. 19-27 *
A survey of Chinese new word identification technology; Zhang Haijun et al.; Computer Science; March 2010, Vol. 37, No. 3; pp. 6-10, 16 *

Also Published As

Publication number Publication date
CN103116573A (en) 2013-05-22

Similar Documents

Publication Publication Date Title
CN103116573B (en) Automatic domain lexicon expansion method based on vocabulary annotations
CN103123618B (en) Text similarity acquisition methods and device
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN107133223B (en) A kind of machine translation optimization method of the more reference translation information of automatic exploration
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN102402561B (en) Searching method and device
CN106294593A (en) In conjunction with subordinate clause level remote supervisory and the Relation extraction method of semi-supervised integrated study
CN104636466A (en) Entity attribute extraction method and system oriented to open web page
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN106569993A (en) Method and device for mining hypernym-hyponym relation between domain-specific terms
CN105138514A (en) Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction
CN104268230B (en) A kind of Chinese micro-blog viewpoint detection method based on heterogeneous figure random walk
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN104484380A (en) Personalized search method and personalized search device
CN104484433A (en) Book body matching method based on machine learning
CN106528524A (en) Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN103154939A (en) Statistical machine translation method using dependency forest
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN106156041A (en) Hot information finds method and system
CN104133812A (en) User-query-intention-oriented Chinese sentence similarity hierarchical calculation method and user-query-intention-oriented Chinese sentence similarity hierarchical calculation device
CN105005554A (en) Method for calculating word semantic relevancy
CN102760121B (en) Dependence mapping method and system
CN109522396B (en) Knowledge processing method and system for national defense science and technology field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant