CN103116573A - Field dictionary automatic extension method based on vocabulary annotation

Info

Publication number: CN103116573A
Application number: CN2013100466473A
Authority: CN
Inventors: 黄河燕; 史树敏; 朱朝勇
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2013-02-06
Filing date: 2013-02-06
Publication date: 2013-05-22
Anticipated expiration: 2033-02-06
Also published as: CN103116573B

Abstract

The invention relates to a method for automatically expanding domain dictionaries based on word annotations, and belongs to the technical field of natural language processing. The method comprises the following steps: (1) generating a domain classification tree by analyzing the relevancy between the domains to which the domain dictionaries belong; (2) obtaining a training set for each domain dictionary to be expanded; (3) preprocessing the training set to obtain a corpus feature set; (4) for each node of the tree, counting the number of times each word in the node's corpus feature set occurs in that set and, for each non-leaf node, the number of child-node corpus feature sets that contain the word; (5) calculating the confidence of each word in each corpus feature set; (6) adding new words to the domain dictionaries to be expanded. The method does not require a domain corpus to be collected manually, and therefore avoids the effects of domain-corpus quality, limited scale, and imbalance.

Description

A method for automatically expanding domain dictionaries based on word annotations

Technical field

The present invention relates to a method for automatically expanding domain dictionaries, and in particular to a method for automatically expanding domain dictionaries based on word annotations. It belongs to the technical field of natural language processing.

Background technology

A domain dictionary is the set of terms or expressions specific to a particular domain. Domain dictionaries are a basic resource for natural language processing: domain knowledge is widely used in stages such as word sense disambiguation and syntactic analysis within tasks such as machine translation, information retrieval, data mining, and text classification, and the scale and quality of a domain dictionary directly affect the performance of the applications that rely on it.

Methods for building and expanding domain dictionaries fall into three classes by degree of automation: manual methods based on expert knowledge, semi-automatic methods, and fully automatic methods. Manual methods are highly accurate, but they require many domain experts over a long period, so labor and time costs are high and the dictionaries are not kept up to date. Fully automatic methods judge the domain of a word by analyzing how its statistical properties differ across corpora of different domains; they need no domain experts and save a great deal of labor, but the accuracy of the resulting dictionary entries is not high. Semi-automatic methods lie between manual compilation and automatic generation: a domain expert specifies a small amount of domain knowledge, from which the dictionary is expanded automatically. Most existing semi-automatic and fully automatic methods depend on a domain corpus. The quality of the generated dictionary depends on the quality of that corpus, its completeness is limited by the corpus scale, and, because corpora are unbalanced, the domain label assigned to a word is biased toward domains with larger corpora. Neither of these two kinds of method makes effective use of existing dictionary resources, and neither considers the correlation between domains.

Summary of the invention

The object of the present invention is to address the shortcomings of existing methods for automatically expanding domain dictionaries by proposing an automatic expansion method based on word annotations.

This object is achieved through the following technical solution.

A method for automatically expanding domain dictionaries based on word annotations, comprising the following steps:

Step 1: generate a domain classification tree by analyzing the relevancy between the domains to which the domain dictionaries belong. Specifically:

Step 1.1: let D denote the pending node set, and initialize it to the empty set.

Step 1.2: put each domain dictionary to be expanded into the pending node set as a node. The node name is the name of the domain dictionary, and the node content is all the entries in that dictionary; each entry comprises a word and the annotation of that word.

Step 1.3: using formula (1), calculate the relevancy R(d₁, d₂) between the domains of the dictionaries represented by every pair of nodes in the pending node set.

R(d_{1}, d_{2}) = \frac{|d_{1} \cap d_{2}|}{\min(|d_{1}|, |d_{2}|)} \qquad (1)

Here, R(d₁, d₂) is the relevancy between the domain d₁ of one domain dictionary D₁ in the pending node set and the domain d₂ of another domain dictionary D₂; |d₁ ∩ d₂| is the number of words that D₁ and D₂ have in common; min(|d₁|, |d₂|) is the number of words in whichever of D₁ and D₂ contains fewer words.

Step 1.4: find the maximum, denoted R_max, among the relevancy values R(d₁, d₂) obtained in step 1.3. Denote the two domain dictionaries corresponding to R_max by D₁' and D₂', their domains by d₁' and d₂', and their contents by c₁ and c₂.

Step 1.5: merge the entries of D₁' and D₂' and give the merged dictionary a new name, D_new. Its content is c_new = c₁ ∪ c₂. Then create a new node whose name is D_new and whose content is c_new, with D₁' and D₂' as its child nodes.

Step 1.6: add the new node D_new to the pending node set, and delete nodes D₁' and D₂' from it.

Step 1.7: count the nodes in the pending node set; call this number N. If N ≥ 2, return to step 1.3; otherwise, stop.

These steps yield a domain classification tree.

Step 2: obtain a training set for each domain dictionary to be expanded.

This step can run in parallel with step 1. Choose a general-purpose electronic dictionary with annotations. Then, for each domain dictionary to be expanded, look up each of its words in turn in the general-purpose dictionary and put the annotation of each word, as one training datum, into the training set for that domain. This yields the training set for the domain.

After step 2, each domain dictionary to be expanded has a training set for the domain to which it belongs.
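The lookup in step 2 can be sketched in a few lines. This is an illustration under stated assumptions: the general-purpose dictionary is modelled as a mapping from word to annotation string, the function name `build_training_set` is hypothetical, and words that have no entry in the general-purpose dictionary are simply skipped, a case the text does not address.

```python
def build_training_set(domain_words, general_dictionary):
    """Collect the annotation of every domain word found in the general dictionary.

    domain_words: iterable of words from one domain dictionary.
    general_dictionary: mapping word -> annotation text (assumed structure).
    """
    # One annotation becomes one training datum; missing words are skipped.
    return [general_dictionary[word]
            for word in domain_words
            if word in general_dictionary]
```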

Step 3: preprocess the training sets to obtain corpus feature sets.

Building on step 2, preprocess the training set of each domain dictionary to be expanded in turn to obtain the corpus feature set for that domain. Specifically, each training datum in a domain's training set undergoes preprocessing such as word segmentation, phrase extraction, lemmatization, and stop-word removal, which yields a group of words for that training datum, called a corpus feature subset. The collection of corpus feature subsets for all the training data in the domain's training set is called the corpus feature set of that domain dictionary.
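As a rough illustration of step 3, the sketch below turns each annotation into a corpus feature subset. It is deliberately simplified and rests on several assumptions: the annotations are English text, word segmentation is approximated by a regular expression, the stop-word list is a tiny hand-picked sample, and phrase extraction and lemmatization are omitted. Chinese annotations would need a real word segmenter in place of the regular expression.

```python
import re

# A small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "that", "in", "for", "and", "or", "is"}


def preprocess(annotation):
    """Return the corpus feature subset (list of words) for one annotation."""
    tokens = re.findall(r"[a-z0-9]+", annotation.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def feature_set(training_set):
    """Return the corpus feature set: one feature subset per training datum."""
    return [preprocess(annotation) for annotation in training_set]
```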

Step 4: building on steps 1 and 3, for each leaf node of the domain classification tree obtained in step 1, count the number of times each word in the leaf node's corpus feature set occurs in that set. For each non-leaf node, first merge the corpus feature sets of its child nodes and take the result as the corpus feature set of the non-leaf node, then compute the following statistics: (1) the number of times each word in the non-leaf node's corpus feature set occurs in that set; (2) for each word in the non-leaf node's corpus feature set, the number of child-node corpus feature sets that contain the word.
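The statistics of step 4 can be gathered in one recursive pass over the tree. The sketch below is one possible reading, not the definitive procedure: it assumes nodes are dictionaries with a `children` list and, for leaves, a `features` list of feature subsets, and it treats the merge of child feature sets as a concatenation, so that a parent's word counts are the sums of its children's counts. The key names `counts` and `child_df` are hypothetical.

```python
from collections import Counter


def annotate_counts(node):
    """Attach word statistics to every node of the tree and return the node.

    node["counts"]: occurrences of each word in the node's corpus feature set.
    node["child_df"] (non-leaf only): number of child feature sets containing
    each word.
    """
    if not node["children"]:
        # Leaf: count every word across all of its feature subsets.
        node["counts"] = Counter(word
                                 for subset in node["features"]
                                 for word in subset)
        return node
    node["counts"] = Counter()
    node["child_df"] = Counter()
    for child in node["children"]:
        annotate_counts(child)
        # Statistic (1): the merged set's counts are the sum of child counts.
        node["counts"] += child["counts"]
        # Statistic (2): each child contributes 1 for every word it contains.
        node["child_df"].update(child["counts"].keys())
    return node
```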

Step 5: building on step 4, calculate the confidence of each word in each corpus feature set according to formula (2).

wdc = \frac{wd}{\sum wd} \times \log\left(\frac{wd}{dt} + 1\right) \qquad (2)

Here, wdc is the confidence of a word w in the corpus feature set of a domain d; wd is the number of times w occurs in domain d; Σwd is the total number of times w occurs in the corpus feature set of the parent of the node whose corpus feature set contains w; dt is the number of corpus feature sets containing w among the corpus feature sets of the sibling nodes of the node whose corpus feature set contains w.
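Formula (2) is straightforward to evaluate once the counts from step 4 are available. The sketch below makes two assumptions that the text leaves open: the logarithm is taken to be the natural logarithm, and dt is read as statistic (2) of step 4, that is, the number of the parent's child feature sets containing the word, which includes the node itself and is therefore at least 1 whenever wd is positive. The function name and parameter names are illustrative.

```python
import math


def confidence(wd, parent_total, dt):
    """Formula (2): wdc = (wd / sum_wd) * log(wd / dt + 1).

    wd: occurrences of the word in this node's corpus feature set.
    parent_total: occurrences of the word in the parent's corpus feature set.
    dt: number of the parent's child feature sets containing the word (assumed).
    """
    if wd == 0 or parent_total == 0 or dt == 0:
        # A word absent from the domain gets zero confidence.
        return 0.0
    return (wd / parent_total) * math.log(wd / dt + 1)
```

For example, a word that occurs twice in a leaf and three times in the parent, and that appears in both children, gets `confidence(2, 3, 2)`, which is (2/3) × ln 2.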

Step 6: add new words to the domain dictionaries to be expanded.

Building on step 5, take the words newly included in the annotated general-purpose electronic dictionary of step 2 as new words and add them to the domain dictionaries to be expanded. The specific steps are:

Step 6.1: preprocess the annotation of the new word (word segmentation, phrase extraction, lemmatization, stop-word removal, and so on) to obtain the group of words corresponding to that annotation; let n denote the number of words in the group.

Step 6.2: take the root node of the domain classification tree as the current node.

Step 6.3: using formula (3), calculate in turn the membership degree between the new word and the domain of each child node of the current node in the domain classification tree, and find the maximum, denoted sdc_max.

sdc_{k} = m_{k} \times \prod_{j=1}^{n} wdc_{jk} \qquad (3)

Here, sdc_k is the membership degree between the new word and the domain k of a child node of the current node in the domain classification tree; wdc_jk is the confidence, in domain k, of the j-th word in the group of words corresponding to the new word's annotation; m_k is the number of words, among the n words corresponding to the new word's annotation, whose confidence is highest in domain k.

Step 6.4: if the maximum membership degree sdc_max obtained in step 6.3 is greater than a preset threshold, check whether the node corresponding to sdc_max is a leaf node. If it is, add the new word to the domain dictionary of that node; if it is not, take that node as the current node and return to step 6.3. If sdc_max is not greater than the preset threshold, treat the new word as a common word, do not add it to any domain dictionary to be expanded, and stop.

These steps achieve the automatic expansion of the domain dictionaries.
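Steps 6.2 to 6.4 describe a top-down descent through the tree that stops at a leaf or when no child scores above the threshold. The following sketch shows one way to express it. It assumes each node carries a hypothetical `confidence` mapping from word to its wdc value in that node's domain, that a word missing from the mapping has confidence 0, and that a word tied for the highest confidence counts toward m_k for every tied domain. None of these details is fixed by the text.

```python
import math


def membership(words, conf_by_domain):
    """Formula (3) for each candidate domain k: sdc_k = m_k * prod_j wdc_jk.

    conf_by_domain: mapping domain name -> {word: confidence} (assumed layout).
    """
    scores = {}
    for k, conf_k in conf_by_domain.items():
        # m_k: number of annotation words whose confidence peaks in domain k.
        m_k = 0
        for word in words:
            best = max(conf.get(word, 0.0) for conf in conf_by_domain.values())
            if best > 0.0 and conf_k.get(word, 0.0) == best:
                m_k += 1
        product = math.prod(conf_k.get(word, 0.0) for word in words)
        scores[k] = m_k * product
    return scores


def classify(words, root, threshold):
    """Return the name of the leaf domain for a new word, or None if common."""
    node = root  # step 6.2: start at the root
    while node["children"]:
        conf = {child["name"]: child["confidence"] for child in node["children"]}
        scores = membership(words, conf)  # step 6.3
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            return None  # step 6.4: treated as a common word
        node = next(c for c in node["children"] if c["name"] == best)
    return node["name"]  # step 6.4: reached a leaf
```

Because formula (3) is a plain product, a single annotation word with zero confidence in a domain drives that domain's score to zero; the sketch keeps that behaviour rather than adding smoothing the text does not mention.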

Beneficial effect

Compared with existing methods for automatically expanding domain dictionaries, the method proposed by the present invention has the advantage that no domain corpus needs to be collected manually, so it avoids the limitations imposed by the quality and scale of a domain corpus and the effects of corpus imbalance.

Description of drawings

Fig. 1 is the domain classification tree in the specific embodiment of the invention.

Embodiment

The present invention is described in further detail below with reference to the drawing and a specific embodiment.

Table 1 shows the vocabulary sizes of the four domain dictionaries (communication, aviation, machinery, and computer) in the Huajian machine dictionary, together with the sizes of their pairwise intersections. The communication, aviation, machinery, and computer dictionaries contain 12626, 7592, 19250, and 5156 words respectively. The communication and aviation dictionaries share 4432 words; communication and machinery share 6210; communication and computer share 2705; aviation and machinery share 4908; aviation and computer share 2064; machinery and computer share 2383.

Table 1: Vocabulary sizes of the four domain dictionaries and sizes of their pairwise intersections

Dictionary	Communication	Aviation	Machinery	Computer
Communication	12626	4432	6210	2705
Aviation	4432	7592	4908	2064
Machinery	6210	4908	19250	2383
Computer	2705	2064	2383	5156
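The pairwise relevancy values implied by Table 1 can be checked by applying formula (1) to the counts above. The short sketch below does this; the variable names are illustrative, and the figures are copied from the table.

```python
# Vocabulary sizes (diagonal of Table 1).
sizes = {"communication": 12626, "aviation": 7592,
         "machinery": 19250, "computer": 5156}

# Sizes of the pairwise intersections (off-diagonal cells of Table 1).
shared = {
    frozenset({"communication", "aviation"}): 4432,
    frozenset({"communication", "machinery"}): 6210,
    frozenset({"communication", "computer"}): 2705,
    frozenset({"aviation", "machinery"}): 4908,
    frozenset({"aviation", "computer"}): 2064,
    frozenset({"machinery", "computer"}): 2383,
}

# Formula (1): shared words divided by the size of the smaller dictionary.
relevancy = {pair: count / min(sizes[name] for name in pair)
             for pair, count in shared.items()}

best = max(relevancy, key=relevancy.get)
for pair, value in sorted(relevancy.items(), key=lambda item: -item[1]):
    print(" & ".join(sorted(pair)), round(value, 4))
```

The highest value is for aviation and machinery, 4908 / 7592 ≈ 0.6465, followed by communication and aviation at 4432 / 7592 ≈ 0.5838.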

The method proposed by the present invention is used to automatically expand the communication, aviation, machinery, and computer domain dictionaries in the Huajian machine dictionary. The specific steps are:

Step 1.1: initialize the pending node set D to the empty set.

Step 1.2: put the four domain dictionaries "communication", "aviation", "machinery", and "computer" into the pending node set, each as a node. The node name is the name of the domain dictionary, and the node content is all the entries in that dictionary; each entry comprises a word and the annotation of that word.

Step 1.3: using formula (1), calculate the relevancy R(d₁, d₂) between the domains of the dictionaries represented by every pair of nodes in the pending node set.

Step 1.4: the calculation shows that the two most closely related domains are aviation and machinery.

Step 1.5: merge aviation and machinery into one node, and calculate the relevancy of the new node "Aviation & Machinery" to computer and to communication.

Step 1.6: add the new node "Aviation & Machinery" to the pending node set, and delete "aviation" and "machinery" from it.

Step 1.7: the pending node set now contains 3 nodes, so steps 1.3 to 1.7 are repeated until only one node remains in the pending node set, which yields the domain classification tree shown in Fig. 1. The root node Root of the tree has two child nodes, "Aviation & Machinery" and "Communication & Computer". Node "Aviation & Machinery" has two child nodes, "aviation" and "machinery"; node "Communication & Computer" has two child nodes, "communication" and "computer".

Step 2: obtain a training set for each domain dictionary to be expanded.

Step 6: add new words to the domain dictionaries to be expanded.

Step 6.3: using formula (3), calculate in turn the membership degree between the new word and the domain of each child node of the current node in the domain classification tree, and find the maximum, sdc_max.

Step 6.4: if the maximum membership degree sdc_max obtained in step 6.3 is greater than the preset threshold of 0.7, check whether the node corresponding to sdc_max is a leaf node. If it is, add the new word to the domain dictionary of that node; if it is not, take that node as the current node and return to step 6.3. If sdc_max is not greater than the preset threshold, treat the new word as a common word, do not add it to any domain dictionary to be expanded, and stop.

Claims

1. A method for automatically expanding domain dictionaries based on word annotations, characterized in that it comprises the following steps:

Step 1: generate a domain classification tree by analyzing the relevancy between the domains to which the domain dictionaries belong; specifically:

Step 1.2: put each domain dictionary to be expanded into the pending node set as a node; the node name is the name of the domain dictionary, and the node content is all the entries in that dictionary; each entry comprises a word and the annotation of that word;

Step 1.3: using formula (1), calculate the relevancy between the domains of the dictionaries represented by every pair of nodes in the pending node set;

R(d_{1}, d_{2}) = \frac{|d_{1} \cap d_{2}|}{\min(|d_{1}|, |d_{2}|)} \qquad (1)

wherein R(d₁, d₂) is the relevancy between the domain d₁ of one domain dictionary D₁ in the pending node set and the domain d₂ of another domain dictionary D₂; |d₁ ∩ d₂| is the number of words that D₁ and D₂ have in common; min(|d₁|, |d₂|) is the number of words in whichever of D₁ and D₂ contains fewer words;

Step 1.4: find the maximum, denoted R_max, among the relevancy values R(d₁, d₂) obtained in step 1.3; denote the two domain dictionaries corresponding to R_max by D₁' and D₂', their domains by d₁' and d₂', and their contents by c₁ and c₂;

Step 1.5: merge the entries of D₁' and D₂' and give the merged dictionary a new name, D_new; its content is c_new = c₁ ∪ c₂; then create a new node whose name is D_new and whose content is c_new, with D₁' and D₂' as its child nodes;

Step 1.6: add the new node D_new to the pending node set, and delete nodes D₁' and D₂' from it;

Step 1.7: count the nodes in the pending node set and call this number N; if N ≥ 2, return to step 1.3; otherwise, stop;

these steps yield a domain classification tree;

Step 2: obtain a training set for each domain dictionary to be expanded;

this step can run in parallel with step 1: choose a general-purpose electronic dictionary with annotations; then, for each domain dictionary to be expanded, look up each of its words in turn in the general-purpose dictionary and put the annotation of each word, as one training datum, into the training set for that domain, which yields the training set for the domain;

after step 2, each domain dictionary to be expanded has a training set for the domain to which it belongs;

Step 3: preprocess the training sets to obtain corpus feature sets;

building on step 2, preprocess the training set of each domain dictionary to be expanded in turn to obtain the corpus feature set for that domain; specifically, each training datum in a domain's training set is preprocessed to obtain a group of words for that training datum, called a corpus feature subset; the collection of corpus feature subsets for all the training data in the domain's training set is called the corpus feature set of that domain dictionary;

the preprocessing comprises word segmentation, phrase extraction, lemmatization, and stop-word removal;

Step 4: building on steps 1 and 3, for each leaf node of the domain classification tree obtained in step 1, count the number of times each word in the leaf node's corpus feature set occurs in that set; for each non-leaf node, first merge the corpus feature sets of its child nodes and take the result as the corpus feature set of the non-leaf node, then compute the following statistics: (1) the number of times each word in the non-leaf node's corpus feature set occurs in that set; (2) for each word in the non-leaf node's corpus feature set, the number of child-node corpus feature sets that contain the word;

Step 5: building on step 4, calculate the confidence of each word in each corpus feature set according to formula (2);

wdc = \frac{wd}{\sum wd} \times \log\left(\frac{wd}{dt} + 1\right) \qquad (2)

wherein wdc is the confidence of a word w in the corpus feature set of a domain d; wd is the number of times w occurs in domain d; Σwd is the total number of times w occurs in the corpus feature set of the parent of the node whose corpus feature set contains w; dt is the number of corpus feature sets containing w among the corpus feature sets of the sibling nodes of the node whose corpus feature set contains w;

Step 6: add new words to the domain dictionaries to be expanded;

Step 6.1: preprocess the annotation of the new word to obtain the group of words corresponding to that annotation, and let n denote the number of words in the group;

Step 6.2: take the root node of the domain classification tree as the current node;

Step 6.3: using formula (3), calculate in turn the membership degree between the new word and the domain of each child node of the current node in the domain classification tree, and find the maximum, denoted sdc_max;

sdc_{k} = m_{k} \times \prod_{j=1}^{n} wdc_{jk} \qquad (3)

wherein sdc_k is the membership degree between the new word and the domain k of a child node of the current node in the domain classification tree; wdc_jk is the confidence, in domain k, of the j-th word in the group of words corresponding to the new word's annotation; m_k is the number of words, among the n words corresponding to the new word's annotation, whose confidence is highest in domain k;

Step 6.4: if the maximum membership degree sdc_max obtained in step 6.3 is greater than a preset threshold, check whether the node corresponding to sdc_max is a leaf node; if it is, add the new word to the domain dictionary of that node; if it is not, take that node as the current node and return to step 6.3; if sdc_max is not greater than the preset threshold, treat the new word as a common word, do not add it to any domain dictionary to be expanded, and stop;