CN106649662A

CN106649662A - Construction method of domain dictionary

Info

Publication number: CN106649662A
Application number: CN201611149314.3A
Authority: CN
Inventors: 张晓霞; 刘世林
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2016-12-13
Filing date: 2016-12-13
Publication date: 2017-05-10

Abstract

The invention relates to the field of natural language processing, and in particular to a method for constructing a domain dictionary. The method comprises the following steps: on the basis of automatically extracted text keywords, clustering the texts to be processed into different topic text sets; selecting a number of seed words, through manual review, from the domain text set of the dictionary to be constructed; on this basis, analyzing how closely each clustered topic text set is related to the selected domain seed words, and retaining only the closely related topic text sets for expanding the domain dictionary; and, within the related domain, automatically expanding the domain dictionary by means of an algorithm to obtain the corresponding dictionary. With the method provided by the invention, once the domains of the text topics have been distinguished automatically, the domain dictionary to be constructed can be expanded automatically from a small number of seed words. The dictionary is built efficiently and accurately and is strongly targeted to its domain; the method has broad application prospects in text analysis and natural language processing.

Description

A method for constructing a domain dictionary

Technical field

The present invention relates to the field of natural language processing, and more particularly to a method for constructing a domain dictionary.

Background art

With the rapid development of the Internet, large amounts of public web data have been generated, which has given rise to a variety of new industries based on big data technology, such as Internet healthcare, Internet education, and enterprise or personal credit reporting. The rise and growth of these Internet industries depend on the analysis of large amounts of data. Natural language processing plays an important role in big data analysis. Faced with massive online text resources, natural language processing methods can automatically and intelligently determine the sentiment orientation carried by a text or by its publisher, which is of great practical significance for both public opinion analysis and business surveys. With these analysis results, the development of events can be anticipated correctly and corresponding measures taken in advance to achieve a greater positive effect.

Sentiment analysis methods fall into two main classes: methods based on machine learning and methods based on dictionaries. A machine learning method first builds a classifier and then feeds the text to be analyzed into it. Its limitation is that building the classifier requires a large-scale corpus for training, and selecting classification features is also challenging, since the quality of feature selection directly affects classifier performance. A dictionary-based method uses the words in a dictionary as features: the feature vocabulary is extracted by dictionary matching, and a preset model or algorithm is then applied to the extracted vocabulary to judge the tendency or nature of the text, which greatly increases the reliability of the analysis.

For targeted analysis and mining, sentiment analysis based on a sentiment dictionary needs dictionaries that differ considerably from one domain to another. Existing domain dictionaries lack applicability to particular problems and are not well targeted. When a specific domain or topic is analyzed, a large, general-purpose domain dictionary does not give good results, so building a targeted domain dictionary is necessary. Building a dictionary manually, however, is time-consuming and labor-intensive and cannot meet the demands of analyzing massive amounts of text.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by providing a domain dictionary construction method. On the basis of automatically extracted text keywords, the texts to be processed are clustered into text sets of different domains or topics. A small number of seed words for the relevant domain are chosen according to the needs of the analysis; on this basis, the closeness of the relation between each clustered domain or topic text set and the selected domain seed words is analyzed, and only the closely related text sets are retained as the source for expanding the domain dictionary. The domain dictionary is then expanded automatically with a word-correlation analysis algorithm to obtain the corresponding domain dictionary.

To achieve the above object, the invention provides the following technical solution: a domain dictionary construction method comprising the following steps:

(1) extracting the keywords of each text in the set of texts to be processed;

(2) clustering the texts to be processed to generate N topic text sets, where N is an integer and N ≥ 2;

(3) choosing seed words for the domain;

(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets whose frequency exceeds a threshold as the source text set for expanding the domain dictionary;

(5) calculating the degree of association between the seed words and each candidate word in the texts of the source text set, and storing the candidate words whose degree of association reaches a set threshold in the dictionary to be expanded as domain words.

Specifically, the method includes the preprocessing steps of word segmentation, removal of high-frequency words, and removal of stop words.
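The preprocessing steps might be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, "high-frequency" is taken to mean a word that appears in more than a chosen share of the texts, and the names `preprocess`, `stop_words` and `max_doc_ratio` are hypothetical and do not come from the source.

```python
from collections import Counter


def preprocess(texts, stop_words, max_doc_ratio=0.8):
    """Segment each text, then drop stop words and overly frequent words."""
    # Assumption: whitespace splitting stands in for a real word segmenter.
    tokenized = [text.lower().split() for text in texts]
    # Document frequency: the number of texts in which each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Assumption: a word is "high-frequency" if it appears in more than
    # max_doc_ratio of all texts.
    limit = max_doc_ratio * len(texts)
    return [
        [w for w in tokens if w not in stop_words and doc_freq[w] <= limit]
        for tokens in tokenized
    ]


docs = [
    "the bank raised the loan rate",
    "the team won the match",
    "the loan rate fell",
]
cleaned = preprocess(docs, stop_words={"a", "of"})
```

In this toy input the word "the" appears in every text and is removed as a high-frequency word, while "loan" and "rate" appear in two of three texts and are kept.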

Further, in step (1) the keywords of a text are extracted using the following formula:

$$TR(v_i) = \frac{1 - d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$

where TR(v_i) is the importance of word v_i in the text; d is the damping coefficient, generally set to 0.85; N is the number of all words in the undirected graph; relat{v_i} is the set of words that co-occur with word v_i; v_j is any word in relat{v_i}; TR(v_j) is the importance of v_j; and N(p_j) is the number of words that co-occur with v_j.
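A minimal sketch of this importance calculation is given below. It assumes that two words co-occur when they lie within a fixed window of each other, that all scores start at 1/N, and that a fixed number of iterations is enough; the window size, the iteration count, the keyword threshold of 0.3 and the name `text_rank` are illustrative assumptions that the source does not specify.

```python
def text_rank(tokens, window=2, d=0.85, iterations=50):
    """Score each word by iterating the importance formula on a co-occurrence graph."""
    # Undirected graph: words at most `window` positions apart are linked
    # (assumption: the source does not define the co-occurrence window).
    neighbors = {w: set() for w in tokens}
    for i, w in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    n = len(neighbors)  # N: number of distinct words in the graph
    scores = {w: 1.0 / n for w in neighbors}  # assumed uniform start
    for _ in range(iterations):
        # TR(v_i) = (1 - d) / N + d * sum over co-occurring v_j of TR(v_j) / N(p_j)
        scores = {
            w: (1 - d) / n
            + d * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return scores


scores = text_rank(["loan", "rate", "loan", "bank", "loan", "credit"], window=1)
# Hypothetical threshold: keep words whose importance exceeds 0.3.
keywords = [w for w, s in scores.items() if s > 0.3]
```

Here "loan" co-occurs with every other word and receives the highest score, so it is the only word that passes the illustrative threshold.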

Further, clustering the texts to be processed in step (2) comprises the following process:

(2-1) initially, each text to be processed forms a class of its own;

The distance between two classes is defined as the maximum distance between any pair of texts drawn from the two classes. The distance between texts is calculated as follows:

$$C(t1, t2) = \frac{t1 \cap t2}{mid(t1, t2)}$$

where C(t1, t2) denotes the distance between text 1 and text 2, t1 ∩ t2 denotes the number of keywords that text 1 and text 2 have in common, and mid(t1, t2) denotes the mean number of keywords contained in text 1 and text 2. The distance between classes is calculated as follows:

$$Dist(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$

where Dist(c_a, c_b) denotes the distance between any two clusters, c_a and c_b denote the two classes, C(t_a, t_b) denotes the distance between two texts, and t_a and t_b denote two texts with t_a ∈ c_a and t_b ∈ c_b;

(2-2) calculating the distance between every pair of classes, merging the two classes with the smallest distance, and naming the merged class cnew;

(2-3) deleting the merged original clusters from the set of texts to be processed, and adding the new cluster cnew to the clustering result;

(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N clusters. The set then contains the N topics formed by clustering, where the value of N is set according to the actual application.
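The clustering process can be illustrated with the sketch below. One point is an interpretation rather than something the source states: C is called a distance, yet its value rises as two texts share more keywords, so the sketch assumes a text distance of 1 − C. The class distance is still the maximum pairwise text distance, and the two closest classes are merged until N remain. The function names are hypothetical.

```python
from itertools import combinations


def text_distance(kw_a, kw_b):
    """Distance between two texts, each given as a set of keywords."""
    shared = len(kw_a & kw_b)                   # number of common keywords
    mean_size = (len(kw_a) + len(kw_b)) / 2     # mid(t1, t2)
    c = shared / mean_size if mean_size else 0.0
    # Assumption: C grows with overlap, so 1 - C is used as the distance.
    return 1.0 - c


def class_distance(class_a, class_b, keywords):
    """Maximum text distance over all pairs drawn from the two classes."""
    return max(
        text_distance(keywords[a], keywords[b]) for a in class_a for b in class_b
    )


def cluster(keywords, n_classes):
    """Merge the two closest classes until only n_classes remain."""
    classes = [[i] for i in range(len(keywords))]  # (2-1) one class per text
    while len(classes) > n_classes:
        # (2-2) find the pair of classes with the smallest distance
        a, b = min(
            combinations(range(len(classes)), 2),
            key=lambda p: class_distance(classes[p[0]], classes[p[1]], keywords),
        )
        merged = classes[a] + classes[b]
        # (2-3) remove the two merged classes and add the new one
        classes = [c for k, c in enumerate(classes) if k not in (a, b)] + [merged]
    return classes


keyword_sets = [
    {"loan", "rate", "bank"},
    {"loan", "rate", "credit"},
    {"match", "team", "goal"},
    {"team", "goal", "coach"},
]
topics = cluster(keyword_sets, n_classes=2)
```

With these four toy keyword sets and N = 2, the two finance texts end up in one class and the two sports texts in the other.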

Preferably, the degree of association between a candidate word and a seed word in step (5) is calculated as:

$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$

where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively.
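As a worked illustration of the degree of association in step (5), consider made-up probabilities that are not taken from the source: p(word1, word2) = 0.02, p(word1) = 0.05 and p(word2) = 0.1. The base of the logarithm is not stated in the source; the natural logarithm is assumed here.

```latex
MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}
                 = \ln \frac{0.02}{0.05 \times 0.1}
                 = \ln 4 \approx 1.386
```

If the two words were independent, the joint probability would equal the product 0.005, the ratio would be 1, and the degree of association would be 0. Values above 0 therefore indicate that the words occur together more often than chance would suggest.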

Preferably, in step (2), N = 3.

Preferably, in step (3), the number of seed words selected is 50 to 200.

Further, step (3) may instead be performed before step (1) or before step (2).

Preferably, in step (4), only the topic text set in which the seed words occur most frequently is retained as the source text set for expanding the dictionary.

Preferably, in step (5), the threshold for the degree of association between a candidate word and a seed word is set to MI(word1, word2) = 0.2: when the degree of association between a word in the text set and a seed word is ≥ 0.2, the word is added to the dictionary under construction as an expansion word.

Compared with the prior art, the invention has the following beneficial effects. The invention provides a domain dictionary construction method in which the texts to be processed are clustered into different topic text sets on the basis of automatically extracted text keywords, and a certain number of domain seed words are chosen. The seed words are used to determine automatically how closely each clustered text set is related to the domain to be expanded; once the domain of each clustered text set has been identified in this way, only the closely related topic text sets are retained for expanding the domain dictionary. The dictionary is therefore constructed more accurately and more efficiently.

In the method, a number of seed words are chosen, and the choice can follow the specific direction of the analysis, which makes the method more targeted. Building on seed word selection and automatic domain discovery, the degree of association between the seed words and the words in the texts of the source text set is calculated, and closely related words are retained as expansion words of the domain dictionary. Compared with a general-purpose domain dictionary, the dictionary built by the method is more flexible and more practical, and is suited to text analysis of a particular problem or topic.

Description of the drawings：

Fig. 1 is a block diagram of the domain dictionary construction method.

Fig. 2 is a flow chart of the implementation of step (5) of the domain dictionary construction method.

Detailed description of the embodiments

The invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.

A domain dictionary construction method is provided. On the basis of automatically extracted text keywords, the texts to be processed are clustered into different topic text sets. A number of seed words are selected, through manual review, from the domain text set of the dictionary to be constructed. On this basis, the closeness of the relation between each clustered topic text set and the selected domain seed words is analyzed, and only the closely related topic text sets are retained for expanding the domain dictionary. The domain dictionary is then expanded automatically by means of an algorithm to obtain the corresponding domain dictionary. Once the domains of the text topics have been distinguished automatically, the method expands the dictionary to be built automatically from a small number of seed words. The dictionary is built efficiently and accurately and is strongly targeted to its domain; the method has broad application prospects in text analysis and natural language processing.

To achieve the above object, the invention provides the following technical solution: a domain dictionary construction method comprising the following steps, as shown in Fig. 1:

(1) extracting the keywords of each text in the set of texts to be processed;

(2) clustering the texts to be processed to form N topic text sets, where N is an integer and N ≥ 2;

(3) choosing a small number of domain seed words. Words with clear domain characteristics are chosen. Selecting the seed words manually makes the dictionary more targeted to the specific domain or problem and more flexible to apply.

(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets in which the seed word frequency exceeds a threshold as the source text set for expanding the domain dictionary. Clustering sorts the texts to be processed into text sets on different topics; texts within the same topic are more closely related to one another, which prepares and screens the corpus for the subsequent dictionary expansion.

After clustering has formed the different topic text sets, the frequency with which the seed words occur among the keywords of each topic's texts is calculated, and from it the closeness of each topic to the domain of the dictionary under construction is assessed. Text sets that are only distantly related are discarded, so that dictionary expansion is carried out only in topics close to the domain. This considerably improves the quality of the source corpus and the accuracy of the expansion. Because expansion is carried out only in the text set closest to the domain, the scope and amount of computation are also reduced, which improves the efficiency of dictionary expansion.

(5) calculating the degree of association between the seed words and each word of the source text set, and storing the words whose degree of association reaches the set threshold in the dictionary to be expanded as domain words.

Keywords in step (1) are extracted using the following formula:

$$TR(v_i) = \frac{1 - d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$

where TR(v_i) is the importance of word v_i in the text. d is the damping coefficient, generally set to 0.85. N is the number of all words in the undirected graph (after word segmentation the text is abstracted into an undirected graph in which each word of the text is a node). relat{v_i} is the set of words that co-occur with word v_i. v_j is any word in relat{v_i}, TR(v_j) is the importance of v_j, and N(p_j) is the number of words that co-occur with v_j.

The formula is evaluated iteratively, and the words whose TR(v_i) exceeds a threshold are extracted as the keywords of the text. The automatic keyword extraction prepares for text clustering.

Further, clustering the texts to be processed in step (2) comprises the following process:

(2-1) initially, each text to be processed forms a class of its own;

The distance between classes is calculated as follows:

$$Dist(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$

(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N clusters. The set then contains the N topics formed by clustering, where the value of N is set according to the actual application.

Preferably, N = 3 in step (2-4): the texts to be processed are divided into only three topics, which simplifies subsequent calculation.

Preferably, in step (3), the number of domain seed words selected is 50 to 200. Too few seed words impair the accuracy of the dictionary expansion, while too many increase the labor and time cost of selection.

Preferably, in step (4), only the topic text set in which the seed words occur most frequently is retained as the source text set for expanding the dictionary. This step selects, from the individual topic text sets, the one most closely related to the seed words, so that the corpus used for expansion better matches the characteristics of the domain and the expansion is of higher quality and more targeted.
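The selection of the source text set in step (4) might look like the sketch below. It assumes that each topic text set is given as a list of per-text keyword lists and that the frequency is a raw count of seed-word occurrences among those keywords; the function names are hypothetical. A raw count favors larger topic sets, so dividing by the number of keywords could be a reasonable variation, although the source does not discuss normalization.

```python
def seed_frequency(topic_keywords, seed_words):
    """Count seed-word occurrences among the keywords of one topic text set."""
    return sum(1 for kws in topic_keywords for w in kws if w in seed_words)


def pick_source_topic(topics, seed_words):
    """Return the index of the topic with the highest seed-word count, plus all counts."""
    counts = [seed_frequency(t, seed_words) for t in topics]
    best = max(range(len(topics)), key=counts.__getitem__)
    return best, counts


# Two toy topic text sets; each inner list holds the keywords of one text.
topic_sets = [
    [["loan", "rate"], ["bank", "loan"]],
    [["team", "goal"], ["coach", "bank"]],
]
best_topic, seed_counts = pick_source_topic(topic_sets, seed_words={"loan", "bank"})
```

In this example the first topic contains three seed-word occurrences and the second only one, so the first topic is kept as the source text set.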

Preferably, the degree of association between a word and a seed word in step (5) is calculated on the principle of mutual information, using the formula:

$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$

where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively. Mutual information is used to analyze the degree of association between words; the algorithm is simple, easy to implement and computationally efficient. Mutual information is an analysis method from computational linguistics that measures the interdependence between two objects; in filtering problems it is used to measure how well a feature discriminates a topic. When the domain dictionary is built, once the seed words have been chosen, mutual information is used to calculate the correlation between each word to be added and the existing seed words; the higher the value, the more strongly the word is associated with the seed words.

Preferably, the threshold in step (5) is set to MI(word1, word2) = 0.2: when the degree of association between a candidate word in the text set and a seed word is ≥ 0.2, the word is added to the dictionary under construction as an expansion word. The calculation process of step (5) is shown in Fig. 2.
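The expansion in step (5) can be sketched as follows. Several choices here are assumptions that the source does not specify: probabilities are estimated as the share of source texts containing the word or word pair, the natural logarithm is used, and a candidate is accepted as soon as its degree of association with any one seed word reaches the threshold. The name `expand_dictionary` is hypothetical.

```python
import math


def expand_dictionary(source_texts, seed_words, threshold=0.2):
    """Return candidate words whose association with a seed word reaches the threshold."""
    docs = [set(tokens) for tokens in source_texts]
    n = len(docs)

    def prob(*words):
        # Assumption: probability = share of texts that contain all given words.
        return sum(1 for d in docs if all(w in d for w in words)) / n

    candidates = set().union(*docs) - set(seed_words)
    expanded = set()
    for cand in candidates:
        for seed in seed_words:
            joint = prob(cand, seed)
            if joint == 0:
                continue  # the pair never co-occurs, so no evidence of association
            mi = math.log(joint / (prob(cand) * prob(seed)))
            if mi >= threshold:
                expanded.add(cand)
                break  # assumption: one sufficiently related seed word is enough
    return expanded


source = [
    ["loan", "interest", "bank"],
    ["loan", "interest"],
    ["bank", "weather"],
    ["weather", "sports"],
]
new_words = expand_dictionary(source, seed_words={"loan"})
```

In this toy corpus "interest" always appears together with the seed word "loan" and scores ln 2 ≈ 0.69, so it is added; "bank" scores 0 and the remaining words never co-occur with the seed word, so they are left out.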

Claims

1. A domain dictionary construction method, characterized by comprising the following steps:

(1) extracting the keywords of each text in the set of texts to be processed;

(2) clustering the texts to be processed to generate N topic text sets, where N is an integer and N ≥ 2;

(3) choosing seed words for the domain;

(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets whose frequency exceeds a threshold as the source text set for expanding the domain dictionary;

(5) calculating the degree of association between the seed words and each candidate word in the texts of the source text set, and storing the candidate words whose degree of association reaches a threshold in the dictionary to be expanded as domain words.

2. The method of claim 1, characterized in that step (1) is preceded by the preprocessing steps of word segmentation, removal of high-frequency words, and removal of stop words.

3. The method of claim 1, characterized in that in step (1) the keywords are extracted using the following formula:

$$TR(v_i) = \frac{1 - d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$

where TR(v_i) is the importance of word v_i in the text; d is the damping coefficient, generally set to 0.85; N is the number of all words in the undirected graph; relat{v_i} is the set of words that co-occur with word v_i; v_j is any word in relat{v_i}; TR(v_j) is the importance of v_j; and N(p_j) is the number of words that co-occur with v_j.

4. The method of claim 3, characterized in that clustering the texts to be processed in step (2) comprises the following process:

(2-1) initially, each text to be processed forms a class of its own;

The distance between two classes is defined as the maximum distance between any pair of texts drawn from the two classes, and the distance between texts is calculated as follows:

$$C(t1, t2) = \frac{t1 \cap t2}{mid(t1, t2)}$$

where C(t1, t2) denotes the distance between text 1 and text 2, t1 ∩ t2 denotes the number of keywords that text 1 and text 2 have in common, and mid(t1, t2) denotes the mean number of keywords contained in text 1 and text 2;

The distance between classes is calculated as follows:

$$Dist(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$

where Dist(c_a, c_b) denotes the distance between any two clusters, c_a and c_b denote the two classes, C(t_a, t_b) denotes the distance between two texts, and t_a and t_b denote two texts with t_a ∈ c_a and t_b ∈ c_b;

(2-2) calculating the distance between every pair of classes, merging the two classes with the smallest distance, and naming the merged class cnew;

(2-3) deleting the merged clusters from the set of texts to be processed, and adding the new cluster cnew to the clustering result;

(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N clusters.

5. The method of claim 4, characterized in that the degree of association between a candidate word and a seed word in step (5) is calculated as:

$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$

where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively.

6. The method of claim 5, characterized in that in step (2), N = 3.

7. The method of claim 6, characterized in that in step (3), the number of seed words selected is 50 to 200.

8. The method of claim 7, characterized in that in step (4), only the topic text set in which the seed words occur most frequently is retained as the source text set for expanding the dictionary.

9. The method of claim 8, characterized in that in step (5), the threshold for the degree of association between a word to be added and a seed word is set to 0.2.