CN111552806B - Method for unsupervised construction of entity set in building field - Google Patents

Method for unsupervised construction of entity set in building field

Info

Publication number
CN111552806B
CN111552806B CN202010302187.6A
Authority
CN
China
Prior art keywords
word
score
words
probability
calculated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010302187.6A
Other languages
Chinese (zh)
Other versions
CN111552806A (en)
Inventor
万里
秦梦瑶
丁玉杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010302187.6A priority Critical patent/CN111552806B/en
Publication of CN111552806A publication Critical patent/CN111552806A/en
Application granted granted Critical
Publication of CN111552806B publication Critical patent/CN111552806B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for unsupervised construction of an entity set in the building field, which comprises the following steps: S1, acquiring texts to be screened, and dividing each sentence in the acquired texts into M characters and/or words, wherein M is a positive integer greater than or equal to 1; S2, selecting the first K words with the highest overall score as a candidate word set D, wherein K is a positive integer greater than or equal to 1; and S3, screening out, from the candidate word set D, words whose semantic features are similar to those of the building-field word set as domain words. The invention can classify the words in the acquired text to be screened, retaining domain words and filtering out words that do not belong to the building field.

Description

Method for unsupervised construction of entity set in building field
Technical Field
The invention relates to the technical field of knowledge graphs, and in particular to a method for unsupervised construction of an entity set in the building field.
Background
As a semantic network, the knowledge graph can solve many practical problems when empowered by big data. It can be envisaged, however, that much knowledge has still not broken through the scale bottleneck, and that other types of knowledge representation will likewise be able to solve more practical problems when empowered by big data. The knowledge required by a growing number of application fields exceeds the scope of the knowledge graph and calls for other forms of knowledge (such as production rules, Bayesian networks, and decision trees). Natural language is exceptionally complex: it is ambiguous and diverse, and its semantic understanding is uncertain and context-dependent. The root cause of the difficulty machines have in understanding natural language is that human language understanding rests on human cognitive ability, and the background knowledge formed by human cognitive experience is the fundamental support for human language understanding.
Disclosure of Invention
The invention aims to at least solve the technical problems in the prior art, and particularly creatively provides a method for unsupervised construction of an entity set in the building field.
In order to achieve the above object, the present invention provides a method for unsupervised construction of a building domain entity set, comprising the following steps:
S1, acquiring texts to be screened, and dividing each sentence in the acquired texts into M characters and/or words, wherein M is a positive integer greater than or equal to 1;
S2, selecting the first K words with the highest overall score as a candidate word set D, wherein K is a positive integer greater than or equal to 1;
and S3, screening out, from the candidate word set D, words whose semantic features are similar to those of the building-field word set as domain words.
In a preferred embodiment of the present invention, step S1 further includes: mapping all the characters and/or words obtained by segmentation to a word vector space.
In a preferred embodiment of the present invention, step S2 includes the following steps:
S21, judging whether the calculated cohesion score is greater than or equal to a preset cohesion score threshold:
if the calculated cohesion score is greater than or equal to the preset threshold, executing step S22;
if the calculated cohesion score is smaller than the preset threshold, discarding the word;
S22, judging whether the calculated left adjacency score is greater than or equal to a preset left adjacency score threshold:
if the calculated left adjacency score is greater than or equal to the preset threshold, executing step S23;
if the calculated left adjacency score is smaller than the preset threshold, discarding the word;
S23, judging whether the calculated right adjacency score is greater than or equal to a preset right adjacency score threshold:
if the calculated right adjacency score is greater than or equal to the preset threshold, calculating the overall score of the word;
if the calculated right adjacency score is smaller than the preset threshold, discarding the word.
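The three judgments of steps S21–S23 form a short-circuiting cascade: a word is dropped as soon as one score misses its preset threshold, and only survivors reach the overall-score computation. A minimal Python sketch (the dictionary keys and threshold names are illustrative, not from the patent):

```python
def passes_cascade(stats, t_cohesion, t_left, t_right):
    """Apply the S21-S23 checks in order; a word is discarded as soon as
    any of its scores falls below the corresponding preset threshold."""
    if stats['cohesion'] < t_cohesion:    # S21: cohesion score check
        return False
    if stats['left'] < t_left:            # S22: left adjacency score check
        return False
    if stats['right'] < t_right:          # S23: right adjacency score check
        return False
    return True                           # survivor: compute overall score next
```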
In a preferred embodiment of the present invention, the cohesion score is calculated in step S21 by:
MI(X, Y) = log2( P(X, Y) / (P(X) · P(Y)) ),
wherein P(X, Y) represents the joint probability of the words X and Y appearing together in the text to be screened;
P(X) represents the probability of the word X appearing in the text to be screened;
P(Y) represents the probability of the word Y appearing in the text to be screened;
MI(X, Y) represents the cohesion score of the words X and Y;
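Given the definitions above, the cohesion score is the pointwise mutual information of the two components; a minimal sketch operating on pre-estimated probabilities:

```python
import math

def cohesion_score(p_xy, p_x, p_y):
    """MI(X, Y) = log2(P(X, Y) / (P(X) * P(Y))).

    p_xy: joint probability of X and Y occurring together;
    p_x, p_y: marginal probabilities of X and Y in the corpus.
    """
    return math.log2(p_xy / (p_x * p_y))
```

If X and Y co-occur exactly as often as independence predicts (P(X, Y) = P(X)·P(Y)), the score is 0; stronger-than-chance co-occurrence pushes it above 0, which is why a high value signals a cohesive candidate word.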
the left adjacency score is calculated in step S22 by:
E_L(W) = −∑_{a∈A} P(aW|W) · log2 P(aW|W),
wherein A represents the set formed by all characters appearing on the left side of the word W in the text to be screened;
a represents a character in the set A;
aW represents the word formed by placing character a to the left of word W;
P(aW|W) represents the conditional probability that aW occurs given that W occurs;
E_L(W) represents the left adjacency score;
the method of calculating the right adjacency score in step S23 is:
Figure BDA0002454412600000031
b represents a set formed by all characters appearing on the right side of the character W in the text to be screened;
b represents a certain word in the vocabulary set B;
wb represents the word with word b to the right of word W;
p (Wb | W) represents a conditional probability of the word Wb occurring under the condition of the word W;
ER(W) represents a right adjacency score.
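Both adjacency scores are the entropy of a candidate word's neighbor distribution, so one helper covers E_L(W) and E_R(W); a sketch (collecting the neighbor characters from the corpus is assumed to happen elsewhere):

```python
import math
from collections import Counter

def adjacency_entropy(neighbor_chars):
    """Entropy of the distribution of characters adjacent to a word W.

    Pass the characters observed immediately to the left of W to get
    E_L(W), or those to the right to get E_R(W)."""
    counts = Counter(neighbor_chars)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A word whose neighbors are varied (high entropy) has free boundaries and is a plausible standalone word; a word always preceded or followed by the same character (entropy near 0) is probably a fragment.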
The overall score is calculated in step S23 by:
Score = λ_1 · MI(X, Y) + λ_2 · E_L(W) + λ_3 · E_R(W),
wherein λ_1 is the adjustment coefficient for the cohesion score;
MI(X, Y) represents the cohesion score of the words X and Y;
λ_2 is the adjustment coefficient for the left adjacency score;
E_L(W) represents the left adjacency score;
λ_3 is the adjustment coefficient for the right adjacency score;
E_R(W) represents the right adjacency score.
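The overall score is a plain weighted sum of the three statistics; a sketch with the adjustment coefficients exposed as parameters (the equal default weights are illustrative, the patent leaves their values open):

```python
def overall_score(mi, e_left, e_right, lam1=1.0, lam2=1.0, lam3=1.0):
    """Score = lam1*MI(X,Y) + lam2*E_L(W) + lam3*E_R(W)."""
    return lam1 * mi + lam2 * e_left + lam3 * e_right
```

The top-K words by this score then form the candidate word set D of step S2.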
In a preferred embodiment of the present invention, in step S3, the method for screening out the words in the candidate word set D whose semantic features are similar to those of the building-field word set is:
P(Z) = ∫ P(Z|θ) P(θ) dθ,
wherein Z represents a word in the candidate word set D;
θ represents a model parameter;
P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(Z) represents the probability of the word Z occurring under the parameter θ;
P(D_C) = ∫ ∏_i P(Z_i|θ) P(θ) dθ,
wherein D_C represents the set formed by the words in the candidate word set D that belong to the building-field word set;
Z_i represents a word in the word set D_C;
P(Z_i|θ) represents the conditional probability of the word Z_i occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(Z|D_C) = ∫ P(Z|θ) P(θ|D_C) dθ,
wherein P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ|D_C) represents the posterior probability density function;
P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(θ|D_C) = P(D_C|θ) P(θ) / P(D_C),
wherein P(D_C|θ) represents the likelihood function;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(θ|D_C) represents the posterior probability density function;
Score(Z) = log( P(Z|D_C) / P(Z) ),
wherein P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(Z) represents the probability of the word Z occurring under the parameter θ;
Score(Z) represents the final score that the word Z belongs to the building-field word set D_C;
if the final score of the word Z belonging to the building-field word set D_C is greater than or equal to a preset score value, the word Z in the candidate word set belongs to the building-field word set D_C.
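Once P(Z|D_C) and P(Z) have been estimated (in practice the integrals above would be evaluated with a concrete parametric model for θ), the final decision can be sketched as a log-ratio test. This assumes Score(Z) compares those two probabilities, as the surrounding definitions suggest, and the default threshold is purely illustrative:

```python
import math

def final_score(p_z_given_dc, p_z):
    """Score(Z) = log(P(Z | D_C) / P(Z)): positive when the word Z is more
    likely under the building-field word set D_C than in the background."""
    return math.log(p_z_given_dc / p_z)

def is_domain_word(p_z_given_dc, p_z, preset=0.0):
    """Accept Z into D_C when its final score reaches the preset value."""
    return final_score(p_z_given_dc, p_z) >= preset
```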
In a preferred embodiment of the present invention, step S1 includes the following steps:
S = {s_1, s_2, s_3, …, s_n}, wherein S represents the text to be screened and s_i represents the i-th sentence in the text to be screened, i being a positive integer less than or equal to n;
s_i = V_1 V_2 V_3 … V_M, wherein V_j represents the j-th character in the i-th sentence of the text to be screened, j being a positive integer less than or equal to M;
s_i′ = {V_1, V_2, V_3, …, V_M}, wherein s_i′ means that the i-th sentence is divided into M words.
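The segmentation of step S1 can be sketched as splitting S into sentences on end punctuation and each sentence into its individual characters (a production system would use a proper word segmenter; character-level splitting keeps the sketch dependency-free):

```python
import re

def split_to_sentences_and_chars(text):
    """S -> [s_1, ..., s_n]; each sentence s_i -> [V_1, ..., V_M]."""
    sentences = [s for s in re.split(r'[。！？.!?]+', text) if s]
    return [list(s) for s in sentences]
```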
In a preferred embodiment of the present invention, the word vector space is expressed as:
e_i = W^wrd · v_i,
wherein e_i represents the low-dimensional vectorized representation of a word;
W^wrd is the word parameter matrix obtained through training;
v_i represents the high-dimensional vector input into the computer.
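If v_i is taken to be a one-hot vocabulary-index vector (an assumption; the patent only calls it a high-dimensional input vector), then e_i = W^wrd · v_i reduces to a column lookup in the trained matrix; a sketch:

```python
import numpy as np

def embed(W_wrd, word_index, vocab_size):
    """e_i = W_wrd @ v_i with v_i one-hot at word_index (assumed encoding)."""
    v = np.zeros(vocab_size)
    v[word_index] = 1.0
    return W_wrd @ v          # equivalent to the column W_wrd[:, word_index]
```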
In summary, by adopting the above technical scheme, the invention can classify the words in the acquired text to be screened, retaining domain words and filtering out words that do not belong to the building field.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic block diagram of the process of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
The invention provides a method for unsupervised construction of an entity set in the building field, which comprises the following steps:
S1, acquiring texts to be screened, and dividing each sentence in the acquired texts into M characters and/or words, wherein M is a positive integer greater than or equal to 1;
S2, selecting the first K words with the highest overall score as a candidate word set D, wherein K is a positive integer greater than or equal to 1;
S3, extracting terms from normative documents of the building field, expressing the semantic features of building-field words with word vectors, taking the mean of the feature vectors of words with the same attribute as the building-field word set, screening out, from the candidate word set D, words whose semantic features are similar to those of the building-field word set as domain words, and filtering out non-domain words irrelevant to the building field.
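One way to realize the "mean of feature vectors as the building-field word set" idea in S3 is to average the term vectors and keep candidates close to that mean; a sketch using cosine similarity (the similarity measure and the 0.5 threshold are illustrative choices, not specified by the patent):

```python
import numpy as np

def domain_mean(term_vectors):
    """Mean feature vector of terms extracted from building-field
    normative documents."""
    return np.mean(np.asarray(term_vectors, dtype=float), axis=0)

def screen(candidates, mean_vec, threshold=0.5):
    """candidates: dict mapping word -> vector. Keep words whose cosine
    similarity to the building-field mean vector is at least threshold."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return [w for w, vec in candidates.items()
            if cos(np.asarray(vec, dtype=float), mean_vec) >= threshold]
```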
In a preferred embodiment of the present invention, step S1 further includes: mapping all the characters and/or words obtained by segmentation to a word vector space.
In a preferred embodiment of the present invention, step S2 includes the following steps:
S21, judging whether the calculated cohesion score is greater than or equal to a preset cohesion score threshold:
if the calculated cohesion score is greater than or equal to the preset threshold, executing step S22;
if the calculated cohesion score is smaller than the preset threshold, discarding the word;
S22, judging whether the calculated left adjacency score is greater than or equal to a preset left adjacency score threshold:
if the calculated left adjacency score is greater than or equal to the preset threshold, executing step S23;
if the calculated left adjacency score is smaller than the preset threshold, discarding the word;
S23, judging whether the calculated right adjacency score is greater than or equal to a preset right adjacency score threshold:
if the calculated right adjacency score is greater than or equal to the preset threshold, calculating the overall score of the word;
if the calculated right adjacency score is smaller than the preset threshold, discarding the word.
In a preferred embodiment of the present invention, the cohesion score is calculated in step S21 by:
MI(X, Y) = log2( P(X, Y) / (P(X) · P(Y)) ),
wherein P(X, Y) represents the joint probability of the words X and Y appearing together in the text to be screened;
P(X) represents the probability of the word X appearing in the text to be screened;
P(Y) represents the probability of the word Y appearing in the text to be screened;
MI(X, Y) represents the cohesion score of the words X and Y;
the left adjacency score is calculated in step S22 by:
E_L(W) = −∑_{a∈A} P(aW|W) · log2 P(aW|W),
wherein A represents the set formed by all characters appearing on the left side of the word W in the text to be screened;
a represents a character in the set A;
aW represents the word formed by placing character a to the left of word W;
P(aW|W) represents the conditional probability that aW occurs given that W occurs;
E_L(W) represents the left adjacency score;
the method of calculating the right adjacency score in step S23 is:
Figure BDA0002454412600000073
b represents a set formed by all characters appearing on the right side of the character W in the text to be screened;
b represents a certain word in the vocabulary set B;
wb represents the word with word b to the right of word W;
p (Wb | W) represents a conditional probability of the word Wb occurring under the condition of the word W;
ER(W) represents a right adjacency score.
The overall score is calculated in step S23 by:
Score = λ_1 · MI(X, Y) + λ_2 · E_L(W) + λ_3 · E_R(W),
wherein λ_1 is the adjustment coefficient for the cohesion score;
MI(X, Y) represents the cohesion score of the words X and Y;
λ_2 is the adjustment coefficient for the left adjacency score;
E_L(W) represents the left adjacency score;
λ_3 is the adjustment coefficient for the right adjacency score;
E_R(W) represents the right adjacency score.
In a preferred embodiment of the present invention, in step S3, the method for screening out the words in the candidate word set D whose semantic features are similar to those of the building-field word set is:
P(Z) = ∫ P(Z|θ) P(θ) dθ,
wherein Z represents a word in the candidate word set D, i.e., a word in (D − D_C); D_C represents the set formed by the words in the candidate word set D that belong to the building-field word set;
θ represents a model parameter;
P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(Z) represents the probability of the word Z occurring under the parameter θ;
P(D_C) = ∫ ∏_i P(Z_i|θ) P(θ) dθ,
wherein D_C represents the set formed by the words in the candidate word set D that belong to the building-field word set;
Z_i represents a word in the word set D_C;
P(Z_i|θ) represents the conditional probability of the word Z_i occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(Z|D_C) = ∫ P(Z|θ) P(θ|D_C) dθ,
wherein P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ|D_C) represents the posterior probability density function;
P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(θ|D_C) = P(D_C|θ) P(θ) / P(D_C),
wherein P(D_C|θ) represents the likelihood function;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(θ|D_C) represents the posterior probability density function;
Score(Z) = log( P(Z|D_C) / P(Z) ),
wherein P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(Z) represents the probability of the word Z occurring under the parameter θ;
Score(Z) represents the final score that the word Z belongs to the building-field word set D_C;
if the final score of the word Z belonging to the building-field word set D_C is greater than or equal to a preset score value, the word Z in the candidate word set belongs to the building-field word set D_C.
In a preferred embodiment of the present invention, step S1 includes the following steps:
S = {s_1, s_2, s_3, …, s_n}, wherein S represents the text to be screened and s_i represents the i-th sentence in the text to be screened, i being a positive integer less than or equal to n;
s_i = V_1 V_2 V_3 … V_M, wherein V_j represents the j-th character in the i-th sentence of the text to be screened, j being a positive integer less than or equal to M;
s_i′ = {V_1, V_2, V_3, …, V_M}, wherein s_i′ means that the i-th sentence is divided into M words.
In a preferred embodiment of the present invention, the word vector space is expressed as:
e_i = W^wrd · v_i,
wherein e_i represents the low-dimensional vectorized representation of a word;
W^wrd is the word parameter matrix obtained through training;
v_i represents the high-dimensional vector input into the computer.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (6)

1. A method for unsupervised construction of a building field entity set is characterized by comprising the following steps:
S1, acquiring texts to be screened, and dividing each sentence in the acquired texts into M characters and/or words, wherein M is a positive integer greater than or equal to 1;
S2, selecting the first K words with the highest overall score as a candidate word set D, wherein K is a positive integer greater than or equal to 1;
S3, screening out, from the candidate word set D, words whose semantic features are similar to those of the building-field word set as domain words; the method for screening out the words in the candidate word set D whose semantic features are similar to those of the building-field word set comprises:
P(Z) = ∫ P(Z|θ) P(θ) dθ,
wherein Z represents a word in the candidate word set D;
θ represents a model parameter;
P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(Z) represents the probability of the word Z occurring under the parameter θ;
P(D_C) = ∫ ∏_i P(Z_i|θ) P(θ) dθ,
wherein D_C represents the set formed by the words in the candidate word set D that belong to the building-field word set;
Z_i represents a word in the word set D_C;
P(Z_i|θ) represents the conditional probability of the word Z_i occurring given the parameter θ;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(Z|D_C) = ∫ P(Z|θ) P(θ|D_C) dθ,
wherein P(Z|θ) represents the conditional probability of the word Z occurring given the parameter θ;
P(θ|D_C) represents the posterior probability density function;
P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(θ|D_C) = P(D_C|θ) P(θ) / P(D_C),
wherein P(D_C|θ) represents the likelihood function;
P(θ) represents the prior probability density function;
P(D_C) represents the probability of the word set D_C occurring;
P(θ|D_C) represents the posterior probability density function;
Score(Z) = log( P(Z|D_C) / P(Z) ),
wherein P(Z|D_C) represents the probability that the word Z belongs to the word set D_C;
P(Z) represents the probability of the word Z occurring under the parameter θ;
Score(Z) represents the final score that the word Z belongs to the building-field word set D_C;
if the final score of the word Z belonging to the building-field word set D_C is greater than or equal to a preset score value, the word Z in the candidate word set belongs to the building-field word set D_C.
2. The unsupervised construction method of building domain entity set of claim 1, further comprising, in step S1: mapping all the characters and/or words obtained by segmentation to a word vector space.
3. The unsupervised construction method of a set of building domain entities according to claim 1, characterized in that in step S2, it comprises the following steps:
S21, judging whether the calculated cohesion score is greater than or equal to a preset cohesion score threshold:
if the calculated cohesion score is greater than or equal to the preset threshold, executing step S22;
if the calculated cohesion score is smaller than the preset threshold, discarding the word;
S22, judging whether the calculated left adjacency score is greater than or equal to a preset left adjacency score threshold:
if the calculated left adjacency score is greater than or equal to the preset threshold, executing step S23;
if the calculated left adjacency score is smaller than the preset threshold, discarding the word;
S23, judging whether the calculated right adjacency score is greater than or equal to a preset right adjacency score threshold:
if the calculated right adjacency score is greater than or equal to the preset threshold, calculating the overall score of the word;
if the calculated right adjacency score is smaller than the preset threshold, discarding the word.
4. The method for unsupervised construction of a set of building field entities according to claim 3, wherein the cohesion score is calculated in step S21 by:
MI(X, Y) = log2( P(X, Y) / (P(X) · P(Y)) ),
wherein P(X, Y) represents the joint probability of the words X and Y appearing together in the text to be screened;
P(X) represents the probability of the word X appearing in the text to be screened;
P(Y) represents the probability of the word Y appearing in the text to be screened;
MI(X, Y) represents the cohesion score of the words X and Y;
the left adjacency score is calculated in step S22 by:
E_L(W) = −∑_{a∈A} P(aW|W) · log2 P(aW|W),
wherein A represents the set formed by all characters appearing on the left side of the word W in the text to be screened;
a represents a character in the set A;
aW represents the word formed by placing character a to the left of word W;
P(aW|W) represents the conditional probability that aW occurs given that W occurs;
E_L(W) represents the left adjacency score;
the right adjacency score is calculated in step S23 by:
E_R(W) = −∑_{b∈B} P(Wb|W) · log2 P(Wb|W),
wherein B represents the set formed by all characters appearing on the right side of the word W in the text to be screened;
b represents a character in the set B;
Wb represents the word formed by placing character b to the right of word W;
P(Wb|W) represents the conditional probability that Wb occurs given that W occurs;
E_R(W) represents the right adjacency score;
the overall score is calculated in step S23 by:
Score = λ_1 · MI(X, Y) + λ_2 · E_L(W) + λ_3 · E_R(W),
wherein λ_1 is the adjustment coefficient for the cohesion score;
MI(X, Y) represents the cohesion score of the words X and Y;
λ_2 is the adjustment coefficient for the left adjacency score;
E_L(W) represents the left adjacency score;
λ_3 is the adjustment coefficient for the right adjacency score;
E_R(W) represents the right adjacency score.
5. The unsupervised construction method of a set of building domain entities according to claim 1, characterized in that in step S1, it comprises the following steps:
S = {s_1, s_2, s_3, …, s_n}, wherein S represents the text to be screened and s_i represents the i-th sentence in the text to be screened, i being a positive integer less than or equal to n;
s_i = V_1 V_2 V_3 … V_M, wherein V_j represents the j-th character in the i-th sentence of the text to be screened, j being a positive integer less than or equal to M;
s_i′ = {V_1, V_2, V_3, …, V_M}, wherein s_i′ means that the i-th sentence is divided into M words.
6. The method for unsupervised construction of a set of building field entities according to claim 2, wherein the word vector space is expressed as:
e_i = W^wrd · v_i,
wherein e_i represents the low-dimensional vectorized representation of a word;
W^wrd is the word parameter matrix obtained through training;
v_i represents the high-dimensional vector input into the computer.
CN202010302187.6A 2020-04-16 2020-04-16 Method for unsupervised construction of entity set in building field Active CN111552806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010302187.6A CN111552806B (en) 2020-04-16 2020-04-16 Method for unsupervised construction of entity set in building field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010302187.6A CN111552806B (en) 2020-04-16 2020-04-16 Method for unsupervised construction of entity set in building field

Publications (2)

Publication Number Publication Date
CN111552806A CN111552806A (en) 2020-08-18
CN111552806B (en) 2021-11-02

Family

ID=72007475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010302187.6A Active CN111552806B (en) 2020-04-16 2020-04-16 Method for unsupervised construction of entity set in building field

Country Status (1)

Country Link
CN (1) CN111552806B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649849A (en) * 2016-12-30 2017-05-10 上海智臻智能网络科技股份有限公司 Text information base building method and device and searching method, device and system
CN108846033A (en) * 2018-05-28 2018-11-20 北京邮电大学 The discovery and classifier training method and apparatus of specific area vocabulary

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646100B (en) * 2011-02-21 2016-02-24 腾讯科技(深圳)有限公司 Domain term acquisition methods and system
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN107577739B (en) * 2017-08-28 2020-04-10 广东惠禾科技发展有限公司 Semi-supervised domain word mining and classifying method and equipment
CN110162681B (en) * 2018-10-08 2023-04-18 腾讯科技(深圳)有限公司 Text recognition method, text processing method, text recognition device, text processing device, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649849A (en) * 2016-12-30 2017-05-10 上海智臻智能网络科技股份有限公司 Text information base building method and device and searching method, device and system
CN108846033A (en) * 2018-05-28 2018-11-20 北京邮电大学 The discovery and classifier training method and apparatus of specific area vocabulary

Also Published As

Publication number Publication date
CN111552806A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
CN109977223B (en) Method for classifying papers by using capsule mechanism-fused graph convolution network
CN111104513B (en) Short text classification method for question and answer service of game platform user
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN111125434B (en) Relation extraction method and system based on ensemble learning
CN109299668B (en) Hyperspectral image classification method based on active learning and cluster analysis
CN111967258B (en) Method for constructing coreference resolution model, coreference resolution method and medium
CN111177386B (en) Proposal classification method and system
CN110493612B (en) Barrage information processing method, server and computer readable storage medium
Chen et al. An improved SOM algorithm and its application to color feature extraction
CN114998602A (en) Domain adaptive learning method and system based on low confidence sample contrast loss
Zhuang et al. A handwritten Chinese character recognition based on convolutional neural network and median filtering
CN111552806B (en) Method for unsupervised construction of entity set in building field
Duh et al. Beyond log-linear models: Boosted minimum error rate training for n-best re-ranking
CN109192197A (en) Big data speech recognition system Internet-based
CN116304063B (en) Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method
CN110032642B (en) Modeling method of manifold topic model based on word embedding
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN109871448B (en) Short text classification method and system
CN111639189A (en) Text graph construction method based on text content features
CN115345158A (en) New word discovery method, device, equipment and storage medium based on unsupervised learning
JPH11143875A (en) Device and method for automatic word classification
CN114091469B (en) Network public opinion analysis method based on sample expansion
CN116051924A (en) Divide-and-conquer defense method for image countermeasure sample
CN113705647B (en) Dual semantic feature extraction method based on dynamic interval
CN112529637B (en) Service demand dynamic prediction method and system based on context awareness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant