CN114757147A - BERT-based automatic hierarchical tree expansion method - Google Patents

BERT-based automatic hierarchical tree expansion method

Info

Publication number
CN114757147A
CN114757147A (application CN202210350872.5A)
Authority
CN
China
Prior art keywords
entity
hierarchical tree
space
bert
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210350872.5A
Other languages
Chinese (zh)
Inventor
陶明阳
王星
陈吉
张鑫
刘亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Linyi University
Original Assignee
Liaoning Technical University
Linyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University, Linyi University filed Critical Liaoning Technical University
Priority to CN202210350872.5A priority Critical patent/CN114757147A/en
Publication of CN114757147A publication Critical patent/CN114757147A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a BERT-based automatic hierarchical tree expansion method, which comprises: extracting an entity set from a corpus and generating word vectors for it, then performing preliminary completion of each entity space corresponding to the hierarchical tree input by the user; generating an optimal class name for each entity space using the MASK mechanism of BERT, generating a candidate set for each entity space through class-name-guided expansion, and supplementing high-quality entities into the corresponding entity space after calculating each candidate entity's score and its similarity score with the seed set; and performing entity disambiguation to obtain the hierarchical tree expansion result. The method uses a language model to understand the hierarchical tree input by the user, obtains candidate words at each position, fills them in, and finally returns a hierarchical tree that meets the user's requirements.

Description

BERT-based automatic hierarchical tree expansion method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an automatic hierarchical tree expansion method based on BERT.
Background
Hierarchical trees are widely used in many downstream natural language processing tasks. Because manual labeling is costly and data quality is uneven, a method for automatically constructing hierarchical trees is urgently needed. Existing hierarchical tree expansion methods mainly handle the hypernym-hyponym ("is-a") relation, which greatly limits their applicability to real tasks. The invention therefore lets the user input a preset hierarchical tree context format for the task, and the system completes the whole hierarchical tree according to that format. Existing expansion methods, however, achieve neither high precision nor high efficiency, and do not meet the needs of downstream tasks well.
Two main tasks of hierarchical tree expansion are optimized. First, for width expansion, a BERT pre-trained model is used: each entity space is assigned a class name, candidate entities are obtained through the class names, and the width expansion result is finally obtained after ANNOY filtering. Second, for depth expansion, the hypernym-hyponym relation score of two nodes is calculated using Word2Vec.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a BERT-based automatic hierarchical tree expansion method that uses a language model to understand the hierarchical tree input by the user, obtains candidate words at each position, fills them in, and finally obtains a hierarchical tree that meets the user's requirements.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention provides a BERT-based automatic hierarchical tree expansion method, which comprises the following steps:
s1: extracting an entity set through a corpus, generating word vectors of the entity set, and performing preliminary completion on each entity space corresponding to a hierarchical tree input by a user;
s2: generating an optimal class name for each entity space by using a MASK mechanism of BERT, generating a candidate set for each entity space by using a class name guide expansion mode, and supplementing high-quality entities to the corresponding entity space after calculating the score of each candidate entity and the similarity score with the seed set;
s3: and carrying out entity disambiguation and obtaining a hierarchical tree expansion result.
Further, the specific steps of step S1 are as follows:
step S1.1: extracting entities in the corpus as an extended entity set by using a data mining mode;
step S1.2: obtaining a Word vector corresponding to each entity by using a Word2Vec model;
step S1.3: for each entity space, ANNOY or word vector similarity is used for preliminary expansion, so that the semantic information of the entity space is represented more accurately.
Preferably, the specific steps of step S2 are as follows:
step S2.1: for each entity space, finding out the possible class names and scores thereof of the entity space through the MLM task of the BERT, and generating the optimal class name and negative class name set of the entity space through the scores;
step S2.2: using the optimal class name and the negative class name set to expand the entity in each entity space, using the expanded entity as a candidate set, and calculating the score of each candidate entity;
step S2.3: and calculating the similarity score of each candidate word and the seed entity by using an ANNOY algorithm, and weighting and summing the similarity score and the score of the class name extension to obtain an extension set of each entity space.
Further, the specific steps of step S3 are as follows:
step S3.1: counting entities that appear in 2 or more different entity spaces, namely ambiguous entities;
step S3.2: each entity keeps only its highest-scoring position, generating the final hierarchical tree expansion result.
Further, the specific steps of step S3.2 are:
first, if the entity is among the entities input by the user, the expanded occurrences are discarded directly and the user-specified position is kept;
second, ancestor entities among the ambiguous positions are preferentially retained;
third, the position with the higher similarity score to the seed entities of its entity space is retained.
Therefore, the invention has the following beneficial effects:
1. Firstly, entities are extracted from the corpus through data mining and corresponding word vectors are generated with Word2Vec. Secondly, the efficient ANNOY model and word-vector similarity are used to expand the user-input tree structure on a small scale. Finally, a candidate set is generated for each entity space through BERT-based class-name expansion and filtered by ANNOY, and the final hierarchical tree expansion result is generated after the entity disambiguation module.
2. A certain number of entities are expanded into each entity space in advance to express the semantic information of each entity space more accurately; since the expansion difficulty of this step is low, a more efficient expansion mode is selected to improve overall expansion efficiency.
3. Expansion is carried out on the basis of the pre-trained model BERT, which improves expansion accuracy.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a flow chart of a BERT-based automatic hierarchical tree expansion method of the present invention;
FIG. 2 is a flow chart of the hierarchical tree expansion algorithm of the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
The invention expands the hierarchical tree structure input by the user and returns a more complete hierarchical tree structure to the user. First, each entity space in the hierarchical tree is preliminarily complemented based on ANNOY and Word2Vec to enhance semantic information of each entity space. Secondly, the class expansion method based on the BERT expands each entity space respectively. And finally, performing entity disambiguation on the expanded hierarchical tree and returning a final expansion result to the user.
As shown in fig. 1 and 2, the BERT-based automatic hierarchical tree expansion method of the present invention includes the following steps:
step 1: and extracting the entity set through the corpus and generating a word vector of the entity set. And performing preliminary completion on each entity space corresponding to the hierarchical tree input by the user. The entity space is all entities under each entity node;
step 2: the best class name is generated for each entity space using the BERT MASK mechanism. And a candidate set is generated for each entity space in a manner of class name guided extension. After the score of each candidate entity and the similarity score of the seed set are calculated, high-quality entities are supplemented to a corresponding entity space;
and step 3: after step 2, an entity may be in 2 or more different entity spaces, so entity disambiguation is required and a hierarchical tree expansion result is obtained.
The specific steps of step 1 are as follows:
step 1.1: extracting entities in the corpus by using a data mining mode to serve as an extended entity set;
step 1.2: obtaining a Word vector corresponding to each entity by using a Word2Vec model;
step 1.3: for each entity space, ANNOY or word vector similarity is used for preliminary expansion, so that the semantic information of the entity space is represented more accurately. The choice between the two is determined by the number of entities already in each entity space.
The specific steps of step 2 are as follows:
step 2.1: for each entity space, finding out possible class names and scores of the entity space through an MLM task of the BERT, and generating an optimal class name and a negative class name set of the entity space through the scores;
step 2.2: using the optimal class name and the negative class name set to expand the entity in each entity space, using the expanded entity as a candidate set, and calculating the score of each candidate entity;
step 2.3: and calculating the similarity score of each candidate word and the seed entity by using an ANNOY algorithm, and weighting and summing the similarity score and the score of the class name extension to obtain an extension set of each entity space.
The specific steps of step 3 are as follows:
step 3.1: counting entities that appear in 2 or more different entity spaces, namely ambiguous entities;
step 3.2: each entity keeps only its highest-scoring position, generating the final hierarchical tree expansion result. First, if the entity is among the entities input by the user, the expanded occurrences are discarded directly and the user-specified position is kept; second, ancestor entities among the ambiguous positions are preferentially retained; third, the position with the higher similarity score to the seed entities of its entity space is retained.
1) Physical space completion
In the present invention, small-batch entity expansion is performed on each entity space corresponding to the user-input hierarchical tree y. Because the expansion scale is small, the more efficient ANNOY-based and Word2Vec-based entity expansion methods are selected, and for each entity space P_i a different expansion method is chosen according to its number of entities.
a. ANNOY-based entity expansion method
Entity spaces whose number of entities reaches n are expanded with the ANNOY-based entity expansion method. First, all entities in the candidate entity set E are encoded; the encoding can be Word2Vec, GloVe, etc. Second, the entity encodings are inserted into the ANNOY index. Finally, for an entity space P, n entities are selected as a seed set S; for each seed entity s_i, the nearest-neighbour entities and their similarity scores are retrieved through ANNOY, and the entities are ranked by similarity score, the ranking being recorded as L_i. The final score of each candidate entity e is calculated from the ranking lists as follows:
score(e) = Σ_{i=1}^{n} 1/r_i    (1)

where r_i is the rank of candidate entity e in list L_i. The main reasons for expanding with ANNOY are: for small-scale entity expansion, the accuracy of most entity expansion methods meets the expected requirement, so the efficiency of entity expansion is prioritized; and ANNOY performs queries with a binary-tree data structure, which greatly improves query efficiency.
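The rank-based aggregation described above can be sketched in Python. In practice the per-seed neighbour lists would come from an ANNOY index; the function name and the exact rank-reciprocal form are illustrative assumptions, since the patent gives the equation only as an image.

```python
def aggregate_rank_scores(ranked_lists):
    """Combine per-seed nearest-neighbour rankings into one candidate score.

    ranked_lists[i] holds the candidate entities ordered by similarity to
    seed s_i (rank 1 first).  Each candidate e receives sum_i 1/r_i, where
    r_i is e's rank in the i-th list; a list that does not contain e
    contributes nothing.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, entity in enumerate(ranking, start=1):
            scores[entity] = scores.get(entity, 0.0) + 1.0 / rank
    # Highest aggregated score first.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A candidate near the top of many seed lists accumulates a high score, so the method favours entities close to the whole seed set rather than to a single seed.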
b. Entity completion method based on Word2Vec
Entity spaces whose number of entities does not reach n use the Word2Vec-based entity completion method. As shown in fig. 2, the entity space under entity e_2 cannot be expanded with ANNOY; instead, an abstract word vector e_v of the entity space P (the red node in the figure) is obtained by combining the hypernym-hyponym relations of the sibling entity spaces with the properties of Word2Vec. When there are t eligible neighbouring entity spaces, e_v is computed as

e_v = e_f + (1/t) Σ_{i=1}^{t} (ē_i − e_{f_i})    (2)

where e_{f_i} is the parent word vector of neighbouring entity space i, ē_i is the average word vector of the nodes in neighbouring entity space i, and e_f is the parent node of the current entity space. The abstract word vector e_v serves as the centre word vector of the entity space; a similar-entity ranking list is generated with ANNOY, and the top-k entities are taken as the entity completion result.
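The abstract-vector construction above can be sketched as follows; the additive offset form (parent vector plus the average parent-to-children offset observed in the sibling spaces) is an assumed reading, since the patent shows the equation only as an image.

```python
import numpy as np

def abstract_vector(parent_vec, neighbor_parent_vecs, neighbor_mean_vecs):
    """Assumed form of e_v = e_f + (1/t) * sum_i (mean_i - parent_i).

    parent_vec:           word vector e_f of the current space's parent node
    neighbor_parent_vecs: parent word vectors of the t neighbouring spaces
    neighbor_mean_vecs:   average word vectors of the entities in those spaces
    """
    offsets = [m - p for p, m in zip(neighbor_parent_vecs, neighbor_mean_vecs)]
    return parent_vec + np.mean(offsets, axis=0)
```

The resulting e_v would then be used as the query vector for an ANNOY nearest-neighbour search, and the top-k hits become the completion result.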
2) Extension method based on class name
In the invention, how to expand each entity space based on class names is introduced; the expansion is divided into 3 steps: class name generation, class name selection, and class-name-based entity expansion.
a. Class name generation
The class name generation module takes the entity set of an entity space as input and generates a set of candidate class names for it. First, note that the goal of class name generation is similar to the hypernym detection task, so class probing queries are constructed using six Hearst patterns. More specifically, three entities in the current set and one Hearst pattern are randomly selected to construct a query, for example "[MASK] such as N_{p1}, N_{p2}, and N_{p3}.", where N_{p1}, N_{p2}, N_{p3} are 3 random entities of the entity space and [MASK] is the position predicted by the language model, i.e., the position of the class name. By repeating this random selection process, a set of queries can be constructed and input into a pre-trained language model (BERT) to obtain the entities at the [MASK] position.
The above method can only generate unigram class names, which does not meet actual requirements. The solution is to query the LM a first time and retrieve the first K most likely words, then construct a new query by appending each retrieved word after the [MASK] token, for example "[MASK] class1 such as N_{p1}, N_{p2}, and N_{p3}.". This process is repeated at most three times, and the class names of all generated noun phrases are retained.
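The query construction can be sketched as follows. The patent does not list its six Hearst patterns, so the patterns below are representative examples, and the function name is illustrative.

```python
import random

# Representative Hearst patterns; the patent's exact six are not listed.
HEARST_PATTERNS = [
    "[MASK] such as {0}, {1}, and {2}.",
    "such [MASK] as {0}, {1}, and {2}.",
    "{0}, {1}, {2} or other [MASK].",
    "{0}, {1}, {2} and other [MASK].",
    "[MASK] including {0}, {1}, and {2}.",
    "[MASK], especially {0}, {1}, and {2}.",
]

def build_probe_queries(entities, n_queries=5, seed=0):
    """Build class-probing queries: 3 random entities + 1 random pattern each."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        e1, e2, e3 = rng.sample(entities, 3)
        queries.append(rng.choice(HEARST_PATTERNS).format(e1, e2, e3))
    return queries
```

Each query would then be fed to a masked language model (e.g. a BERT fill-mask head) to read off class-name candidates at the [MASK] position.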
b. Class name selection
In this module, the candidate class names generated above are ranked to select one best class name that represents the entire entity set; in the next module, some negative class names will be used to filter out erroneous entities.
First, a corpus-based similarity measure between a candidate entity e and a class name c is introduced. Given a class name c, 6 entity probing queries are constructed from the 6 Hearst patterns, where [MASK] now stands for an entity; the 6 queries are input to the language model to obtain the retrieved entity set X_c^q of each query q. Furthermore, X_e denotes the set of all occurrences of entity e in the seed corpus. The similarity of e and c is defined as

M_k(e, c) = the k queries q with the largest max_{x_e ∈ X_e, x ∈ X_c^q} cos(x_e, x)    (3)

sim(e, c) = (1/k) Σ_{q ∈ M_k(e,c)} max_{x_e ∈ X_e, x ∈ X_c^q} cos(x_e, x)    (4)

where cos(x, x′) is the cosine similarity between two vectors x and x′. The inner max operator finds the maximum similarity between each occurrence of e and the set of entity probing queries constructed from c. The outer top-k selection identifies the k queries most similar to e and takes their average as the final similarity between entity e and class name c. This is similar to finding the k best occurrences of entity e that match any probing query of class c, so it improves on similarity measures that only use context-free representations of the entity and the class name.
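The inner-max / top-k-average similarity can be sketched as below; the vector inputs stand in for the contextualised occurrence embeddings, and the function name is an illustrative assumption.

```python
import numpy as np

def entity_class_similarity(entity_occurrences, query_entity_sets, k=3):
    """sim(e, c): for each probe query built from class name c, take the max
    cosine similarity between any occurrence vector of e and any vector
    retrieved for that query; then average the k best queries."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    per_query = [
        max(cos(x_e, x) for x_e in entity_occurrences for x in vectors)
        for vectors in query_entity_sets
    ]
    per_query.sort(reverse=True)
    top = per_query[:k]
    return sum(top) / len(top)
```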
After the entity-class similarity score is defined, an entity can be selected from the current set and a ranked list of candidate class names obtained according to their similarity to that entity. Given the entity set, |E| ranked lists L_1, L_2, …, L_{|E|} are thus obtained. Finally, all lists are aggregated by score into a final class-name ranking list, and one of the top-ranked class names is selected as the positive class name, denoted c_p; at the same time, class names ranked lower than the positive class name c_p in each list are selected as the negative class name set C_N.
c. Entity extension based on class name
In this module, the positive class name c_p selected above and the negative class name set C_N are used to assist in selecting new entities to add to the set. The score of each entity e_i combines two scoring functions. The first is the score of entity e_i against the positive class name c_p:

s_local(e_i) = sim(e_i, c_p)    (5)

where M_k is defined in formula (3). This score is referred to as a local score, since it looks only at the top-k best occurrences in the corpus. The second scoring function calculates the similarity between each candidate entity and the existing entities in the current set based on their context-free representations. Given the current entity set E, several entities are first sampled from E, denoted E_s; the score of each candidate entity e_i is then calculated as

s_global(e_i) = (1/|E_s|) Σ_{e ∈ E_s} cos(x_{e_i}, x_e)    (6)
because it uses a context-free representation, better reflects the overall position of the entity in the embedding space, and measures entity-entity similarity in a more global sense, thus becoming a global score. Such global scores supplement the local scores described above, and their geometric mean is used to finally rank all candidate entities:
Figure BDA0003580165540000093
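Assuming the context-free representations are plain word vectors, the global score and the combined ranking score can be sketched as:

```python
import numpy as np

def global_score(candidate_vec, sampled_vecs):
    """Mean cosine similarity of a candidate to entities sampled from the
    current set E_s, using context-free (static) vectors."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sum(cos(candidate_vec, v) for v in sampled_vecs) / len(sampled_vecs)

def final_score(local, global_):
    """Geometric mean combining the local (class-name) score and the
    global (context-free) score."""
    return (local * global_) ** 0.5
```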
as the expansion process iterates, the wrong entities may be contained in the set, resulting in semantic shifts. Therefore, with the negative class names of the above selections, a new ranking algorithm is developed to improve the quality and robustness of entity selection. First, E is resampled from the current physical space EsAnd T times are carried out to obtain T entity ordered lists. And secondly, obtaining T category ranking lists according to a category name sorting process. Finally, screening out entities meeting the conditions, wherein the entities belonging to the target semantic class can meet two conditions intuitively: (1) it appears in the first few bits of the multiple entity ranking tables; (2) selected positive class name c in its corresponding class ranking listpShould be listed above any negative class name. Combining these two criteria, a rank aggregation score is defined, as follows:
Figure BDA0003580165540000101
wherein
Figure BDA0003580165540000102
Is an index function, riIs entity eiRank list L ofiAnd finally, selecting the set of top entity last entity spaces.
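The rank aggregation over T resampled runs can be sketched as follows; list-based ranks are used for clarity, and the exact indicator form is an assumption.

```python
def rank_aggregation_score(entity, entity_rankings, class_rankings,
                           pos_class, neg_classes):
    """Sum 1/rank over the T runs, counting a run only when the positive
    class name outranks every negative class name in that run's class list."""
    total = 0.0
    for ranking, class_ranking in zip(entity_rankings, class_rankings):
        pos = class_ranking.index(pos_class)
        clean = all(pos < class_ranking.index(c)
                    for c in neg_classes if c in class_ranking)
        if clean and entity in ranking:
            total += 1.0 / (ranking.index(entity) + 1)
    return total
```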
3) Entity disambiguation
In this work, for each task-related entity, the goal is to find its single best position in the output hierarchical tree. Therefore, when an entity is found at multiple positions during the tree expansion process, it needs to be disambiguated, i.e., the best position for the entity must be determined, to resolve such conflicts.
Given a set C of conflicting nodes corresponding to different positions of the same entity, the following three rules are applied to select the best position from the set. First, if the entity is among the entities input by the user, that position is selected directly and the following two steps are skipped. Otherwise, for each pair of nodes in C, check whether one node is an ancestor of the other, and keep only the ancestor node. Finally, the score of each remaining node e ∈ C is computed as

score(e) = sim(e, par(e)) · (1/|sib(e)|) Σ_{e′ ∈ sib(e)} sim(e, e′)    (9)

where sib(e) denotes the set of all siblings of e and par(e) denotes its parent. This equation essentially captures the joint similarity of a node to its siblings and to its parent; the node with the highest score is selected. Finally, for each node in C that is not selected, the whole subtree rooted at it is deleted, and any children added to it afterwards are clipped and placed into the "child list" of its parent.
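A minimal sketch of the disambiguation confidence, assuming cosine similarity over word vectors as the underlying sim(·,·); the multiplicative parent-times-mean-sibling form is an assumed reading of the patent's equation.

```python
import numpy as np

def position_confidence(entity_vec, parent_vec, sibling_vecs):
    """Joint similarity of a candidate position to its parent and siblings:
    parent similarity times mean sibling similarity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sib = sum(cos(entity_vec, s) for s in sibling_vecs) / len(sibling_vecs)
    return cos(entity_vec, parent_vec) * sib
```

Among the surviving conflict nodes, the position with the highest confidence would be kept and the others pruned.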
Eventually, a hierarchical tree satisfying the user input is returned.
In the invention, the 2 tasks of hierarchical tree construction, width expansion and depth expansion, are improved. For depth expansion, small-batch expansion of the entities in each entity space is performed based on the ANNOY algorithm, so as to express the semantics of the entity space more accurately. For width expansion, each entity space is assigned a class name based on BERT, and the entity space is then expanded based on that class name.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (5)

1. An automatic hierarchical tree expansion method based on BERT is characterized by comprising the following steps:
s1: extracting an entity set through a corpus, generating word vectors of the entity set, and performing preliminary completion on each entity space corresponding to a hierarchical tree input by a user;
s2: generating an optimal class name for each entity space by using a MASK mechanism of BERT, generating a candidate set for each entity space by using a class name guide expansion mode, and supplementing high-quality entities to the corresponding entity space after calculating the score of each candidate entity and the similarity score with the seed set;
s3: and carrying out entity disambiguation and obtaining a hierarchical tree expansion result.
2. The BERT-based automatic hierarchical tree expansion method according to claim 1, wherein the step S1 is specifically performed as follows:
step S1.1: extracting entities in the corpus by using a data mining mode to serve as an extended entity set;
step S1.2: obtaining a Word vector corresponding to each entity by using a Word2Vec model;
step S1.3: for each entity space, ANNOY or word vector similarity is used for preliminary expansion, so that the semantic information of the entity space is represented more accurately.
3. The BERT-based automatic hierarchical tree expansion method according to claim 1, wherein the step S2 is specifically performed as follows:
step S2.1: for each entity space, finding out the possible class names and scores thereof of the entity space through the MLM task of the BERT, and generating the optimal class name and negative class name set of the entity space through the scores;
step S2.2: using the optimal class name and the negative class name set to expand the entity of each entity space, using the expanded entity as a candidate set, and calculating the score of each candidate entity;
step S2.3: and calculating the similarity score of each candidate word and the seed entity by using an ANNOY algorithm, and weighting and summing the similarity score and the score of the class name extension to obtain an extension set of each entity space.
4. The BERT-based automatic hierarchical tree expansion method according to claim 1, wherein the step S3 is specifically performed as follows:
step S3.1: counting entities that appear in 2 or more different entity spaces, namely ambiguous entities;
step S3.2: each entity keeps only its highest-scoring position, generating the final hierarchical tree expansion result.
5. The BERT-based automatic hierarchical tree expansion method according to claim 4, wherein the step S3.2 comprises the following specific steps:
first, if the entity is among the entities input by the user, the expanded occurrences are discarded directly and the user-specified position is kept;
second, ancestor entities among the ambiguous positions are preferentially retained;
third, the position with the higher similarity score to the seed entities of its entity space is retained.
CN202210350872.5A 2022-04-02 2022-04-02 BERT-based automatic hierarchical tree expansion method Pending CN114757147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210350872.5A CN114757147A (en) 2022-04-02 2022-04-02 BERT-based automatic hierarchical tree expansion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210350872.5A CN114757147A (en) 2022-04-02 2022-04-02 BERT-based automatic hierarchical tree expansion method

Publications (1)

Publication Number Publication Date
CN114757147A true CN114757147A (en) 2022-07-15

Family

ID=82329092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210350872.5A Pending CN114757147A (en) 2022-04-02 2022-04-02 BERT-based automatic hierarchical tree expansion method

Country Status (1)

Country Link
CN (1) CN114757147A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982390A (en) * 2023-03-17 2023-04-18 北京邮电大学 Industrial chain construction and iterative expansion development method



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination