CN116050419B - Unsupervised identification method and system oriented to scientific literature knowledge entity - Google Patents


Info

Publication number
CN116050419B
Authority
CN
China
Prior art keywords
word
words
entity
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310323198.6A
Other languages
Chinese (zh)
Other versions
CN116050419A (en)
Inventor
Zhang Hui (张晖)
Lan Haoyu (兰浩宇)
Yang Chunming (杨春明)
Chen Yang (陈洋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology
Priority to CN202310323198.6A
Publication of CN116050419A
Application granted
Publication of CN116050419B
Status: Active
Anticipated expiration

Classifications

    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F40/242 Dictionaries
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/30 Semantic analysis
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of knowledge entity identification and discloses an unsupervised identification method and system for scientific literature knowledge entities. The method solves the problem in the prior art that knowledge entities are difficult to identify in scientific text data resources that lack a public labeled data set.

Description

Unsupervised identification method and system oriented to scientific literature knowledge entity
Technical Field
The invention relates to the technical field of knowledge entity identification, in particular to an unsupervised identification method and an unsupervised identification system for scientific literature knowledge entities.
Background
A knowledge entity in scientific literature is a term entity that expresses a key knowledge point in the professional literature and carries rich scientific knowledge. In recent years, the recognition and extraction of knowledge entities in scientific literature has attracted wide attention, and conferences on the subject, such as the "Scientific Literature Knowledge Entity Extraction and Evaluation Workshop" and the "Workshop on Natural Language Processing for Scientific Text", have been held successively to discuss how to accurately and comprehensively identify and extract knowledge entities from scientific texts, which is of great significance for constructing a knowledge system in a specific scientific field.
Current research on identifying and extracting knowledge entities and their categories mainly uses four kinds of methods: manual extraction, dictionary- and rule-based methods, traditional machine learning, and deep learning. The better-performing work is carried out under supervised or semi-supervised conditions, which require a large amount of high-quality labeled data as a corpus basis; however, specific scientific fields often lack such labeled data, so manual intervention is needed to complete the labeling work. Because the division of knowledge entity types has no fixed standard and differs across fields (knowledge entities can generally be divided into categories such as methods, tools, theories, and resources), non-domain experts cannot perform the corpus labeling work, which greatly increases time and human resource costs.
Existing unsupervised knowledge entity identification methods are still at an exploratory stage; their effect is not superior to that of supervised learning, but they avoid manual labeling work. Their basic idea is to construct a set of entity and category representative words from publicly available structured data (an electric service manual) to act as a guiding basis, predict the words in the text with the whole-word masking technique, and then calculate the similarity between text words and representative words, thereby completing named entity recognition and type judgment. In such methods, constructing the representative word set as a guiding basis requires the support of published structured data; yet specific scientific fields lack public data sets and have only unlabeled text data resources, so these methods cannot be directly migrated to knowledge entity recognition in scientific literature.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an unsupervised identification method and system for scientific literature knowledge entities, which solve the problem in the prior art that knowledge entities are difficult to identify in scientific text data resources that lack a public labeled data set.
The invention solves the above problems through the following technical scheme:
A whole-word masking model is pre-trained with unlabeled scientific literature text data; a set of knowledge entity representative words and their categories is constructed as a judgment basis by combining contrastive learning and clustering methods; the words in a scientific literature text are then predicted with the pre-trained whole-word masking model, and whether a word in the scientific literature text is a knowledge entity is judged, and its category determined, by calculating the similarity between the predicted words and the representative words.
As a preferred technical scheme, the method comprises the following steps:
S1, pre-training: processing the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructing a domain dictionary in combination with a string frequency statistical algorithm, then segmenting the training corpus into words under the guidance of the domain dictionary and inputting it into the whole-word masking model for training, so that the model learns the contextual semantic and grammatical features of words in the relevant scientific domain;
S2, knowledge entity category representative word learning: inputting the training corpus segmented in S1, in combination with the domain dictionary, into a word vector representation model for training to obtain vector representations of the words in the domain dictionary, re-learning the word vectors with a contrastive learning structure model, and clustering to obtain a set of knowledge entity representative words and their categories, which serves in the recognition flow as the basis for judging whether a text word is a knowledge entity;
S3, knowledge entity recognition: masking the words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, and then calculating similarity scores between the obtained predicted words and the words in the representative word set constructed in S2, so as to judge whether a masked word is a knowledge entity and determine its category.
As a preferred technical solution, step S2 includes the following steps:
S21, performing word vector representation model training on the segmented training corpus, extracting the word vectors of the words in the domain dictionary, and applying data enhancement twice to each extracted word vector to obtain two new word vectors that share the category and semantic features of the original word vector but differ in value; these new word vectors are mutually positive samples;
S22, each of the two generated new word vectors forms a negative sample pair with any other word vector after data enhancement; denoting the number of word vectors before the two data enhancements as N, the word vector space after enhancement contains 2N word vectors, so each new word vector forms negative sample pairs with the 2N-2 other word vectors;
S23, re-learning and re-characterizing the twice-enhanced word vectors into a new vector space through the contrastive learning structure model; in this space a loss function constrains the distance between positive samples to become ever closer and the distance between negative samples to become ever farther, so that the word vectors are dispersed and distributed as uniformly as possible in the new representation space;
S24, clustering the re-characterized word vectors, calculating the semantic similarity between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is greater than the set threshold, and determining the category represented by each cluster from the specific entity words it contains, thereby obtaining the required knowledge entity representative word set (see the sketch after this list).
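For illustration, the screening in S24 can be sketched as follows. This is a minimal sketch; the threshold value 0.8 and all names are assumptions, not values fixed by the invention:

```python
import numpy as np

def representative_words(words, vecs, labels, centers, thresh=0.8):
    """S24 sketch: keep, per cluster, the words whose cosine similarity
    to their own cluster center exceeds the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    reps = {j: [] for j in range(len(centers))}
    for w, v, j in zip(words, vecs, labels):
        if cos(v, centers[j]) > thresh:
            reps[j].append(w)
    return reps  # cluster id -> representative words; categories are named manually
```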
As a preferred embodiment, in step S23, the loss function is as follows:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\,\lVert z_j \rVert}$$

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left(\ell_{2k-1,\,2k} + \ell_{2k,\,2k-1}\right)$$

wherein i, j, and k denote sample numbers; ℓ_{i,j} denotes the loss of the sample pair formed by samples i and j; z_i, z_j, and z_k denote the vectors of the samples numbered i, j, and k after transformation by the contrastive learning structure model; sim(·,·) denotes the similarity of two samples, calculated as cosine similarity; N denotes the total number of samples before data enhancement; 1_{[k≠i]} denotes an adjustment parameter taking the value 0 or 1 (its value is 1 when k ≠ i, otherwise 0); τ denotes a temperature parameter used to control the degree of uniformity of the sample distribution; L denotes the final loss function.
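In PyTorch, this NT-Xent-style loss can be sketched as follows. This is an illustrative reconstruction under the definitions above, not the patented implementation; the function name and the default temperature are ours:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: [N, d] transformed vectors of the two enhanced views;
    row i of z1 and row i of z2 form a positive pair."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # cosine similarity via dot product
    sim = z @ z.t() / tau                               # [2N, 2N] similarity matrix
    sim.fill_diagonal_(float("-inf"))                   # the indicator 1[k != i]
    pos = (torch.arange(2 * n, device=z.device) + n) % (2 * n)  # index of each positive
    return F.cross_entropy(sim, pos)                    # mean of -log softmax terms = L
```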
As a preferable technical scheme, in step S24, the K-means algorithm is adopted to cluster the re-characterized word vectors, comprising the following steps:
S241, selecting K words in the word vector space re-characterized after contrastive learning as the initial cluster centers;
S242, calculating the distances between all other word vectors in the word vector space and each cluster center, and assigning each sample word to the closest cluster; the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word is considered to belong to that cluster;
S243, after all sample words in the vector space have been processed, calculating the mean vector of the sample words in each cluster, taking it as the new cluster center, and updating the original cluster center; the mean vector calculation formula of the sample words is as follows:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

wherein μ_j denotes the mean vector of the sample words in cluster C_j, x denotes a word vector in cluster C_j, and |C_j| denotes the number of sample words in cluster C_j;
S244, repeating steps S242 to S243 until the cluster centers no longer change, completing the training (see the sketch below).
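A plain NumPy sketch of these K-means steps (the initialization strategy, iteration cap, and names are assumptions):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """X: [N, d] re-characterized word vectors. Returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # S241: initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)                           # S242: nearest center
        new_centers = np.array([X[labels == j].mean(axis=0)     # S243: mean vectors
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                   # S244: converged
            break
        centers = new_centers
    return labels, centers
```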
As a preferable embodiment, the cluster number K is set as follows:
Assume the data to be classified have been clustered by the clustering algorithm, finally obtaining K clusters. For each sample word (sample point) in each cluster, its silhouette coefficient is calculated separately; for each sample point i the following indices are calculated:
a(i): the average distance from sample point i to the other sample points belonging to the same cluster; the smaller a(i), the greater the likelihood that the sample point belongs to that category;
b(i): the minimum, over the other clusters C, of the average distance from sample point i to all samples in cluster C; the calculation formula of b(i) is:

$$b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{x \in C} d(i, x)$$

The silhouette coefficient of sample point i is then:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein s(i) denotes the silhouette coefficient of sample point i;
the average of the silhouette coefficients of all sample points is the average silhouette coefficient S of the clustering result, with S ∈ [−1, 1]. The closer the samples within a cluster and the farther apart the samples of different clusters, the larger the average silhouette coefficient and the better the clustering effect; the K with the largest average silhouette coefficient is therefore chosen (see the sketch below).
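Using scikit-learn, the silhouette-based choice of K can be sketched as follows (the K range and random seed are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X: np.ndarray, k_range=range(2, 11)) -> int:
    """Return the K whose clustering has the largest average silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average s(i) over all samples
    return max(scores, key=scores.get)
```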
As a preferred technical solution, S3 comprises the following steps:
S31, performing word segmentation on the text to be detected, identifying the nouns w in the text, and masking them;
S32, predicting the possible output words p of a masked word by using the whole-word masking model trained in S1;
S33, combining the knowledge entity representative word set obtained in S2, calculating the score Score(w, c_j) that the masked word w belongs to category c_j;
S34, setting a threshold: when the score Score(w, c_j) is greater than the threshold, the masked word is determined to be a knowledge entity belonging to the corresponding entity category c_j; otherwise, the masked word is deemed not to be a knowledge entity (a prediction sketch follows).
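The mask-and-predict step in S31/S32 can be sketched with the Hugging Face fill-mask pipeline. A public Chinese BERT-WWM checkpoint stands in for the in-domain model pre-trained in S1; the example sentence and variable names are ours, and the threshold 0.6 is the value used in the embodiment below:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="hfl/chinese-bert-wwm-ext")

# a noun w in the segmented sentence is replaced by the mask token (S31)
text = "采用高功率[MASK]对材料损伤进行判定"
preds = fill(text, top_k=20)                       # S32: candidate output words
delta = 0.6                                        # probability threshold from S341
kept = [(r["token_str"], r["score"]) for r in preds if r["score"] > delta]
```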
As a preferred embodiment, in step S33, the calculation method of the score Score(w, c_j) that the masked word w belongs to category c_j is as follows:
S341, predicting the possible words p_k for the masked word w by using the pre-trained whole-word masking model; setting a threshold δ and taking out each vocabulary item p_k whose predicted probability is greater than δ; calculating the average semantic similarity between each extracted predicted word p_k and all representative words r_t of category c_j; then performing a weighted average of these similarities over the extracted predicted words p_k, finally obtaining the semantic similarity Sim(w, c_j) between the masked word w and the entity category c_j; the formula is as follows:

$$\mathrm{Sim}(w, c_j) = \sum_{k} \alpha_k \cdot \frac{1}{m_j} \sum_{t=1}^{m_j} \mathrm{sim}(p_k, r_t)$$

wherein m_j is the number of representative words contained in category c_j of the representative word set, and α_k is the weight of predicted word p_k in the weighted average (e.g., its normalized prediction probability);
S342, entity categories containing more entity words are set to have larger weights, so different weights are set for the different entity categories in the representative word set; letting n_j denote the number of entity words contained in entity cluster c_j, the weight calculation formula is as follows:

$$w_j = \frac{\log n_j}{\sum_{l=1}^{K} \log n_l}$$

wherein w_j denotes the weight value given to category c_j and K is the number of entity categories;
S343, recalculating the score Score(w, c_j) of the masked word w belonging to category c_j; the calculation formula is as follows (see the sketch below):

$$\mathrm{Score}(w, c_j) = w_j \cdot \mathrm{Sim}(w, c_j)$$
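Under the stated assumptions (probability-normalized weights α_k and log-normalized category weights w_j, which are our reading of the text), the scoring can be sketched as:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def category_scores(preds, reps):
    """preds: list of (predicted-word vector, prediction probability), prob > delta.
    reps:  dict mapping category name -> list of representative-word vectors.
    Returns Score(w, c_j) = w_j * Sim(w, c_j) for every category c_j."""
    total_p = sum(p for _, p in preds)
    log_sum = sum(np.log(len(r)) for r in reps.values())   # assumes every m_j >= 2
    scores = {}
    for cat, rvecs in reps.items():
        sim = sum((p / total_p) * np.mean([cos(v, r) for r in rvecs])
                  for v, p in preds)                        # Sim(w, c_j)
        scores[cat] = (np.log(len(rvecs)) / log_sum) * sim  # w_j * Sim(w, c_j)
    return scores
```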
As a preferred technical solution, step S1 includes the following steps:
S11, collecting the title, keyword, and abstract data of documents in the relevant scientific field from a public database to form basic corpus data; after de-duplicating the keywords and manually removing words that obviously do not belong to knowledge entities, adding the keywords to the domain dictionary, and splicing the title and abstract data to form the basic corpus;
S12, extracting the character strings whose frequency falls in a specified range from the basic corpus with an N-gram string frequency statistical algorithm; for character strings already present in the domain dictionary, updating their frequency, and directly adding character strings that do not yet appear in the domain dictionary, together with their frequencies, to the domain dictionary;
S13, performing word segmentation on the basic corpus in combination with the domain dictionary, applying whole-word masking to the words appearing in the domain dictionary, and training the whole-word masking model so that it obtains contextual semantic representations of the words in the domain.
The unsupervised identification system for scientific literature knowledge entities is used to implement the above unsupervised identification method and comprises the following modules connected in sequence:
a pre-training module, which processes the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructs a domain dictionary in combination with a string frequency statistical algorithm, then segments the training corpus under the guidance of the domain dictionary and inputs it into the whole-word masking model for training, so that the model learns the contextual semantic and grammatical features of words in the relevant scientific domain;
a knowledge entity category representative word learning module, which inputs the segmented training corpus, in combination with the domain dictionary, into a word vector representation model for training to obtain vector representations of the words in the domain dictionary, re-learns the word vectors with a contrastive learning structure model, and clusters them to obtain a set of knowledge entity representative words and their categories, serving in the recognition flow as the basis for judging whether a text word is a knowledge entity;
a knowledge entity recognition module, which masks the words in the scientific literature text to be recognized, predicts the masked words with the trained whole-word masking model, calculates similarity scores between the obtained predicted words and the words in the constructed representative word set, judges whether a masked word is a knowledge entity, and determines its category.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention adopts an unsupervised method that starts from completely unlabeled text data and avoids manual data labeling, which can greatly save labor costs in knowledge entity recognition tasks for scientific literature in specific fields and provides a solution for low-resource fields that lack structured labeled data;
(2) Without relying on any structured data set, the invention combines the idea of contrastive learning with a word vector clustering method to construct the knowledge entity representative word set; in this process, the characteristics of the training model are creatively exploited to perform data enhancement conversion and construct new word vectors, which improves clustering accuracy to a certain extent, so that a representative word and category set of good quality can be obtained as the guiding basis of the recognition method.
Drawings
FIG. 1 is a block diagram of a system of the present invention;
FIG. 2 is a schematic flow chart of a pre-training module according to the present invention;
FIG. 3 is a schematic flow chart of a learning module of a knowledge entity class representative word according to the present invention;
FIG. 4 is a flow chart of a knowledge entity recognition module according to the present invention;
FIG. 5 is a diagram of a network framework for training the comparative learning structural model in S25 of the present invention;
Fig. 6 is a schematic diagram of entity identification and classification in S3 according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in fig. 1 to 6, the invention provides an unsupervised identification method and system for scientific literature knowledge entities that starts from unlabeled text data, constructs a knowledge entity representative word set as a judgment basis by combining contrastive learning and clustering, and identifies the knowledge entities in literature texts in combination with a whole-word masking model; the manual labeling work of traditional knowledge entity recognition is thus avoided, time and human resources are saved, and an executable unsupervised recognition method is provided for knowledge entity recognition in low-resource scientific fields.
An unsupervised recognition system for scientific literature knowledge entities comprises a pre-training module, a knowledge entity category representative word learning module and a knowledge entity recognition module:
the pre-training module is used for: collecting literature data, processing the collected unlabeled scientific literature text data to obtain training corpus of a full-word covering model (BERT-WWM model), constructing a domain dictionary by combining a serial frequency statistical algorithm, then inputting the full-word covering model to train the full-word covering model after word segmentation processing of the training corpus by taking the dictionary as a guide, and enabling the model to learn the contextual semantics and grammatical features of words in the related scientific domain;
The knowledge entity category representative word learning module is used for: inputting training corpus combined with dictionary word segmentation in a pre-training module into a word vector representation model for training to obtain vector representation of words in the dictionary, relearning the vectors of the words by utilizing a comparison learning structure model, and clustering to obtain a set of knowledge entity representative words and categories thereof, wherein the set is used as a basis for judging whether text words are knowledge entities in a recognition process;
the knowledge entity identification module is used for: masking nouns in a scientific literature text to be detected, predicting masking words by using a trained full word masking model, and then calculating similarity scores between the obtained predicted words and words in a constructed representative word set so as to judge whether the masking words are knowledge entities or not and determine the categories of the masking words.
When in operation, the method specifically comprises the following steps:
s1, the purpose of the pre-training module is as follows: on one hand, acquiring and processing literature text in the appointed field, and providing corpus data for a knowledge entity category representative word learning module; on the other hand, a predictive model is provided for the knowledge entity recognition module by pre-training contextual representations of text words of the learning literature by full word masking techniques (Whole Word Masking, WWM).
The method comprises the following specific steps:
S11, collecting the titles, keywords, and abstract data of documents in the relevant scientific field from a public database using crawler technology to form basic corpus data; after de-duplicating the keywords and manually removing words that obviously do not belong to knowledge entities, adding the keywords to the domain dictionary, with each word's initial frequency being its counted number of repetitions; and splicing the title and abstract data to form the basic corpus;
s12, extracting character strings with frequency in a specified range from the basic corpus by adopting an N-gram string frequency statistical algorithm, then, updating the frequency of the character string words of the existing domain dictionary, and directly adding the character string words which do not appear in the domain dictionary and the frequency thereof into the domain dictionary;
s13, performing word segmentation processing by combining the basic corpus with a domain dictionary, performing full-word covering processing on words appearing in the domain dictionary, and training by adopting a full-word covering model so as to enable the full-word covering model to obtain the context semantic representation of the words in the domain;
further, the basic idea of the N-gram algorithm adopted in step S12 is to perform sliding window operation of size N on the text content according to the byte stream, so as to form a byte fragment sequence of length N. Each byte segment is called a gram, the occurrence frequency of all the grams is counted, and the strings with the lengths and the frequencies meeting the requirements are obtained by filtering according to a preset threshold value and a preset rule. Here we consider that the minimum byte length of the knowledge entity is 2, the maximum is 10, and the minimum frequency is 2.
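A minimal sketch of this string-frequency statistic, using the values above (character n-grams of length 2 to 10, minimum frequency 2); the function name is ours:

```python
from collections import Counter

def ngram_counts(corpus, min_len=2, max_len=10, min_freq=2):
    """Count every character n-gram of length min_len..max_len in the corpus
    and keep those whose frequency meets min_freq."""
    counts = Counter()
    for text in corpus:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_freq})
```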
Further, for Chinese, a word segmentation step is needed in both model pre-training and word vector representation, so a domain dictionary to guide segmentation must be constructed. The N-gram algorithm based on string frequency statistics is chosen to construct this dictionary; for the requirements of the invention there is no need to study the word-boundary problem of new-word discovery in depth, as the segmentation result only needs to contain the target words as far as possible.
Further, the whole-word masking model in step S13 adopts the BERT-WWM model, an upgrade of BERT that can predict masked words; it mainly changes the training sample generation strategy of the BERT pre-training stage:
BERT masks in units of single characters, so "… material damage determination …" may be masked as "… material [MASK] damage determination …" (only part of a word is masked), whereas BERT-WWM masks in units of whole words and masks the text as "… material [MASK] determination …" (the entire word is masked); the trained model is therefore more accurate in predicting the words at the masked positions (see the sketch below);
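The difference in masking granularity can be sketched with a toy whole-word masking routine; the mask rate and all names are assumptions:

```python
import random

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """words: the segmented sentence, each word given as a list of its
    characters/sub-tokens. A selected word is masked whole, one mask
    token per sub-token, instead of masking single characters."""
    out = []
    for word in words:
        if random.random() < mask_rate:
            out.extend([mask_token] * len(word))
        else:
            out.extend(word)
    return out

# e.g. whole_word_mask([["材", "料"], ["损", "伤"], ["判", "定"]])
# may yield ["材", "料", "[MASK]", "[MASK]", "判", "定"]
```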
S2, the purpose of the knowledge entity category representative word learning module is as follows: the training corpus segmented in combination with the dictionary in step S13 is input into a word vector representation model for training to obtain vector representations of the words in the dictionary; the word vector data are then clustered in combination with a contrastive learning method, and a small set of domain knowledge entity representative words and their categories is constructed to provide a judgment basis for the knowledge entity recognition module.
The specific steps of constructing the entity categories and representative word set are as follows:
S21, performing word vector representation model training on the corpus segmented in S13, extracting the word vectors of the words in the domain dictionary, and applying data enhancement conversion twice to the screened word vectors to obtain two new word vectors with the same category as the original word vector but different values; these new word vectors are mutually positive samples;
S22, each generated new word vector forms a negative sample pair with any other enhanced word vector in the space; with N word vectors before enhancement, each of the two new word vectors forms negative pairs with the 2N-2 other word vectors;
S23, re-learning and re-characterizing the enhanced word vectors through the contrastive learning structure model and mapping them into a new representation space, in which the loss function draws positive samples ever closer and pushes negative samples ever farther apart, so that the sample word vectors are dispersed and distributed as uniformly as possible in the new representation space;
S24, clustering the re-characterized word vectors (e.g., with the K-means algorithm), calculating the semantic similarity (e.g., cosine similarity) between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is greater than the threshold, and removing the words whose semantics differ too much, thereby obtaining the required knowledge entity representative word set; the category information of each cluster is obtained by manually inspecting the specific words in each cluster after clustering.
Further, in step S21, word vectors are learned by a representation model; Word2Vec and BERT are the commonly used word vector representation models, and Word2Vec is selected here because the word vectors of BERT focus on reflecting the contextual information of words, whereas the construction of the representative word set in the method of the present invention focuses more on the semantic representation of the words themselves.
Further, the choice of the data enhancement conversion mode in step S21 is the core link in constructing the contrastive learning framework. The invention adopts the approach of inputting the training samples into the model twice to obtain two numerically different feature vectors, described in detail as follows:
The training samples are input into Word2Vec twice for training, and the vector representations of the required words are extracted; because of the randomness of each model training run, even with identical training parameter settings, the same word obtains two numerically different word vectors.
This is because Word2Vec training starts from random initialization, and a different random seed is used for each training run, which can result in different initial word vectors: the relative positions of the word vectors in space remain constant, but the absolute position of each result may differ. Because training is based on the same corpus, the semantic features of the words are similar. That is, from an original sample word x, the two data enhancements yield x′ and x″; x′ and x″ differ from x, and from each other, only in the numerical values of the vectors, but retain the semantic and category feature information of sample x. Therefore x and x′, and x′ and x″, are mutually positive samples, i.e., entity words belonging to the same class (see the sketch below).
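A gensim sketch of this double-training data enhancement, matching the Word2Vec settings given in the embodiment (size=300, window=5, min_count=2, sg=1); the seeds and names are ours:

```python
from gensim.models import Word2Vec

def two_views(sentences, vocab, dim=300):
    """Train Word2Vec twice on the same segmented corpus with different seeds;
    the two runs give numerically different vectors for each word while
    preserving its semantic features: the positive pair."""
    views = []
    for seed in (1, 2):
        m = Word2Vec(sentences, vector_size=dim, window=5, min_count=2,
                     sg=1, seed=seed, workers=1)  # workers=1 keeps runs reproducible
        views.append({w: m.wv[w] for w in vocab if w in m.wv})
    return views  # views[0][w] and views[1][w] form the positive pair for word w
```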
Further, the contrastive learning network structure adopted in step S23 is described in detail as follows:
The two new word vectors x′ and x″, obtained from the original sample word vector x after the data enhancement conversion, are passed through the feature encoder (Encoder) and converted into the corresponding feature vectors h′ and h″; this network structure consists of two fully connected layers (Fully Connected Layer, FC) and the nonlinear activation function Tanh, and is denoted by the function f(·). Subsequently, another nonlinear transformation structure, the Projector, further maps h′ and h″ into the vectors z′ and z″ in another space; it is composed of a fully connected layer (FC), batch normalization (Batch Normalization, BN), and a nonlinear activation function (ReLU), and is denoted by the function g(·). The data pair (z′, z″) are mutually positive samples, while z′ and z″ each form negative samples with any of the other 2N-2 vectors in the space. After the g(·) transformation, the enhancement vectors are projected into the new representation space, in which it is desired that positive samples be closer together and negative samples farther apart. This is achieved by defining a suitable loss function, with a semantic similarity measure as the criterion for judging whether the spatial distance is far or near. The specific loss function is as follows:
$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\,\lVert z_j \rVert}$$

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left(\ell_{2k-1,\,2k} + \ell_{2k,\,2k-1}\right)$$

wherein i, j, and k denote sample numbers; ℓ_{i,j} denotes the loss of the sample pair formed by samples i and j; z_i, z_j, and z_k denote the vectors of the samples numbered i, j, and k after transformation by the contrastive learning structure model; sim(·,·) denotes the similarity of two samples, calculated as cosine similarity; N denotes the total number of samples before data enhancement; 1_{[k≠i]} denotes an adjustment parameter taking the value 0 or 1 (1 when k ≠ i, otherwise 0); τ denotes the temperature hyperparameter; L denotes the final loss function.
The numerator of ℓ_{i,j} describes the similarity of the positive sample pair, while the denominator represents the sum of the similarities between the current sample and the other samples in the batch (the number of samples selected for one training step); ℓ_{i,j} thus expresses the probability that samples i and j are similar. Here z denotes the vector representation of a sample after the g(·) transformation, and the temperature hyperparameter τ (which scales the input and expands the range of the cosine similarity) controls the sensitivity of the loss to negative sample pairs; sim(·,·) computes the semantic similarity of the two vectors. L calculates the loss of all pairings and averages it, where 2N denotes the 2N samples obtained after processing the N samples of the original batch.
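A PyTorch sketch of the Encoder f(·) and Projector g(·) described above; the layer widths and output dimension are assumptions not fixed by the text:

```python
import torch.nn as nn

class ContrastiveNet(nn.Module):
    """Encoder f(.): two FC layers with Tanh; Projector g(.): FC + BN + ReLU."""
    def __init__(self, dim: int = 300, hidden: int = 300, out: int = 128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, hidden), nn.Tanh())
        self.g = nn.Sequential(nn.Linear(hidden, out),
                               nn.BatchNorm1d(out),
                               nn.ReLU())

    def forward(self, x):
        h = self.f(x)      # feature vector h' / h''
        return self.g(h)   # z' / z'' fed to the contrastive loss
```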
Further, the K-means algorithm adopted in step S24 is described in detail as follows:
(1) K words are selected from the word vector space re-characterized after contrastive learning as the initial cluster centers;
(2) the distances between all other word vectors in the word vector space and each cluster center are calculated, and each sample word is assigned to the closest cluster; the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word is considered to belong to that cluster;
(3) after all sample words in the vector space have been processed, the mean vector of the sample words in each cluster is calculated and taken as the new cluster center, updating the original cluster center;
(4) steps (2) and (3) are repeated until the cluster centers no longer change, i.e., convergence, and training is complete.
The mean vector calculation formula of the sample words in step (3) is as follows:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

wherein x is a sample word vector in cluster C_j, μ_j is the mean vector of cluster C_j, and |C_j| is the number of samples in cluster C_j.
It should be further noted that the setting scheme for the cluster number K uses the silhouette coefficient method for reverse evaluation, specifically as follows:
Assume the data to be classified have been clustered by the clustering algorithm into K clusters. For each sample word i in each cluster, its silhouette coefficient is calculated separately; specifically, the following two indices are calculated for each sample word i:
(1) a(i): the average distance from sample word i to the other sample words belonging to the same cluster; the smaller a(i), the greater the likelihood that sample word i belongs to that category;
(2) b(i): the minimum, over the other clusters C, of the average distance from sample word i to all samples in cluster C, i.e.:

$$b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{x \in C} d(i, x)$$

Then the silhouette coefficient of sample word i is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

and the average of the silhouette coefficients s(i) of all sample words is the average silhouette coefficient S of the clustering result, with S ∈ [−1, 1].
The closer the samples within a cluster and the farther apart the samples of different clusters, the larger the average silhouette coefficient and the better the clustering effect; so the K with the largest average silhouette coefficient is the best cluster number.
S3, the purpose of the knowledge entity recognition module is as follows: the prediction model obtained by pre-training in S1 is combined with the knowledge entity categories and representative word set constructed in S2 to identify the knowledge entities in the text to be detected. The specific steps are as follows:
S31, performing word segmentation on the text to be detected, identifying the nouns w in the text, and masking them;
S32, predicting the possible output words p of a masked word w by using the BERT-WWM whole-word masking model trained in S13;
S33, combining the domain knowledge entity representative word set obtained in module S2, calculating the score Score(w, c_j) that the masked word w belongs to category c_j;
S34, setting a threshold: when the score Score(w, c_j) is greater than the threshold, the masked word is determined to be a knowledge entity belonging to the corresponding entity category c_j; otherwise, the masked word is deemed not to be a knowledge entity.
Further, the method of judging whether a masked word is a knowledge entity according to its category score Score(w, c_j) is described in detail as follows:
Step one, the text of the data to be recognized is segmented in combination with the domain dictionary, the nouns in it are identified and denoted w; w is masked and the masked part is predicted with the trained BERT-WWM model to obtain predicted words p_k. A threshold δ is set, the vocabulary items p_k whose predicted probability is greater than δ are taken out, and the average semantic similarity of each extracted p_k to all representative words r_t of each category c_j is calculated. The semantic similarities of the extracted predicted words p_k to entity category c_j are then weighted-averaged, finally yielding the semantic similarity Sim(w, c_j) between the masked word w and the entity category c_j; the formula is as follows:

$$\mathrm{Sim}(w, c_j) = \sum_{k} \alpha_k \cdot \frac{1}{m_j} \sum_{t=1}^{m_j} \mathrm{sim}(p_k, r_t)$$

wherein m_j is the number of representative words in category c_j and α_k is the weight of predicted word p_k in the weighted average (e.g., its normalized prediction probability).
Step two, entity categories containing more entity words are given larger weights, and weights w_j are set for entity categories of different scales; letting n_j denote the number of entity words contained in entity cluster c_j, the weight calculation formula is as follows:

$$w_j = \frac{\log n_j}{\sum_{l=1}^{K} \log n_l}$$

It should be noted that the logarithm of the number of elements in a category is taken here to reduce the impact of the element count on the weight.
Step three, the score Score(w, c_j) of the entity attribution category is calculated again; if there is an entity category c_j such that Score(w, c_j) is greater than the set threshold, the masked word w is considered to belong to the corresponding entity category; otherwise the word is not considered an entity word. The specific score calculation is as follows:

$$\mathrm{Score}(w, c_j) = w_j \cdot \mathrm{Sim}(w, c_j)$$
Example 2
As further optimization of embodiment 1, as shown in fig. 1 to 6, this embodiment further includes the following technical features on the basis of embodiment 1:
as shown in fig. 1, the embodiment of the invention provides an unsupervised recognition method for a knowledge entity of a scientific literature in the laser field, which comprises a pre-training module, a knowledge entity category representative word learning module and a knowledge entity recognition module. The pre-training module is used for constructing a topic dictionary in the laser field and learning context semantic and grammar characteristics of words in the laser field through a full word covering model; the knowledge entity category representative word learning module is used for clustering to construct a small-scale laser knowledge entity representative word set with definite types, so as to be used as a guiding basis for judging whether words in the text to be detected are knowledge entities in the knowledge entity recognition module; the knowledge entity recognition module is used for recognizing the knowledge entity in the text to be detected by combining the laser field dictionary and the laser knowledge entity representative word set.
According to fig. 2, the pre-training module can provide a laser domain dictionary required in the knowledge entity recognition process, a model training corpus required by the knowledge entity representative word learning module, and a BERT-WWM model with laser domain priori knowledge learned. The detailed implementation steps are as follows:
Step one, 6,598 scientific documents on "laser damage" were collected from the public dataset. The title and abstract data were spliced and filtered with a Chinese general stop-word list to obtain the basic corpus for pre-training; the keywords were stored in the domain dictionary file one word per line and de-duplicated, with each word's repetition count taken as its initial frequency.
Step two, the maximum entity word length is set to L = 10; the basic corpus is filtered with a Chinese general stop-word list, and string frequency statistics are performed on word strings of length at most L and at least 2. The word strings whose counted frequency is greater than a set threshold are retained: if a word string already exists in the initial domain dictionary, its frequency is updated to the sum of the initial frequency and the counted frequency; new word strings and their frequencies are directly added to the domain dictionary, finally yielding the final domain dictionary. The threshold is set to 2 here because strings occurring fewer than 2 times are considered too infrequent to belong to a knowledge entity. This embodiment obtains 8,884 entity words in total for the initial laser domain dictionary.
Step three, the basic corpus is segmented using a word segmentation tool (e.g., jieba) in combination with the domain dictionary; the segmented corpus is used as the input corpus of the BERT-WWM model, whole-word masking is applied to the words appearing in the domain dictionary, and model pre-training is then performed so that the model learns the prior knowledge of the laser domain, yielding the prediction model required by the knowledge entity recognition module.
In order to save resources, BERT-WWM model training adopts a two-stage pre-training mode, with sentence length 128 in the first stage and 512 in the second stage. The main pre-training tasks used are whole-word masking (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP); since the task of the method of the present invention is unsupervised knowledge entity recognition, no task-level (e.g., classification) pre-training tasks are performed.
According to the illustration shown in fig. 3, the domain knowledge entity representative word learning module can obtain a small-scale definite type laser knowledge entity representative word set through a clustering algorithm, and provides a judgment basis for knowledge entity identification. The detailed implementation steps are as follows:
Step one, the corpus segmented in the pre-training module is used as the input corpus of a Word2Vec model for word vector representation learning, with the main parameters set as follows: size=300 (word vector dimension), window=5, min_count=2, sg=1 (using the Skip-gram model). After training, the model is saved and the word vectors of the laser domain dictionary are extracted; the training samples are then input into Word2Vec twice more for training to perform data enhancement, keeping the training parameter settings consistent and distinguishing and marking the word vectors of the same words; the word vectors are then re-characterized through the contrastive learning network structure shown in fig. 5;
Step two, clustering is finally performed in the new characterization space using the K-means algorithm; for each vocabulary item of each cluster in the clustering result, the specific entity category represented by the cluster is determined manually from its specific representative words, and the entity words whose semantic similarity is greater than a set threshold are screened out as the laser knowledge entity representative word set;
Finally, the categories of knowledge entities in the laser field are divided into: laser type (T), experimental theory (A), experimental resource (R), experimental operation (H), experimental result (O), and others (E), where the number of representative words in each category is determined by the set threshold size.
According to the illustration in fig. 4, the knowledge entity recognition module can recognize the laser knowledge entity in the text to be detected through a pre-trained predictive model. The detailed implementation steps are as follows:
Step one, word segmentation is performed on the text to be detected in combination with the constructed laser domain dictionary, the nouns in the text are identified, each word is labeled using the divided entity categories, and 100 nouns are labeled, obtaining text that can be used for testing;
Step two, each identified word is masked with [MASK], and the possible words p_k for the masked part are predicted using the pre-trained BERT-WWM model; the semantic similarity Sim(w, c_j) between the masked word and entity category c_j is then calculated: as shown in fig. 6, a threshold δ is set (e.g., 0.6), the vocabulary items p_k whose predicted probability is greater than δ are taken out, and the average semantic similarity of each p_k to all representative words r_t in category c_j is calculated; the semantic similarities of the extracted predicted words p_k to entity category c_j are weighted-averaged to obtain the final Sim(w, c_j);
Step three, the score Score(w, c_j) of the masked word belonging to each category is calculated with the above formula; if there is an entity category c_j such that Score(w, c_j) is greater than the set threshold, the masked word is judged to be a laser knowledge entity whose category is the corresponding entity category c_j; otherwise it is not.
Finally, among the 100 labeled words, the correct words were identified and 47 of them were also assigned the correct attribution category, demonstrating the feasibility of the invention.
As described above, the present invention can be preferably implemented.
All of the features disclosed in all of the embodiments of this specification, or all of the steps in any method or process disclosed implicitly, except for the mutually exclusive features and/or steps, may be combined and/or expanded and substituted in any way.
The foregoing description of the preferred embodiment of the invention is not intended to limit the invention in any way, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. The non-supervision identification method for the scientific literature knowledge entity is characterized in that a full-word covering model is pre-trained by utilizing unlabeled scientific literature text data, a set of knowledge entity representative words and categories thereof is constructed by combining a comparison learning and clustering method to serve as a judgment basis, then words in the scientific literature text are predicted by utilizing the pre-trained full-word covering model, whether the words in the scientific literature text are the knowledge entity or not is judged by calculating the similarity between the predicted words and the representative words, and the categories of the words in the scientific literature text are determined;
the method comprises the following steps:
s1, pre-training: processing the collected unlabeled scientific literature text data to obtain a training corpus of a full-word covering model, constructing a domain dictionary by combining a serial frequency statistical algorithm, then inputting the full-word covering model to train the full-word covering model after the training corpus is subjected to word segmentation processing by taking the domain dictionary as a guide, and enabling the full-word covering model to learn the contextual semantics and grammatical features of words in the related scientific domain;
s2, learning the knowledge entity category representative word: inputting the training corpus after word segmentation in the S1 by combining with the domain dictionary into a word vector representation model for training to obtain the vector representation of the words in the domain dictionary, relearning the vectors of the words by utilizing a comparison learning structure model, and clustering to obtain a set of knowledge entity representative words and the categories thereof, wherein the set is used as a basis for judging whether the text words are the knowledge entities in the recognition flow;
S3, knowledge entity recognition: masking words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, and then calculating similarity scores between the obtained predicted words and the words in the representative word set constructed in S2, so as to judge whether the masked words are knowledge entities and determine their categories;
step S2 comprises the steps of:
S21, training the word vector representation model on the segmented training corpus, extracting the word vectors of the words in the domain dictionary, and performing data enhancement twice on each extracted word vector to obtain two new word vectors with the same semantic features as the original word vector but different values, the two new word vectors being positive examples of each other;

S22, taking each of the two generated new word vectors together with any other data-enhanced word vector as a negative example pair; if there are N word vectors before the two data enhancements, the enhanced word vector space contains 2N word vectors, so each of the two new word vectors forms negative example pairs with the 2N-2 other word vectors;

S23, re-representing the twice-enhanced word vectors in a new vector space through the contrastive learning structure model, in which a loss function constrains positive-class samples to move ever closer together and negative-class samples to move ever farther apart, so that the word vectors become as dispersed and uniformly distributed in the new representation space as possible;

S24, clustering the re-characterized word vectors, calculating the semantic similarity between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is larger than the set threshold, and determining the category represented by each cluster from the specific entity words it contains, thereby obtaining the required knowledge entity representative word set.
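Steps S21 and S22 resemble SimCSE-style augmentation, in which two stochastic dropout passes over the same vector yield a positive pair. A minimal PyTorch sketch follows; the dimensions and the dropout rate are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# Sketch of S21-S22: two dropout passes over the same word vectors yield two
# "views" with the same semantics but different values (positive pairs); all
# other augmented vectors in the batch act as negatives.
torch.manual_seed(0)
dropout = nn.Dropout(p=0.1)            # nn.Dropout applies noise in training mode
word_vecs = torch.randn(8, 300)        # N = 8 word vectors of dimension 300

view_a = dropout(word_vecs)            # first data enhancement
view_b = dropout(word_vecs)            # second data enhancement
batch = torch.cat([view_a, view_b], 0) # 2N vectors in the enhanced space
# For word i, (view_a[i], view_b[i]) is the positive pair; the remaining
# 2N - 2 vectors in `batch` are its negative examples.
```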
2. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 1, wherein in step S23 the loss function is as follows:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left( \ell_{2k-1,\,2k} + \ell_{2k,\,2k-1} \right)$$

wherein $i$, $j$, $k$ denote sample numbers; $\ell_{i,j}$ denotes the loss of the sample pair composed of samples $i$ and $j$; $z_i$ denotes the vector of the sample numbered $i$ after transformation by the contrastive learning structure model, $z_j$ denotes the vector of the sample numbered $j$ after transformation by the contrastive learning structure model, and $z_k$ denotes the vector of the sample numbered $k$ after transformation by the contrastive learning structure model; $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity of two samples, calculated as the cosine similarity; $N$ denotes the total number of samples before data enhancement; $\mathbb{1}_{[k \neq i]}$ denotes an adjustment parameter taking the value 0 or 1: when $k \neq i$ it takes the value 1, otherwise 0; $\tau$ denotes a temperature parameter for controlling the degree of uniformity of the sample distribution; $L$ denotes the final loss function.
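This is the NT-Xent (InfoNCE) contrastive loss; a compact PyTorch sketch is given below. The pairing layout (rows 2k and 2k+1 hold the positive pair of sample k), the dimensions, and the temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive loss of claim 2 over 2N re-characterized vectors `z`,
    arranged so rows (2k, 2k+1) are the positive pair of sample k.
    tau is the temperature controlling distribution uniformity."""
    z = F.normalize(z, dim=1)              # cosine similarity via dot product
    sim = z @ z.t() / tau                  # pairwise sim(z_i, z_k) / tau
    n2 = z.size(0)
    mask = torch.eye(n2, dtype=torch.bool)
    sim.masked_fill_(mask, float("-inf"))  # realizes the indicator 1[k != i]
    targets = torch.arange(n2) ^ 1         # partner index: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, targets)   # mean of -log softmax at the partner

loss = nt_xent_loss(torch.randn(8, 128))   # example: 2N = 8 augmented vectors
```

Averaging the per-row cross-entropy over all 2N rows is equivalent to the $\frac{1}{2N}\sum_k (\ell_{2k-1,2k} + \ell_{2k,2k-1})$ form of $L$ above.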
3. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 2, wherein in step S24 the re-characterized word vectors are clustered with the K-means algorithm, comprising the following steps:

S241, selecting K words in the word vector space re-characterized after contrastive learning as the initial cluster centers;

S242, calculating the distance between every other word vector in the word vector space and each cluster center, and assigning each sample word to the nearest cluster, wherein the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word belongs to that cluster;

S243, after all sample words in the vector space have been processed, calculating the mean vector of the sample words in each cluster and taking it as the new cluster center, replacing the original cluster center; the mean vector of the sample words is calculated as:

$$\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$

wherein $\mu_i$ denotes the mean vector of the sample words in cluster $C_i$, $C_i$ denotes a cluster, $x$ denotes a word vector in cluster $C_i$, and $\lvert C_i \rvert$ denotes the number of sample words in cluster $C_i$;

S244, repeating steps S242 to S243 until the cluster centers no longer change, at which point training is complete.
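The loop of S241 to S244 can be sketched in NumPy as follows; this is an illustration rather than the patented implementation, and it assumes no cluster becomes empty during iteration:

```python
import numpy as np

def kmeans(word_vecs: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Sketch of S241-S244 over re-characterized word vectors (one per row)."""
    rng = np.random.default_rng(seed)
    # S241: pick K words as the initial cluster centers
    centers = word_vecs[rng.choice(len(word_vecs), size=k, replace=False)]
    for _ in range(n_iter):
        # S242: assign each sample word to its nearest cluster center
        dists = np.linalg.norm(word_vecs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S243: the mean vector of each cluster becomes the new center
        # (assumes every cluster keeps at least one member)
        new_centers = np.array([word_vecs[labels == i].mean(axis=0) for i in range(k)])
        # S244: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```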
4. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 3, wherein the cluster number K is set as follows:

assuming the data to be classified are clustered by the clustering algorithm into K clusters, the silhouette coefficient of each sample word in each cluster is calculated from the following quantities:

$a(i)$: the average distance from sample point $i$ to the other sample points in the same cluster; the smaller $a(i)$, the greater the likelihood that the sample point belongs to its cluster;

$b(i)$: the minimum, over all other clusters, of the average distance $b_{ik}$ from sample point $i$ to all samples of cluster $k$; $b(i)$ is calculated as:

$$b(i) = \min\left(b_{i1}, b_{i2}, \ldots, b_{iK}\right)$$

taken over the clusters other than the one containing sample point $i$; the silhouette coefficient of sample point $i$ is then:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein $s(i)$ denotes the silhouette coefficient of sample point $i$;

the average of the silhouette coefficients of all sample points is the average silhouette coefficient $S$ of the clustering result, $S \in [-1, 1]$; the closer the samples within a cluster and the farther apart the samples between clusters, the larger the average silhouette coefficient and the better the clustering effect, so the cluster number K is set to the value that maximizes the average silhouette coefficient.
5. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 4, wherein S3 comprises the following steps:

S31, segmenting the text to be examined into words, recognizing the nouns $w$ in the text, and masking them;

S32, predicting the possible output words $y_i$ for each masked word with the whole-word masking model obtained in S1;

S33, combining the knowledge entity representative word set obtained in S2, calculating the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to each category $T_k$;

S34, setting a threshold; when the score $\mathrm{score}(w, T_k)$ is greater than the threshold, the masked word is judged to be a knowledge entity belonging to the corresponding entity class $T_k$; otherwise the masked word is judged not to be a knowledge entity.
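Steps S31 and S32 can be illustrated with the HuggingFace fill-mask pipeline. The checkpoint hfl/chinese-bert-wwm-ext and the example sentence are stand-ins, since the patent's own pre-trained model is not public:

```python
from transformers import pipeline

# Sketch of S31-S32: mask a noun in the text and read off the model's
# candidate words with their prediction probabilities. The single [MASK]
# here stands in for the whole-word masking of a full domain term.
fill = pipeline("fill-mask", model="hfl/chinese-bert-wwm-ext")

sentence = "该系统采用[MASK]器产生高能脉冲。"   # hypothetical example sentence
predictions = fill(sentence, top_k=10)
candidates = [(p["token_str"], p["score"]) for p in predictions]
```

The (word, probability) pairs feed directly into the scoring of S33, where low-probability predictions are filtered by the threshold $p_0$.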
6. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 5, wherein in step S33 the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to category $T_k$ is calculated as follows:

S341, predicting the possible words $y_i$ for the masked word $w$ with the pre-trained whole-word masking model; setting a threshold $p_0$ and taking out the predicted words $y_i$ whose prediction probability $p_i$ is greater than $p_0$; calculating the average semantic similarity between each extracted predicted word $y_i$ and all representative words $r_j$ of entity class $T_k$; then performing a weighted average of the semantic similarities of the extracted predicted words with entity class $T_k$ to finally obtain the semantic similarity $\mathrm{sim}(w, T_k)$ between the masked word $w$ and entity class $T_k$, with the formula:

$$\mathrm{sim}(w, T_k) = \sum_{i} p_i \cdot \frac{1}{M_k} \sum_{j=1}^{M_k} \cos\left(y_i, r_j\right)$$

wherein $M_k$ is the number of representative words contained in category $T_k$ of the representative word set;

S342, setting larger weights for the entity categories containing more entity words, different weights being assigned to the different entity categories of the representative word set; with $n_k$ denoting the number of entity words contained in entity cluster $k$, the weight is calculated as:

$$\lambda_k = \frac{n_k}{\sum_{m} n_m}$$

wherein $\lambda_k$ denotes the weight assigned to category $T_k$;

S343, recalculating the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to category $T_k$ with the formula:

$$\mathrm{score}(w, T_k) = \lambda_k \cdot \mathrm{sim}(w, T_k)$$
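Under the reconstruction above, in which the extracted predicted words are weighted by their prediction probabilities and categories by normalized cluster size (both inferred from the surrounding text rather than stated verbatim in the original formula images), S341 to S343 can be sketched as:

```python
import numpy as np

def category_score(pred, rep_words, cluster_sizes, category):
    """Sketch of S341-S343: `pred` is a list of (prob, vector) pairs for the
    predicted words with prob > p0; `rep_words` maps each category T_k to the
    vectors of its representative words; `cluster_sizes` maps T_k to the
    number n_k of entity words in its cluster."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    reps = rep_words[category]
    # sim(w, T_k): probability-weighted average of each predicted word's mean
    # cosine similarity to the M_k representative words of the category
    sim = sum(p * np.mean([cos(v, r) for r in reps]) for p, v in pred)
    # lambda_k: categories holding more entity words receive larger weights
    weight = cluster_sizes[category] / sum(cluster_sizes.values())
    return weight * sim   # score(w, T_k) = lambda_k * sim(w, T_k)
```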
7. The unsupervised identification method oriented to scientific literature knowledge entities according to any one of claims 1 to 6, wherein step S1 comprises the following steps:

S11, collecting the title, keyword, and abstract data of documents in the relevant scientific field from a public database to form the basic corpus data; after deduplicating the keywords and manually removing words that obviously are not knowledge entities, adding the keywords to the domain dictionary; and splicing the title and abstract data into the basic corpus;

S12, extracting from the basic corpus, with an N-gram string frequency statistics algorithm, the character strings whose frequencies fall within a specified range; then updating the frequencies of the character strings already in the domain dictionary, and directly adding the character strings not yet in the domain dictionary, together with their frequencies, to the domain dictionary;

S13, performing word segmentation on the basic corpus with the domain dictionary, applying whole-word masking to the words that appear in the domain dictionary, and training the whole-word masking model so that it obtains the contextual semantic representations of the words in the domain.
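Step S12 can be illustrated with a plain Python string frequency count; the n-gram lengths and the frequency band below are illustrative assumptions:

```python
from collections import Counter

def ngram_candidates(corpus: list[str], n_range=(2, 3, 4),
                     min_freq: int = 5, max_freq: int = 500) -> dict[str, int]:
    """Sketch of S12: N-gram string frequency statistics over the basic corpus,
    keeping character strings whose frequency falls in the specified range."""
    counts = Counter()
    for text in corpus:
        for n in n_range:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {s: c for s, c in counts.items() if min_freq <= c <= max_freq}

# Retained strings are then merged into the domain dictionary: frequencies of
# existing entries are updated, and new strings are added with their counts.
```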
8. An unsupervised identification system oriented to scientific literature knowledge entities, used to implement the unsupervised identification method oriented to scientific literature knowledge entities according to any one of claims 1 to 7, characterized by comprising the following modules connected in sequence:

a pre-training module: for processing the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructing the domain dictionary with a string frequency statistics algorithm, and then, after segmenting the training corpus into words under the guidance of the domain dictionary, inputting it into the whole-word masking model for training, so that the whole-word masking model learns the contextual semantic and grammatical features of words in the relevant scientific domain;

a knowledge entity category representative word learning module: for inputting the segmented training corpus with the domain dictionary into the word vector representation model for training to obtain the vector representations of the words in the domain dictionary, relearning the word vectors with the contrastive learning structure model, and clustering them to obtain the set of knowledge entity representative words and their categories, which serves in the recognition flow as the basis for judging whether a text word is a knowledge entity;

a knowledge entity recognition module: for masking the words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, calculating the similarity scores between the obtained predicted words and the words in the constructed representative word set, judging whether the masked words are knowledge entities, and determining their categories.
CN202310323198.6A 2023-03-30 2023-03-30 Unsupervised identification method and system oriented to scientific literature knowledge entity Active CN116050419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310323198.6A CN116050419B (en) 2023-03-30 2023-03-30 Unsupervised identification method and system oriented to scientific literature knowledge entity


Publications (2)

Publication Number Publication Date
CN116050419A CN116050419A (en) 2023-05-02
CN116050419B true CN116050419B (en) 2023-06-02

Family ID: 86129854

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116798633B (en) * 2023-08-22 2023-11-21 北京大学人民医院 Construction method of wound data security risk assessment system and electronic equipment


Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Named entity recognition method in the geography field
CN113988073A (en) * 2021-10-26 2022-01-28 迪普佰奥生物科技(上海)股份有限公司 Text recognition method and system suitable for life science
CN114282592A (en) * 2021-11-15 2022-04-05 清华大学 Deep learning-based industry text matching model method and device
CN114254653A (en) * 2021-12-23 2022-03-29 深圳供电局有限公司 Scientific and technological project text semantic extraction and representation analysis method

Non-Patent Citations (3)

Title
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark; Ningyu Zhang et al.; Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; Vol. 1, pp. 7888-7915 *
Named entity recognition method for the government affairs domain based on BERT-BLSTM-CRF; Yang Chunming et al.; Journal of Southwest University of Science and Technology; Vol. 35, No. 3, pp. 86-91 *
A BERT-based named entity recognition model for hazardous chemicals; Chen Guanlin et al.; Guangxi Sciences; Vol. 30, No. 1, pp. 43-51 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant