CN116050419B - Unsupervised identification method and system oriented to scientific literature knowledge entity - Google Patents


Info

Publication number
CN116050419B
Authority
CN
China
Prior art keywords
word
words
entity
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310323198.6A
Other languages
Chinese (zh)
Other versions
CN116050419A (en)
Inventor
Zhang Hui (张晖)
Lan Haoyu (兰浩宇)
Yang Chunming (杨春明)
Chen Yang (陈洋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology
Priority to CN202310323198.6A
Publication of CN116050419A
Application granted
Publication of CN116050419B
Status: Active
Anticipated expiration

Classifications

    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F40/242 Dictionaries
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/30 Semantic analysis
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of knowledge entity identification and discloses an unsupervised identification method and system for scientific literature knowledge entities. The method solves the problem in the prior art that knowledge entities are difficult to identify in scientific text data resources that lack a public labeled data set.

Description

Unsupervised identification method and system oriented to scientific literature knowledge entity
Technical Field
The invention relates to the technical field of knowledge entity identification, in particular to an unsupervised identification method and an unsupervised identification system for scientific literature knowledge entities.
Background
A knowledge entity in scientific literature is a term entity that expresses a key knowledge point in the professional literature and carries rich scientific knowledge. In recent years, the recognition and extraction of knowledge entities in scientific literature has attracted wide attention, and conferences on the subject, such as the "Scientific Literature Knowledge Entity Extraction and Evaluation Workshop" and the "Workshop on Natural Language Processing for Scientific Text", have been held successively to discuss how to accurately and comprehensively identify and extract knowledge entities from scientific texts, which is of great significance for constructing a knowledge system in a specific scientific field.
Current research on identifying and extracting knowledge entities and their categories mainly uses four kinds of methods: manual extraction, dictionary- and rule-based methods, traditional machine learning, and deep learning. The better-performing work is carried out under supervised or semi-supervised conditions, which require a large amount of high-quality labeled data as a corpus basis; however, specific scientific fields often lack such labeled data, so manual intervention is needed to complete the labeling work. Because the division of knowledge entity types has no fixed standard and differs across fields (knowledge entities can generally be divided into categories such as methods, tools, theories, and resources), non-domain experts cannot perform the corpus labeling work, which greatly increases time and human resource costs.
Existing unsupervised knowledge entity identification methods are still at an exploratory stage; their effect is not superior to that of supervised learning, but they avoid manual labeling work. Their basic idea is to construct a set of entity and category representative words from publicly available structured data (an electric service manual) to act as a guiding basis, predict the words in the text with the whole-word masking technique, and then calculate the similarity between text words and representative words, thereby completing named entity recognition and type judgment. In such methods, constructing the representative word set as a guiding basis requires the support of published structured data; yet specific scientific fields lack public data sets and have only unlabeled text data resources, so these methods cannot be directly migrated to knowledge entity recognition in scientific literature.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an unsupervised identification method and system for scientific literature knowledge entities, which solve the problem in the prior art that knowledge entities are difficult to identify in scientific text data resources that lack a public labeled data set.
The invention solves the above problems through the following technical scheme:
A whole-word masking model is pre-trained with unlabeled scientific literature text data; a set of knowledge entity representative words and their categories is constructed as a judgment basis by combining contrastive learning and clustering methods; the words in a scientific literature text are then predicted with the pre-trained whole-word masking model, and whether a word in the scientific literature text is a knowledge entity is judged, and its category determined, by calculating the similarity between the predicted words and the representative words.
As a preferred technical scheme, the method comprises the following steps:
S1, pre-training: processing the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructing a domain dictionary in combination with a string frequency statistical algorithm, then segmenting the training corpus into words under the guidance of the domain dictionary and inputting it into the whole-word masking model for training, so that the model learns the contextual semantic and grammatical features of words in the relevant scientific domain;
S2, knowledge entity category representative word learning: inputting the training corpus segmented in S1, in combination with the domain dictionary, into a word vector representation model for training to obtain vector representations of the words in the domain dictionary, re-learning the word vectors with a contrastive learning structure model, and clustering to obtain a set of knowledge entity representative words and their categories, which serves in the recognition flow as the basis for judging whether a text word is a knowledge entity;
S3, knowledge entity recognition: masking the words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, and then calculating similarity scores between the obtained predicted words and the words in the representative word set constructed in S2, so as to judge whether a masked word is a knowledge entity and determine its category.
As a preferred technical solution, step S2 includes the following steps:
S21, performing word vector representation model training on the segmented training corpus, extracting the word vectors of the words in the domain dictionary, and applying data enhancement twice to each extracted word vector to obtain two new word vectors that share the category and semantic features of the original word vector but differ in value; these new word vectors are mutually positive samples;
S22, each of the two generated new word vectors forms a negative sample pair with any other word vector after data enhancement; denoting the number of word vectors before the two data enhancements as N, the word vector space after enhancement contains 2N word vectors, so each new word vector forms negative sample pairs with the 2N-2 other word vectors;
S23, re-learning and re-characterizing the twice-enhanced word vectors into a new vector space through the contrastive learning structure model; in this space a loss function constrains the distance between positive samples to become ever closer and the distance between negative samples to become ever farther, so that the word vectors are dispersed and distributed as uniformly as possible in the new representation space;
S24, clustering the re-characterized word vectors, calculating the semantic similarity between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is greater than the set threshold, and determining the category represented by each cluster from the specific entity words it contains, thereby obtaining the required knowledge entity representative word set (see the sketch after this list).
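For illustration, the screening in S24 can be sketched as follows. This is a minimal sketch; the threshold value 0.8 and all names are assumptions, not values fixed by the invention:

```python
import numpy as np

def representative_words(words, vecs, labels, centers, thresh=0.8):
    """S24 sketch: keep, per cluster, the words whose cosine similarity
    to their own cluster center exceeds the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    reps = {j: [] for j in range(len(centers))}
    for w, v, j in zip(words, vecs, labels):
        if cos(v, centers[j]) > thresh:
            reps[j].append(w)
    return reps  # cluster id -> representative words; categories are named manually
```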
As a preferred embodiment, in step S23, the loss function is as follows:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\,\lVert z_j \rVert}$$

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left(\ell_{2k-1,\,2k} + \ell_{2k,\,2k-1}\right)$$

wherein i, j, and k denote sample numbers; ℓ_{i,j} denotes the loss of the sample pair formed by samples i and j; z_i, z_j, and z_k denote the vectors of the samples numbered i, j, and k after transformation by the contrastive learning structure model; sim(·,·) denotes the similarity of two samples, calculated as cosine similarity; N denotes the total number of samples before data enhancement; 1_{[k≠i]} denotes an adjustment parameter taking the value 0 or 1 (its value is 1 when k ≠ i, otherwise 0); τ denotes a temperature parameter used to control the degree of uniformity of the sample distribution; L denotes the final loss function.
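In PyTorch, this NT-Xent-style loss can be sketched as follows. This is an illustrative reconstruction under the definitions above, not the patented implementation; the function name and the default temperature are ours:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: [N, d] transformed vectors of the two enhanced views;
    row i of z1 and row i of z2 form a positive pair."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # cosine similarity via dot product
    sim = z @ z.t() / tau                               # [2N, 2N] similarity matrix
    sim.fill_diagonal_(float("-inf"))                   # the indicator 1[k != i]
    pos = (torch.arange(2 * n, device=z.device) + n) % (2 * n)  # index of each positive
    return F.cross_entropy(sim, pos)                    # mean of -log softmax terms = L
```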
As a preferable technical scheme, in step S24, the K-means algorithm is adopted to cluster the re-characterized word vectors, comprising the following steps:
S241, selecting K words in the word vector space re-characterized after contrastive learning as the initial cluster centers;
S242, calculating the distances between all other word vectors in the word vector space and each cluster center, and assigning each sample word to the closest cluster; the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word is considered to belong to that cluster;
S243, after all sample words in the vector space have been processed, calculating the mean vector of the sample words in each cluster, taking it as the new cluster center, and updating the original cluster center; the mean vector calculation formula of the sample words is as follows:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

wherein μ_j denotes the mean vector of the sample words in cluster C_j, x denotes a word vector in cluster C_j, and |C_j| denotes the number of sample words in cluster C_j;
S244, repeating steps S242 to S243 until the cluster centers no longer change, completing the training (see the sketch below).
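A plain NumPy sketch of these K-means steps (the initialization strategy, iteration cap, and names are assumptions):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """X: [N, d] re-characterized word vectors. Returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # S241: initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)                           # S242: nearest center
        new_centers = np.array([X[labels == j].mean(axis=0)     # S243: mean vectors
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                   # S244: converged
            break
        centers = new_centers
    return labels, centers
```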
As a preferable embodiment, the cluster number K is set as follows:
Assume the data to be classified have been clustered by the clustering algorithm, finally obtaining K clusters. For each sample word (sample point) in each cluster, its silhouette coefficient is calculated separately; for each sample point i the following indices are calculated:
a(i): the average distance from sample point i to the other sample points belonging to the same cluster; the smaller a(i), the greater the likelihood that the sample point belongs to that category;
b(i): the minimum, over the other clusters C, of the average distance from sample point i to all samples in cluster C; the calculation formula of b(i) is:

$$b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{x \in C} d(i, x)$$

The silhouette coefficient of sample point i is then:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein s(i) denotes the silhouette coefficient of sample point i;
the average of the silhouette coefficients of all sample points is the average silhouette coefficient S of the clustering result, with S ∈ [−1, 1]. The closer the samples within a cluster and the farther apart the samples of different clusters, the larger the average silhouette coefficient and the better the clustering effect; the K with the largest average silhouette coefficient is therefore chosen (see the sketch below).
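Using scikit-learn, the silhouette-based choice of K can be sketched as follows (the K range and random seed are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X: np.ndarray, k_range=range(2, 11)) -> int:
    """Return the K whose clustering has the largest average silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average s(i) over all samples
    return max(scores, key=scores.get)
```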
As a preferred technical solution, S3 comprises the following steps:
S31, performing word segmentation on the text to be detected, identifying the nouns w in the text, and masking them;
S32, predicting the possible output words p of a masked word by using the whole-word masking model trained in S1;
S33, combining the knowledge entity representative word set obtained in S2, calculating the score Score(w, c_j) that the masked word w belongs to category c_j;
S34, setting a threshold: when the score Score(w, c_j) is greater than the threshold, the masked word is determined to be a knowledge entity belonging to the corresponding entity category c_j; otherwise, the masked word is deemed not to be a knowledge entity (a prediction sketch follows).
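The mask-and-predict step in S31/S32 can be sketched with the Hugging Face fill-mask pipeline. A public Chinese BERT-WWM checkpoint stands in for the in-domain model pre-trained in S1; the example sentence and variable names are ours, and the threshold 0.6 is the value used in the embodiment below:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="hfl/chinese-bert-wwm-ext")

# a noun w in the segmented sentence is replaced by the mask token (S31)
text = "采用高功率[MASK]对材料损伤进行判定"
preds = fill(text, top_k=20)                       # S32: candidate output words
delta = 0.6                                        # probability threshold from S341
kept = [(r["token_str"], r["score"]) for r in preds if r["score"] > delta]
```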
As a preferred embodiment, in step S33, the calculation method of the score Score(w, c_j) that the masked word w belongs to category c_j is as follows:
S341, predicting the possible words p_k for the masked word w by using the pre-trained whole-word masking model; setting a threshold δ and taking out each vocabulary item p_k whose predicted probability is greater than δ; calculating the average semantic similarity between each extracted predicted word p_k and all representative words r_t of category c_j; then performing a weighted average of these similarities over the extracted predicted words p_k, finally obtaining the semantic similarity Sim(w, c_j) between the masked word w and the entity category c_j; the formula is as follows:

$$\mathrm{Sim}(w, c_j) = \sum_{k} \alpha_k \cdot \frac{1}{m_j} \sum_{t=1}^{m_j} \mathrm{sim}(p_k, r_t)$$

wherein m_j is the number of representative words contained in category c_j of the representative word set, and α_k is the weight of predicted word p_k in the weighted average (e.g., its normalized prediction probability);
S342, entity categories containing more entity words are set to have larger weights, so different weights are set for the different entity categories in the representative word set; letting n_j denote the number of entity words contained in entity cluster c_j, the weight calculation formula is as follows:

$$w_j = \frac{\log n_j}{\sum_{l=1}^{K} \log n_l}$$

wherein w_j denotes the weight value given to category c_j and K is the number of entity categories;
S343, recalculating the score Score(w, c_j) of the masked word w belonging to category c_j; the calculation formula is as follows (see the sketch below):

$$\mathrm{Score}(w, c_j) = w_j \cdot \mathrm{Sim}(w, c_j)$$
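Under the stated assumptions (probability-normalized weights α_k and log-normalized category weights w_j, which are our reading of the text), the scoring can be sketched as:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def category_scores(preds, reps):
    """preds: list of (predicted-word vector, prediction probability), prob > delta.
    reps:  dict mapping category name -> list of representative-word vectors.
    Returns Score(w, c_j) = w_j * Sim(w, c_j) for every category c_j."""
    total_p = sum(p for _, p in preds)
    log_sum = sum(np.log(len(r)) for r in reps.values())   # assumes every m_j >= 2
    scores = {}
    for cat, rvecs in reps.items():
        sim = sum((p / total_p) * np.mean([cos(v, r) for r in rvecs])
                  for v, p in preds)                        # Sim(w, c_j)
        scores[cat] = (np.log(len(rvecs)) / log_sum) * sim  # w_j * Sim(w, c_j)
    return scores
```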
As a preferred technical solution, step S1 includes the following steps:
S11, collecting the title, keyword, and abstract data of documents in the relevant scientific field from a public database to form basic corpus data; after de-duplicating the keywords and manually removing words that obviously do not belong to knowledge entities, adding the keywords to the domain dictionary, and splicing the title and abstract data to form the basic corpus;
S12, extracting the character strings whose frequency falls in a specified range from the basic corpus with an N-gram string frequency statistical algorithm; for character strings already present in the domain dictionary, updating their frequency, and directly adding character strings that do not yet appear in the domain dictionary, together with their frequencies, to the domain dictionary;
S13, performing word segmentation on the basic corpus in combination with the domain dictionary, applying whole-word masking to the words appearing in the domain dictionary, and training the whole-word masking model so that it obtains contextual semantic representations of the words in the domain.
The unsupervised identification system for scientific literature knowledge entities is used to implement the above unsupervised identification method and comprises the following modules connected in sequence:
a pre-training module, which processes the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructs a domain dictionary in combination with a string frequency statistical algorithm, then segments the training corpus under the guidance of the domain dictionary and inputs it into the whole-word masking model for training, so that the model learns the contextual semantic and grammatical features of words in the relevant scientific domain;
a knowledge entity category representative word learning module, which inputs the segmented training corpus, in combination with the domain dictionary, into a word vector representation model for training to obtain vector representations of the words in the domain dictionary, re-learns the word vectors with a contrastive learning structure model, and clusters them to obtain a set of knowledge entity representative words and their categories, serving in the recognition flow as the basis for judging whether a text word is a knowledge entity;
a knowledge entity recognition module, which masks the words in the scientific literature text to be recognized, predicts the masked words with the trained whole-word masking model, calculates similarity scores between the obtained predicted words and the words in the constructed representative word set, judges whether a masked word is a knowledge entity, and determines its category.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention adopts an unsupervised method that starts from completely unlabeled text data and avoids manual data labeling, which can greatly save labor costs in knowledge entity recognition tasks for scientific literature in specific fields and provides a solution for low-resource fields that lack structured labeled data;
(2) Without relying on any structured data set, the invention combines the idea of contrastive learning with a word vector clustering method to construct the knowledge entity representative word set; in this process, the characteristics of the training model are creatively exploited to perform data enhancement conversion and construct new word vectors, which improves clustering accuracy to a certain extent, so that a representative word and category set of good quality can be obtained as the guiding basis of the recognition method.
Drawings
FIG. 1 is a block diagram of a system of the present invention;
FIG. 2 is a schematic flow chart of a pre-training module according to the present invention;
FIG. 3 is a schematic flow chart of a learning module of a knowledge entity class representative word according to the present invention;
FIG. 4 is a flow chart of a knowledge entity recognition module according to the present invention;
FIG. 5 is a diagram of a network framework for training the comparative learning structural model in S25 of the present invention;
Fig. 6 is a schematic diagram of entity identification and classification in S3 according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in fig. 1 to 6, the invention provides an unsupervised identification method and system for scientific literature knowledge entities that starts from unlabeled text data, constructs a knowledge entity representative word set as a judgment basis by combining contrastive learning and clustering, and identifies the knowledge entities in literature texts in combination with a whole-word masking model; the manual labeling work of traditional knowledge entity recognition is thus avoided, time and human resources are saved, and an executable unsupervised recognition method is provided for knowledge entity recognition in low-resource scientific fields.
An unsupervised recognition system for scientific literature knowledge entities comprises a pre-training module, a knowledge entity category representative word learning module and a knowledge entity recognition module:
the pre-training module is used for: collecting literature data, processing the collected unlabeled scientific literature text data to obtain training corpus of a full-word covering model (BERT-WWM model), constructing a domain dictionary by combining a serial frequency statistical algorithm, then inputting the full-word covering model to train the full-word covering model after word segmentation processing of the training corpus by taking the dictionary as a guide, and enabling the model to learn the contextual semantics and grammatical features of words in the related scientific domain;
The knowledge entity category representative word learning module is used for: inputting training corpus combined with dictionary word segmentation in a pre-training module into a word vector representation model for training to obtain vector representation of words in the dictionary, relearning the vectors of the words by utilizing a comparison learning structure model, and clustering to obtain a set of knowledge entity representative words and categories thereof, wherein the set is used as a basis for judging whether text words are knowledge entities in a recognition process;
the knowledge entity identification module is used for: masking nouns in a scientific literature text to be detected, predicting masking words by using a trained full word masking model, and then calculating similarity scores between the obtained predicted words and words in a constructed representative word set so as to judge whether the masking words are knowledge entities or not and determine the categories of the masking words.
When in operation, the method specifically comprises the following steps:
s1, the purpose of the pre-training module is as follows: on one hand, acquiring and processing literature text in the appointed field, and providing corpus data for a knowledge entity category representative word learning module; on the other hand, a predictive model is provided for the knowledge entity recognition module by pre-training contextual representations of text words of the learning literature by full word masking techniques (Whole Word Masking, WWM).
The method comprises the following specific steps:
S11, collecting the titles, keywords, and abstract data of documents in the relevant scientific field from a public database using crawler technology to form basic corpus data; after de-duplicating the keywords and manually removing words that obviously do not belong to knowledge entities, adding the keywords to the domain dictionary, with each word's initial frequency being its counted number of repetitions; and splicing the title and abstract data to form the basic corpus;
s12, extracting character strings with frequency in a specified range from the basic corpus by adopting an N-gram string frequency statistical algorithm, then, updating the frequency of the character string words of the existing domain dictionary, and directly adding the character string words which do not appear in the domain dictionary and the frequency thereof into the domain dictionary;
s13, performing word segmentation processing by combining the basic corpus with a domain dictionary, performing full-word covering processing on words appearing in the domain dictionary, and training by adopting a full-word covering model so as to enable the full-word covering model to obtain the context semantic representation of the words in the domain;
further, the basic idea of the N-gram algorithm adopted in step S12 is to perform sliding window operation of size N on the text content according to the byte stream, so as to form a byte fragment sequence of length N. Each byte segment is called a gram, the occurrence frequency of all the grams is counted, and the strings with the lengths and the frequencies meeting the requirements are obtained by filtering according to a preset threshold value and a preset rule. Here we consider that the minimum byte length of the knowledge entity is 2, the maximum is 10, and the minimum frequency is 2.
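A minimal sketch of this string-frequency statistic, using the values above (character n-grams of length 2 to 10, minimum frequency 2); the function name is ours:

```python
from collections import Counter

def ngram_counts(corpus, min_len=2, max_len=10, min_freq=2):
    """Count every character n-gram of length min_len..max_len in the corpus
    and keep those whose frequency meets min_freq."""
    counts = Counter()
    for text in corpus:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_freq})
```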
Further, for Chinese, a word segmentation step is needed in both model pre-training and word vector representation, so a domain dictionary to guide segmentation must be constructed. The N-gram algorithm based on string frequency statistics is chosen to construct this dictionary; for the requirements of the invention there is no need to study the word-boundary problem of new-word discovery in depth, as the segmentation result only needs to contain the target words as far as possible.
Further, the whole-word masking model in step S13 adopts the BERT-WWM model, an upgrade of BERT that can predict masked words; it mainly changes the training sample generation strategy of the BERT pre-training stage:
BERT masks in units of single characters, so "… material damage determination …" may be masked as "… material [MASK] damage determination …" (only part of a word is masked), whereas BERT-WWM masks in units of whole words and masks the text as "… material [MASK] determination …" (the entire word is masked); the trained model is therefore more accurate in predicting the words at the masked positions (see the sketch below);
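The difference in masking granularity can be sketched with a toy whole-word masking routine; the mask rate and all names are assumptions:

```python
import random

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """words: the segmented sentence, each word given as a list of its
    characters/sub-tokens. A selected word is masked whole, one mask
    token per sub-token, instead of masking single characters."""
    out = []
    for word in words:
        if random.random() < mask_rate:
            out.extend([mask_token] * len(word))
        else:
            out.extend(word)
    return out

# e.g. whole_word_mask([["材", "料"], ["损", "伤"], ["判", "定"]])
# may yield ["材", "料", "[MASK]", "[MASK]", "判", "定"]
```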
S2, the purpose of the knowledge entity category representative word learning module is as follows: the training corpus segmented in combination with the dictionary in step S13 is input into a word vector representation model for training to obtain vector representations of the words in the dictionary; the word vector data are then clustered in combination with a contrastive learning method, and a small set of domain knowledge entity representative words and their categories is constructed to provide a judgment basis for the knowledge entity recognition module.
The specific steps of constructing the entity categories and representative word set are as follows:
S21, performing word vector representation model training on the corpus segmented in S13, extracting the word vectors of the words in the domain dictionary, and applying data enhancement conversion twice to the screened word vectors to obtain two new word vectors with the same category as the original word vector but different values; these new word vectors are mutually positive samples;
S22, each generated new word vector forms a negative sample pair with any other enhanced word vector in the space; with N word vectors before enhancement, each of the two new word vectors forms negative pairs with the 2N-2 other word vectors;
S23, re-learning and re-characterizing the enhanced word vectors through the contrastive learning structure model and mapping them into a new representation space, in which the loss function draws positive samples ever closer and pushes negative samples ever farther apart, so that the sample word vectors are dispersed and distributed as uniformly as possible in the new representation space;
S24, clustering the re-characterized word vectors (e.g., with the K-means algorithm), calculating the semantic similarity (e.g., cosine similarity) between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is greater than the threshold, and removing the words whose semantics differ too much, thereby obtaining the required knowledge entity representative word set; the category information of each cluster is obtained by manually inspecting the specific words in each cluster after clustering.
Further, in step S21, word vectors are learned by a representation model; Word2Vec and BERT are the commonly used word vector representation models, and Word2Vec is selected here because the word vectors of BERT focus on reflecting the contextual information of words, whereas the construction of the representative word set in the method of the present invention focuses more on the semantic representation of the words themselves.
Further, the choice of the data enhancement conversion mode in step S21 is the core link in constructing the contrastive learning framework. The invention adopts the approach of inputting the training samples into the model twice to obtain two numerically different feature vectors, described in detail as follows:
The training samples are input into Word2Vec twice for training, and the vector representations of the required words are extracted; because of the randomness of each model training run, even with identical training parameter settings, the same word obtains two numerically different word vectors.
This is because Word2Vec training starts from random initialization, and a different random seed is used for each training run, which can result in different initial word vectors: the relative positions of the word vectors in space remain constant, but the absolute position of each result may differ. Because training is based on the same corpus, the semantic features of the words are similar. That is, from an original sample word x, the two data enhancements yield x′ and x″; x′ and x″ differ from x, and from each other, only in the numerical values of the vectors, but retain the semantic and category feature information of sample x. Therefore x and x′, and x′ and x″, are mutually positive samples, i.e., entity words belonging to the same class (see the sketch below).
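A gensim sketch of this double-training data enhancement, matching the Word2Vec settings given in the embodiment (size=300, window=5, min_count=2, sg=1); the seeds and names are ours:

```python
from gensim.models import Word2Vec

def two_views(sentences, vocab, dim=300):
    """Train Word2Vec twice on the same segmented corpus with different seeds;
    the two runs give numerically different vectors for each word while
    preserving its semantic features: the positive pair."""
    views = []
    for seed in (1, 2):
        m = Word2Vec(sentences, vector_size=dim, window=5, min_count=2,
                     sg=1, seed=seed, workers=1)  # workers=1 keeps runs reproducible
        views.append({w: m.wv[w] for w in vocab if w in m.wv})
    return views  # views[0][w] and views[1][w] form the positive pair for word w
```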
Further, the contrastive learning network structure adopted in step S23 is described in detail as follows:
The two new word vectors x′ and x″, obtained from the original sample word vector x after the data enhancement conversion, are passed through the feature encoder (Encoder) and converted into the corresponding feature vectors h′ and h″; this network structure consists of two fully connected layers (Fully Connected Layer, FC) and the nonlinear activation function Tanh, and is denoted by the function f(·). Subsequently, another nonlinear transformation structure, the Projector, further maps h′ and h″ into the vectors z′ and z″ in another space; it is composed of a fully connected layer (FC), batch normalization (Batch Normalization, BN), and a nonlinear activation function (ReLU), and is denoted by the function g(·). The data pair (z′, z″) are mutually positive samples, while z′ and z″ each form negative samples with any of the other 2N-2 vectors in the space. After the g(·) transformation, the enhancement vectors are projected into the new representation space, in which it is desired that positive samples be closer together and negative samples farther apart. This is achieved by defining a suitable loss function, with a semantic similarity measure as the criterion for judging whether the spatial distance is far or near. The specific loss function is as follows:
$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\,\lVert z_j \rVert}$$

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left(\ell_{2k-1,\,2k} + \ell_{2k,\,2k-1}\right)$$

wherein i, j, and k denote sample numbers; ℓ_{i,j} denotes the loss of the sample pair formed by samples i and j; z_i, z_j, and z_k denote the vectors of the samples numbered i, j, and k after transformation by the contrastive learning structure model; sim(·,·) denotes the similarity of two samples, calculated as cosine similarity; N denotes the total number of samples before data enhancement; 1_{[k≠i]} denotes an adjustment parameter taking the value 0 or 1 (1 when k ≠ i, otherwise 0); τ denotes the temperature hyperparameter; L denotes the final loss function.
The numerator of ℓ_{i,j} describes the similarity of the positive sample pair, while the denominator represents the sum of the similarities between the current sample and the other samples in the batch (the number of samples selected for one training step); ℓ_{i,j} thus expresses the probability that samples i and j are similar. Here z denotes the vector representation of a sample after the g(·) transformation, and the temperature hyperparameter τ (which scales the input and expands the range of the cosine similarity) controls the sensitivity of the loss to negative sample pairs; sim(·,·) computes the semantic similarity of the two vectors. L calculates the loss of all pairings and averages it, where 2N denotes the 2N samples obtained after processing the N samples of the original batch.
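A PyTorch sketch of the Encoder f(·) and Projector g(·) described above; the layer widths and output dimension are assumptions not fixed by the text:

```python
import torch.nn as nn

class ContrastiveNet(nn.Module):
    """Encoder f(.): two FC layers with Tanh; Projector g(.): FC + BN + ReLU."""
    def __init__(self, dim: int = 300, hidden: int = 300, out: int = 128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, hidden), nn.Tanh())
        self.g = nn.Sequential(nn.Linear(hidden, out),
                               nn.BatchNorm1d(out),
                               nn.ReLU())

    def forward(self, x):
        h = self.f(x)      # feature vector h' / h''
        return self.g(h)   # z' / z'' fed to the contrastive loss
```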
Further, the K-means algorithm adopted in step S24 is described in detail as follows:
(1) K words are selected from the word vector space re-characterized after contrastive learning as the initial cluster centers;
(2) the distances between all other word vectors in the word vector space and each cluster center are calculated, and each sample word is assigned to the closest cluster; the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word is considered to belong to that cluster;
(3) after all sample words in the vector space have been processed, the mean vector of the sample words in each cluster is calculated and taken as the new cluster center, updating the original cluster center;
(4) steps (2) and (3) are repeated until the cluster centers no longer change, i.e., convergence, and training is complete.
The mean vector calculation formula of the sample words in step (3) is as follows:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

wherein x is a sample word vector in cluster C_j, μ_j is the mean vector of cluster C_j, and |C_j| is the number of samples in cluster C_j.
It should be further noted that the setting scheme for the cluster number K uses the silhouette coefficient method for reverse evaluation, specifically as follows:
Assume the data to be classified have been clustered by the clustering algorithm into K clusters. For each sample word i in each cluster, its silhouette coefficient is calculated separately; specifically, the following two indices are calculated for each sample word i:
(1) a(i): the average distance from sample word i to the other sample words belonging to the same cluster; the smaller a(i), the greater the likelihood that sample word i belongs to that category;
(2) b(i): the minimum, over the other clusters C, of the average distance from sample word i to all samples in cluster C, i.e.:

$$b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{x \in C} d(i, x)$$

Then the silhouette coefficient of sample word i is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

and the average of the silhouette coefficients s(i) of all sample words is the average silhouette coefficient S of the clustering result, with S ∈ [−1, 1].
The closer the samples within a cluster and the farther apart the samples of different clusters, the larger the average silhouette coefficient and the better the clustering effect; so the K with the largest average silhouette coefficient is the best cluster number.
S3, the purpose of the knowledge entity recognition module is as follows: the prediction model obtained by pre-training in S1 is combined with the knowledge entity categories and representative word set constructed in S2 to identify the knowledge entities in the text to be detected. The specific steps are as follows:
S31, performing word segmentation on the text to be detected, identifying the nouns w in the text, and masking them;
S32, predicting the possible output words p of a masked word w by using the BERT-WWM whole-word masking model trained in S13;
S33, combining the domain knowledge entity representative word set obtained in module S2, calculating the score Score(w, c_j) that the masked word w belongs to category c_j;
S34, setting a threshold: when the score Score(w, c_j) is greater than the threshold, the masked word is determined to be a knowledge entity belonging to the corresponding entity category c_j; otherwise, the masked word is deemed not to be a knowledge entity.
Further, the method of judging whether a masked word is a knowledge entity according to its category score Score(w, c_j) is described in detail as follows:
Step one, the text of the data to be recognized is segmented in combination with the domain dictionary, the nouns in it are identified and denoted w; w is masked and the masked part is predicted with the trained BERT-WWM model to obtain predicted words p_k. A threshold δ is set, the vocabulary items p_k whose predicted probability is greater than δ are taken out, and the average semantic similarity of each extracted p_k to all representative words r_t of each category c_j is calculated. The semantic similarities of the extracted predicted words p_k to entity category c_j are then weighted-averaged, finally yielding the semantic similarity Sim(w, c_j) between the masked word w and the entity category c_j; the formula is as follows:

$$\mathrm{Sim}(w, c_j) = \sum_{k} \alpha_k \cdot \frac{1}{m_j} \sum_{t=1}^{m_j} \mathrm{sim}(p_k, r_t)$$

wherein m_j is the number of representative words in category c_j and α_k is the weight of predicted word p_k in the weighted average (e.g., its normalized prediction probability).
Step two, entity categories containing more entity words are given larger weights, and weights w_j are set for entity categories of different scales; letting n_j denote the number of entity words contained in entity cluster c_j, the weight calculation formula is as follows:

$$w_j = \frac{\log n_j}{\sum_{l=1}^{K} \log n_l}$$

It should be noted that the logarithm of the number of elements in a category is taken here to reduce the impact of the element count on the weight.
Step three, the score Score(w, c_j) of the entity attribution category is calculated again; if there is an entity category c_j such that Score(w, c_j) is greater than the set threshold, the masked word w is considered to belong to the corresponding entity category; otherwise the word is not considered an entity word. The specific score calculation is as follows:

$$\mathrm{Score}(w, c_j) = w_j \cdot \mathrm{Sim}(w, c_j)$$
Example 2
As further optimization of embodiment 1, as shown in fig. 1 to 6, this embodiment further includes the following technical features on the basis of embodiment 1:
as shown in fig. 1, the embodiment of the invention provides an unsupervised recognition method for a knowledge entity of a scientific literature in the laser field, which comprises a pre-training module, a knowledge entity category representative word learning module and a knowledge entity recognition module. The pre-training module is used for constructing a topic dictionary in the laser field and learning context semantic and grammar characteristics of words in the laser field through a full word covering model; the knowledge entity category representative word learning module is used for clustering to construct a small-scale laser knowledge entity representative word set with definite types, so as to be used as a guiding basis for judging whether words in the text to be detected are knowledge entities in the knowledge entity recognition module; the knowledge entity recognition module is used for recognizing the knowledge entity in the text to be detected by combining the laser field dictionary and the laser knowledge entity representative word set.
According to fig. 2, the pre-training module can provide a laser domain dictionary required in the knowledge entity recognition process, a model training corpus required by the knowledge entity representative word learning module, and a BERT-WWM model with laser domain priori knowledge learned. The detailed implementation steps are as follows:
Step one, 6,598 scientific documents on "laser damage" were collected from the public dataset. The title and abstract data were spliced and filtered with a Chinese general stop-word list to obtain the basic corpus for pre-training; the keywords were stored in the domain dictionary file one word per line and de-duplicated, with each word's repetition count taken as its initial frequency.
Step two, the maximum entity word length is set to L = 10; the basic corpus is filtered with a Chinese general stop-word list, and string frequency statistics are performed on word strings of length at most L and at least 2. The word strings whose counted frequency is greater than a set threshold are retained: if a word string already exists in the initial domain dictionary, its frequency is updated to the sum of the initial frequency and the counted frequency; new word strings and their frequencies are directly added to the domain dictionary, finally yielding the final domain dictionary. The threshold is set to 2 here because strings occurring fewer than 2 times are considered too infrequent to belong to a knowledge entity. This embodiment obtains 8,884 entity words in total for the initial laser domain dictionary.
Step three, the basic corpus is segmented using a word segmentation tool (e.g., jieba) in combination with the domain dictionary; the segmented corpus is used as the input corpus of the BERT-WWM model, whole-word masking is applied to the words appearing in the domain dictionary, and model pre-training is then performed so that the model learns the prior knowledge of the laser domain, yielding the prediction model required by the knowledge entity recognition module.
In order to save resources, BERT-WWM model training adopts a two-stage pre-training mode, with sentence length 128 in the first stage and 512 in the second stage. The main pre-training tasks used are whole-word masking (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP); since the task of the method of the present invention is unsupervised knowledge entity recognition, no task-level (e.g., classification) pre-training tasks are performed.
According to the illustration shown in fig. 3, the domain knowledge entity representative word learning module can obtain a small-scale definite type laser knowledge entity representative word set through a clustering algorithm, and provides a judgment basis for knowledge entity identification. The detailed implementation steps are as follows:
Step one, the corpus segmented in the pre-training module is used as the input corpus of a Word2Vec model for word vector representation learning, with the main parameters set as follows: size=300 (word vector dimension), window=5, min_count=2, sg=1 (using the Skip-gram model). After training, the model is saved and the word vectors of the laser domain dictionary are extracted; the training samples are then input into Word2Vec twice more for training to perform data enhancement, keeping the training parameter settings consistent and distinguishing and marking the word vectors of the same words; the word vectors are then re-characterized through the contrastive learning network structure shown in fig. 5;
Step two, clustering is finally performed in the new characterization space using the K-means algorithm; for each vocabulary item of each cluster in the clustering result, the specific entity category represented by the cluster is determined manually from its specific representative words, and the entity words whose semantic similarity is greater than a set threshold are screened out as the laser knowledge entity representative word set;
Finally, the categories of knowledge entities in the laser field are divided into: laser type (T), experimental theory (A), experimental resource (R), experimental operation (H), experimental result (O), and others (E), where the number of representative words in each category is determined by the set threshold size.
According to the illustration in fig. 4, the knowledge entity recognition module can recognize the laser knowledge entity in the text to be detected through a pre-trained predictive model. The detailed implementation steps are as follows:
Step one, word segmentation is performed on the text to be detected in combination with the constructed laser domain dictionary, the nouns in the text are identified, each word is labeled using the divided entity categories, and 100 nouns are labeled, obtaining text that can be used for testing;
Step two, each identified word is masked with [MASK], and the possible words p_k for the masked part are predicted using the pre-trained BERT-WWM model; the semantic similarity Sim(w, c_j) between the masked word and entity category c_j is then calculated: as shown in fig. 6, a threshold δ is set (e.g., 0.6), the vocabulary items p_k whose predicted probability is greater than δ are taken out, and the average semantic similarity of each p_k to all representative words r_t in category c_j is calculated; the semantic similarities of the extracted predicted words p_k to entity category c_j are weighted-averaged to obtain the final Sim(w, c_j);
Step three, the score Score(w, c_j) of the masked word belonging to each category is calculated with the above formula; if there is an entity category c_j such that Score(w, c_j) is greater than the set threshold, the masked word is judged to be a laser knowledge entity whose category is the corresponding entity category c_j; otherwise it is not.
Finally, among the 100 labeled words, the correct words were identified and 47 of them were also assigned the correct attribution category, demonstrating the feasibility of the invention.
As described above, the present invention can be preferably implemented.
All of the features disclosed in all of the embodiments of this specification, or all of the steps in any method or process disclosed implicitly, except for the mutually exclusive features and/or steps, may be combined and/or expanded and substituted in any way.
The foregoing description of the preferred embodiment of the invention is not intended to limit the invention in any way, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. The non-supervision identification method for the scientific literature knowledge entity is characterized in that a full-word covering model is pre-trained by utilizing unlabeled scientific literature text data, a set of knowledge entity representative words and categories thereof is constructed by combining a comparison learning and clustering method to serve as a judgment basis, then words in the scientific literature text are predicted by utilizing the pre-trained full-word covering model, whether the words in the scientific literature text are the knowledge entity or not is judged by calculating the similarity between the predicted words and the representative words, and the categories of the words in the scientific literature text are determined;
the method comprises the following steps:
s1, pre-training: processing the collected unlabeled scientific literature text data to obtain a training corpus of a full-word covering model, constructing a domain dictionary by combining a serial frequency statistical algorithm, then inputting the full-word covering model to train the full-word covering model after the training corpus is subjected to word segmentation processing by taking the domain dictionary as a guide, and enabling the full-word covering model to learn the contextual semantics and grammatical features of words in the related scientific domain;
s2, learning the knowledge entity category representative word: inputting the training corpus after word segmentation in the S1 by combining with the domain dictionary into a word vector representation model for training to obtain the vector representation of the words in the domain dictionary, relearning the vectors of the words by utilizing a comparison learning structure model, and clustering to obtain a set of knowledge entity representative words and the categories thereof, wherein the set is used as a basis for judging whether the text words are the knowledge entities in the recognition flow;
S3, knowledge entity recognition: masking words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, and then calculating similarity scores between the obtained predicted words and the words in the representative word set constructed in S2, so as to judge whether the masked words are knowledge entities and determine their categories;
step S2 comprises the steps of:
S21, training the word vector representation model on the segmented training corpus, extracting the word vectors of the words in the domain dictionary, and performing data enhancement twice on each extracted word vector to obtain two new word vectors with the same semantic features as the original word vector but different values, the two new word vectors being positive examples of each other;

S22, taking each of the two generated new word vectors together with any other data-enhanced word vector as a negative example pair; if there are N word vectors before the two data enhancements, the enhanced word vector space contains 2N word vectors, so each of the two new word vectors forms negative example pairs with the 2N-2 other word vectors;

S23, re-representing the twice-enhanced word vectors in a new vector space through the contrastive learning structure model, in which a loss function constrains positive-class samples to move ever closer together and negative-class samples to move ever farther apart, so that the word vectors become as dispersed and uniformly distributed in the new representation space as possible;

S24, clustering the re-characterized word vectors, calculating the semantic similarity between each cluster center and the other words after clustering, setting a threshold, screening out the entity words whose semantic similarity is larger than the set threshold, and determining the category represented by each cluster from the specific entity words it contains, thereby obtaining the required knowledge entity representative word set.
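Steps S21 and S22 resemble SimCSE-style augmentation, in which two stochastic dropout passes over the same vector yield a positive pair. A minimal PyTorch sketch follows; the dimensions and the dropout rate are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# Sketch of S21-S22: two dropout passes over the same word vectors yield two
# "views" with the same semantics but different values (positive pairs); all
# other augmented vectors in the batch act as negatives.
torch.manual_seed(0)
dropout = nn.Dropout(p=0.1)            # nn.Dropout applies noise in training mode
word_vecs = torch.randn(8, 300)        # N = 8 word vectors of dimension 300

view_a = dropout(word_vecs)            # first data enhancement
view_b = dropout(word_vecs)            # second data enhancement
batch = torch.cat([view_a, view_b], 0) # 2N vectors in the enhanced space
# For word i, (view_a[i], view_b[i]) is the positive pair; the remaining
# 2N - 2 vectors in `batch` are its negative examples.
```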
2. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 1, wherein in step S23 the loss function is as follows:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$

$$L = \frac{1}{2N} \sum_{k=1}^{N} \left( \ell_{2k-1,\,2k} + \ell_{2k,\,2k-1} \right)$$

wherein $i$, $j$, $k$ denote sample numbers; $\ell_{i,j}$ denotes the loss of the sample pair composed of samples $i$ and $j$; $z_i$ denotes the vector of the sample numbered $i$ after transformation by the contrastive learning structure model, $z_j$ denotes the vector of the sample numbered $j$ after transformation by the contrastive learning structure model, and $z_k$ denotes the vector of the sample numbered $k$ after transformation by the contrastive learning structure model; $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity of two samples, calculated as the cosine similarity; $N$ denotes the total number of samples before data enhancement; $\mathbb{1}_{[k \neq i]}$ denotes an adjustment parameter taking the value 0 or 1: when $k \neq i$ it takes the value 1, otherwise 0; $\tau$ denotes a temperature parameter for controlling the degree of uniformity of the sample distribution; $L$ denotes the final loss function.
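This is the NT-Xent (InfoNCE) contrastive loss; a compact PyTorch sketch is given below. The pairing layout (rows 2k and 2k+1 hold the positive pair of sample k), the dimensions, and the temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive loss of claim 2 over 2N re-characterized vectors `z`,
    arranged so rows (2k, 2k+1) are the positive pair of sample k.
    tau is the temperature controlling distribution uniformity."""
    z = F.normalize(z, dim=1)              # cosine similarity via dot product
    sim = z @ z.t() / tau                  # pairwise sim(z_i, z_k) / tau
    n2 = z.size(0)
    mask = torch.eye(n2, dtype=torch.bool)
    sim.masked_fill_(mask, float("-inf"))  # realizes the indicator 1[k != i]
    targets = torch.arange(n2) ^ 1         # partner index: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, targets)   # mean of -log softmax at the partner

loss = nt_xent_loss(torch.randn(8, 128))   # example: 2N = 8 augmented vectors
```

Averaging the per-row cross-entropy over all 2N rows is equivalent to the $\frac{1}{2N}\sum_k (\ell_{2k-1,2k} + \ell_{2k,2k-1})$ form of $L$ above.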
3. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 2, wherein in step S24 the re-characterized word vectors are clustered with the K-means algorithm, comprising the following steps:

S241, selecting K words in the word vector space re-characterized after contrastive learning as the initial cluster centers;

S242, calculating the distance between every other word vector in the word vector space and each cluster center, and assigning each sample word to the nearest cluster, wherein the smaller the distance between a word vector and a cluster center, the greater the probability that the corresponding sample word belongs to that cluster;

S243, after all sample words in the vector space have been processed, calculating the mean vector of the sample words in each cluster and taking it as the new cluster center, replacing the original cluster center; the mean vector of the sample words is calculated as:

$$\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$

wherein $\mu_i$ denotes the mean vector of the sample words in cluster $C_i$, $C_i$ denotes a cluster, $x$ denotes a word vector in cluster $C_i$, and $\lvert C_i \rvert$ denotes the number of sample words in cluster $C_i$;

S244, repeating steps S242 to S243 until the cluster centers no longer change, at which point training is complete.
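The loop of S241 to S244 can be sketched in NumPy as follows; this is an illustration rather than the patented implementation, and it assumes no cluster becomes empty during iteration:

```python
import numpy as np

def kmeans(word_vecs: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Sketch of S241-S244 over re-characterized word vectors (one per row)."""
    rng = np.random.default_rng(seed)
    # S241: pick K words as the initial cluster centers
    centers = word_vecs[rng.choice(len(word_vecs), size=k, replace=False)]
    for _ in range(n_iter):
        # S242: assign each sample word to its nearest cluster center
        dists = np.linalg.norm(word_vecs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S243: the mean vector of each cluster becomes the new center
        # (assumes every cluster keeps at least one member)
        new_centers = np.array([word_vecs[labels == i].mean(axis=0) for i in range(k)])
        # S244: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```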
4. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 3, wherein the cluster number K is set as follows:

assuming the data to be classified are clustered by the clustering algorithm into K clusters, the silhouette coefficient of each sample word in each cluster is calculated from the following quantities:

$a(i)$: the average distance from sample point $i$ to the other sample points in the same cluster; the smaller $a(i)$, the greater the likelihood that the sample point belongs to its cluster;

$b(i)$: the minimum, over all other clusters, of the average distance $b_{ik}$ from sample point $i$ to all samples of cluster $k$; $b(i)$ is calculated as:

$$b(i) = \min\left(b_{i1}, b_{i2}, \ldots, b_{iK}\right)$$

taken over the clusters other than the one containing sample point $i$; the silhouette coefficient of sample point $i$ is then:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

wherein $s(i)$ denotes the silhouette coefficient of sample point $i$;

the average of the silhouette coefficients of all sample points is the average silhouette coefficient $S$ of the clustering result, $S \in [-1, 1]$; the closer the samples within a cluster and the farther apart the samples between clusters, the larger the average silhouette coefficient and the better the clustering effect, so the cluster number K is set to the value that maximizes the average silhouette coefficient.
5. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 4, wherein S3 comprises the following steps:

S31, segmenting the text to be examined into words, recognizing the nouns $w$ in the text, and masking them;

S32, predicting the possible output words $y_i$ for each masked word with the whole-word masking model obtained in S1;

S33, combining the knowledge entity representative word set obtained in S2, calculating the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to each category $T_k$;

S34, setting a threshold; when the score $\mathrm{score}(w, T_k)$ is greater than the threshold, the masked word is judged to be a knowledge entity belonging to the corresponding entity class $T_k$; otherwise the masked word is judged not to be a knowledge entity.
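Steps S31 and S32 can be illustrated with the HuggingFace fill-mask pipeline. The checkpoint hfl/chinese-bert-wwm-ext and the example sentence are stand-ins, since the patent's own pre-trained model is not public:

```python
from transformers import pipeline

# Sketch of S31-S32: mask a noun in the text and read off the model's
# candidate words with their prediction probabilities. The single [MASK]
# here stands in for the whole-word masking of a full domain term.
fill = pipeline("fill-mask", model="hfl/chinese-bert-wwm-ext")

sentence = "该系统采用[MASK]器产生高能脉冲。"   # hypothetical example sentence
predictions = fill(sentence, top_k=10)
candidates = [(p["token_str"], p["score"]) for p in predictions]
```

The (word, probability) pairs feed directly into the scoring of S33, where low-probability predictions are filtered by the threshold $p_0$.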
6. The unsupervised identification method oriented to scientific literature knowledge entities according to claim 5, wherein in step S33 the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to category $T_k$ is calculated as follows:

S341, predicting the possible words $y_i$ for the masked word $w$ with the pre-trained whole-word masking model; setting a threshold $p_0$ and taking out the predicted words $y_i$ whose prediction probability $p_i$ is greater than $p_0$; calculating the average semantic similarity between each extracted predicted word $y_i$ and all representative words $r_j$ of entity class $T_k$; then performing a weighted average of the semantic similarities of the extracted predicted words with entity class $T_k$ to finally obtain the semantic similarity $\mathrm{sim}(w, T_k)$ between the masked word $w$ and entity class $T_k$, with the formula:

$$\mathrm{sim}(w, T_k) = \sum_{i} p_i \cdot \frac{1}{M_k} \sum_{j=1}^{M_k} \cos\left(y_i, r_j\right)$$

wherein $M_k$ is the number of representative words contained in category $T_k$ of the representative word set;

S342, setting larger weights for the entity categories containing more entity words, different weights being assigned to the different entity categories of the representative word set; with $n_k$ denoting the number of entity words contained in entity cluster $k$, the weight is calculated as:

$$\lambda_k = \frac{n_k}{\sum_{m} n_m}$$

wherein $\lambda_k$ denotes the weight assigned to category $T_k$;

S343, recalculating the score $\mathrm{score}(w, T_k)$ of the masked word $w$ belonging to category $T_k$ with the formula:

$$\mathrm{score}(w, T_k) = \lambda_k \cdot \mathrm{sim}(w, T_k)$$
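Under the reconstruction above, in which the extracted predicted words are weighted by their prediction probabilities and categories by normalized cluster size (both inferred from the surrounding text rather than stated verbatim in the original formula images), S341 to S343 can be sketched as:

```python
import numpy as np

def category_score(pred, rep_words, cluster_sizes, category):
    """Sketch of S341-S343: `pred` is a list of (prob, vector) pairs for the
    predicted words with prob > p0; `rep_words` maps each category T_k to the
    vectors of its representative words; `cluster_sizes` maps T_k to the
    number n_k of entity words in its cluster."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    reps = rep_words[category]
    # sim(w, T_k): probability-weighted average of each predicted word's mean
    # cosine similarity to the M_k representative words of the category
    sim = sum(p * np.mean([cos(v, r) for r in reps]) for p, v in pred)
    # lambda_k: categories holding more entity words receive larger weights
    weight = cluster_sizes[category] / sum(cluster_sizes.values())
    return weight * sim   # score(w, T_k) = lambda_k * sim(w, T_k)
```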
7. The unsupervised identification method oriented to scientific literature knowledge entities according to any one of claims 1 to 6, wherein step S1 comprises the following steps:

S11, collecting the title, keyword, and abstract data of documents in the relevant scientific field from a public database to form the basic corpus data; after deduplicating the keywords and manually removing words that obviously are not knowledge entities, adding the keywords to the domain dictionary; and splicing the title and abstract data into the basic corpus;

S12, extracting from the basic corpus, with an N-gram string frequency statistics algorithm, the character strings whose frequencies fall within a specified range; then updating the frequencies of the character strings already in the domain dictionary, and directly adding the character strings not yet in the domain dictionary, together with their frequencies, to the domain dictionary;

S13, performing word segmentation on the basic corpus with the domain dictionary, applying whole-word masking to the words that appear in the domain dictionary, and training the whole-word masking model so that it obtains the contextual semantic representations of the words in the domain.
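Step S12 can be illustrated with a plain Python string frequency count; the n-gram lengths and the frequency band below are illustrative assumptions:

```python
from collections import Counter

def ngram_candidates(corpus: list[str], n_range=(2, 3, 4),
                     min_freq: int = 5, max_freq: int = 500) -> dict[str, int]:
    """Sketch of S12: N-gram string frequency statistics over the basic corpus,
    keeping character strings whose frequency falls in the specified range."""
    counts = Counter()
    for text in corpus:
        for n in n_range:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {s: c for s, c in counts.items() if min_freq <= c <= max_freq}

# Retained strings are then merged into the domain dictionary: frequencies of
# existing entries are updated, and new strings are added with their counts.
```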
8. An unsupervised identification system oriented to scientific literature knowledge entities, used to implement the unsupervised identification method oriented to scientific literature knowledge entities according to any one of claims 1 to 7, characterized by comprising the following modules connected in sequence:

a pre-training module: for processing the collected unlabeled scientific literature text data to obtain the training corpus of the whole-word masking model, constructing the domain dictionary with a string frequency statistics algorithm, and then, after segmenting the training corpus into words under the guidance of the domain dictionary, inputting it into the whole-word masking model for training, so that the whole-word masking model learns the contextual semantic and grammatical features of words in the relevant scientific domain;

a knowledge entity category representative word learning module: for inputting the segmented training corpus with the domain dictionary into the word vector representation model for training to obtain the vector representations of the words in the domain dictionary, relearning the word vectors with the contrastive learning structure model, and clustering them to obtain the set of knowledge entity representative words and their categories, which serves in the recognition flow as the basis for judging whether a text word is a knowledge entity;

a knowledge entity recognition module: for masking the words in the scientific literature text to be recognized, predicting the masked words with the trained whole-word masking model, calculating the similarity scores between the obtained predicted words and the words in the constructed representative word set, judging whether the masked words are knowledge entities, and determining their categories.
CN202310323198.6A 2023-03-30 2023-03-30 Unsupervised identification method and system oriented to scientific literature knowledge entity Active CN116050419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310323198.6A CN116050419B (en) 2023-03-30 2023-03-30 Unsupervised identification method and system oriented to scientific literature knowledge entity


Publications (2)

Publication Number Publication Date
CN116050419A CN116050419A (en) 2023-05-02
CN116050419B true CN116050419B (en) 2023-06-02

Family ID: 86129854

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116798633B (en) * 2023-08-22 2023-11-21 北京大学人民医院 Construction method of wound data security risk assessment system and electronic equipment


Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Named entity recognition method in the geography field
CN113988073A (en) * 2021-10-26 2022-01-28 迪普佰奥生物科技(上海)股份有限公司 Text recognition method and system suitable for life science
CN114282592A (en) * 2021-11-15 2022-04-05 清华大学 Deep learning-based industry text matching model method and device
CN114254653A (en) * 2021-12-23 2022-03-29 深圳供电局有限公司 Scientific and technological project text semantic extraction and representation analysis method

Non-Patent Citations (3)

Title
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark; Ningyu Zhang et al.; Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; Vol. 1, pp. 7888-7915 *
Named entity recognition method for the government affairs domain based on BERT-BLSTM-CRF; Yang Chunming et al.; Journal of Southwest University of Science and Technology; Vol. 35, No. 3, pp. 86-91 *
A BERT-based named entity recognition model for hazardous chemicals; Chen Guanlin et al.; Guangxi Sciences; Vol. 30, No. 1, pp. 43-51 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant