CN116401369A - Entity identification and classification method for biological product production terms - Google Patents
- Publication number
- CN116401369A (application CN202310665618.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- production
- entity
- word vector
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention belongs to the field of entity identification of production terms in the biopharmaceutical production process, and provides an entity identification and classification method for the production terms of biological products, which comprises the following steps: word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained; manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set; constructing a word vector +BiLSTM +CRF neural network model based on the first word vector model, and training the model on the constructed data set to obtain a second word vector model; performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result; the entity word vectors in the data set are clustered into 20-50 clusters through a modified k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and the entity word vectors of the identified data set.
Description
Technical Field
The invention relates to the field of entity identification of production terms in the biopharmaceutical production process, in particular to an entity identification and classification method for the production terms of biological products.
Background
With the continuing, deepening development of intelligent manufacturing, production terms used in biopharmaceutical industry production need to be processed by machine learning and recognized automatically by computer. An entity recognition and classification method is an important basis for realizing intelligent production and control, and is also an underlying technology for intelligent-manufacturing information processing.
Disclosure of Invention
The invention aims to provide an entity identification and classification method for biological product production terms, which can accurately realize automatic identification and classification of biological product production terms.
To solve the above technical problem, the invention adopts the following technical solution:
an entity identification and classification method for biological product production terms, comprising the steps of:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
the entity word vectors in the data set are clustered into 20-50 clusters through a modified k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and the entity word vectors of the identified data set.
As a further explanation, word vector training is carried out on the unlabeled corpus in biopharmaceutical production by using the continuous bag-of-words (CBOW) model in Word2vec, and the corpus selects the characters or words of terms commonly used in biopharmaceutical production;
the continuous bag-of-words model predicts a certain center word by using information of the front and rear 2c words of the center word, and the model is expressed as:
wherein T represents the current time, T represents the total time number, andis the center word in the text of the current production term,2c words before and after the central word, and predicting the central word according to the known 2c words and the continuous word bag model>And the center word->The probability of occurrence is related to the 2c words before and after the word.
As a further illustration, the continuous bag-of-words model first applies one-hot encoding to the center word $w_t$, and the 2c words before and after it form the corresponding word vectors $x_{t-c},\ldots,x_{t-1},x_{t+1},\ldots,x_{t+c}$; these 2c word vectors are sent to the input layer, multiplied by a shared weight matrix in the projection layer, and sent to the output layer, which finally yields the probability $p\left(w_t \mid w_{t-c},\ldots,w_{t+c}\right)$ of the center word given the 2c surrounding words; training by maximum likelihood estimation yields the final word vector $e_i$ of each word $w_i$, and the trained word vectors are written uniformly as $\{e_1, e_2, \ldots, e_n\}$, where n represents the number of word vectors.
As a further illustration, the artificial labeling of unlabeled corpora in biopharmaceutical production, constructing a data set, specifically includes:
preprocessing data of original corpus, including deleting irrelevant content, special symbols and removing stop words;
according to the differences between actual production lines, the entity categories to be recognized are preliminarily determined; the entity categories comprise preventive biological products, therapeutic biological products, and in vivo/in vitro diagnostic products;
in the labeling process, according to the characteristics of the text of the production term, the entities are labeled by adopting a BIO labeling method, the beginning part of the entity is represented by B, the non-beginning part of the entity is represented by I, and the non-entity part is represented by O.
As a further illustration, the construction of the word vector +BiLSTM +CRF neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity identification model are updated by the SGD (stochastic gradient descent) method; the specific method of updating the parameters of the entity identification model by SGD is as follows: a training sample is randomly drawn, the gradient of the error on this sample with respect to the parameters is calculated, and the parameter values are then updated continually in the negative gradient direction until the objective function reaches its minimum value and the iteration stops.
By way of further illustration, the states of the BiLSTM neural network model neurons are calculated by the following formulas:

$$f_t = \sigma\left(W_f\cdot[h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i\cdot[h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(W_c\cdot[h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t\odot c_{t-1} + i_t\odot\tilde{c}_t$$
$$o_t = \sigma\left(W_o\cdot[h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t\odot\tanh(c_t)$$

wherein $\sigma$ is the Sigmoid function, $x_t$ is the input word vector at the current moment, $h_{t-1}$ is the hidden-layer state at the previous moment, $f_t$ is the forget gate, which decides which information is forgotten, $i_t$ is the input gate, which decides which information is retained, $\tilde{c}_t$ is the intermediate state obtained from the current input word vector $x_t$, $c_t$ is the memory cell, which controls the change of the cell state, $c_{t-1}$ is the state value at the previous moment, $o_t$ is the output gate, i.e. the output value of the memory cell, $h_t$ is the hidden-layer state at the current moment, $W_f$ is the feedback connection matrix of the forget gate, $W_i$ the feedback matrix of the input gate, $W_c$ the feedback matrix of the hidden unit, $W_o$ the feedback matrix of the output gate, and $b_f$, $b_i$, $b_c$, $b_o$ are the thresholds of the forget gate, the input gate, the hidden-layer unit and the output gate, respectively.
As a further illustration, the inputting the obtained feature vector with the context information into the CRF, extracting the dependency features between the labels, and calculating the loss function specifically includes:
given the word vector sequence $X=(x_1, x_2, \ldots, x_n)$ corresponding to the input words of the production term text and the predicted label sequence $y=(y_1, y_2, \ldots, y_n)$ corresponding to each input word, the prediction score $s(X, y)$ of y is defined as:

$$s(X, y)=\sum_{i=0}^{n} A_{y_i, y_{i+1}}+\sum_{i=1}^{n} P_{i, y_i}$$

wherein $A$ is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; $P$ is the probability score matrix, transformed from the feature matrix carrying the context information; $P_{i,j}$ is the probability that the i-th word is labeled with label j; and t is the number of predicted labels;

the probability of y is calculated from the defined prediction score according to the Softmax function:

$$p(y \mid X)=\frac{\exp\left(s(X, y)\right)}{\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)}$$

the log-likelihood function of this probability is:

$$\log p(y \mid X)=s(X, y)-\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)$$

wherein $y$ represents the actual labeling sequence, $Y_X$ represents all possible labeling sequences, and $s(X, \tilde{y})$ represents the scores of the other paths;

the loss of the loss function is defined as:

$$\mathrm{loss}=-\log p(y \mid X)=\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)-s(X, y)$$

finally, the predicted sequence is decoded by the Viterbi algorithm to obtain the predicted labeling sequence $y^{*}$ with the maximum probability, expressed as:

$$y^{*}=\arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
as a further explanation, the entity recognition is performed on the text of the biological product production term to be recognized by using the second word vector model to obtain a recognition result, which specifically includes:
reading a production term text which needs entity recognition, and inputting the production term text into a trained word vector +BiLSTM +CRF model;
the production term text data is converted into word vectors by the continuous bag-of-words model, the word vectors undergo feature extraction through the BiLSTM neural network to obtain feature vectors with global information, and finally the most probable labeling sequence of each sentence in the text is obtained in the CRF by the Viterbi algorithm, which is the entity recognition result of the production terms.
By way of further illustration, the improvement of the k-means clustering algorithm is specifically: after the word vectors are normalized, a cosine-similarity distance $d_{\cos}$ is redefined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm; the principle of the improvement is as follows:

for the defined production term entities, let the word vectors of the corresponding production term entities after word vectorization be $\{e_1, e_2, \ldots, e_n\}$, and take any two word vectors $e_i$ and $e_j$; after the word vectors are normalized so that $\lVert e_i\rVert=\lVert e_j\rVert=1$, it can be deduced that:

$$d_E(e_i, e_j)=\sqrt{\sum_{k}\left(e_{ik}-e_{jk}\right)^2}=\sqrt{2-2\cos(e_i, e_j)}$$

wherein $d_E(e_i, e_j)$ is the Euclidean distance between $e_i$ and $e_j$, and $\cos(e_i, e_j)$ is the cosine similarity between $e_i$ and $e_j$; from this distance equivalence, the improved cosine-similarity distance $d_{\cos}$ is defined as:

$$d_{\cos}(e_i, e_j)=1-\cos(e_i, e_j)$$

this gives:

$$d_E(e_i, e_j)=\sqrt{2\, d_{\cos}(e_i, e_j)}$$

according to the criterion that the sum-of-squared-error criterion function decreases, a local optimal solution is solved iteratively starting from the initial word vectors, so as to find the k partitions that minimize the squared-error function value; the formula for minimizing the squared error is:

$$E=\sum_{i=1}^{k}\sum_{e \in C_i}\lVert e-\mu_i\rVert^{2}, \qquad \mu_i=\frac{1}{\lvert C_i\rvert}\sum_{e \in C_i} e$$

wherein $\mu_i$ is the mean vector of cluster $C_i$, and $E$ characterizes how tightly the entities in a cluster gather around the cluster mean vector; the smaller its value, the higher the similarity of the entities within the cluster, and since the word vectors are normalized, minimizing $E$ is equivalent to minimizing the improved cosine-similarity distance within each cluster.
By way of further illustration, the classification of entities of biopharmaceutical production terminology text is accomplished by comparing cosine similarity between each cluster and entity word vectors of the identified dataset, specifically including:
extracting, from each of the 20-50 clusters obtained on the data set, the 5-10 production term entities closest to the centroid; calculating the cosine similarity between the word vectors of these entities and the word vector of the production term entity to be classified in the test set, and taking the average of these cosine similarities as the cosine-similarity judgment value between the cluster and the entity to be classified; the production term entity to be classified is then assigned to the cluster with the largest cosine-similarity judgment value, thereby completing the classification task;
the cosine similarity calculation method comprises the following steps:
let the word vector of an entity in the training set be $e_a$ and the word vector of the production term entity to be classified be $e_b$; then the cosine similarity between $e_a$ and $e_b$ is calculated as:

$$\cos(e_a, e_b)=\frac{e_a\cdot e_b}{\lVert e_a\rVert\,\lVert e_b\rVert}$$

wherein $\cos(e_a, e_b)\in[-1, 1]$; the larger the value, the stronger the association between $e_a$ and $e_b$, i.e. the closer $\cos(e_a, e_b)$ is to 1, the more similar $e_a$ and $e_b$ are.
The beneficial effects of the invention are as follows: through the entity identification and classification method for biological product production terms, characters or words commonly used in biopharmaceutical production can be mapped into a vector space based on a deep learning method; the vector space is input into a neural network for feature extraction, and finally a CRF is combined for label prediction, so that a relatively accurate recognition result is output and a reasonable classification is performed.
Drawings
FIG. 1 is a flow chart of a method for identifying and classifying entities for use in terms of biological product production in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a CBOW model of an embodiment of the invention;
FIG. 3 is a block diagram of a BiLSTM neural network in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a CRF of an embodiment of the invention;
FIG. 5 is a block diagram of a word vector +BiLSTM +CRF neural network model, in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The embodiment provides a method for identifying and classifying entities for biological product production terms, the flow chart of which is shown in fig. 1, wherein the method comprises the following steps:
s1, carrying out word vector training on unlabeled corpus in biopharmaceutical production to obtain a first word vector model;
s2, manually labeling unlabeled corpus in biopharmaceutical production to construct a data set;
s3, constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
s4, performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
s5, clustering entity word vectors in the data set into 20-50 clusters through an improved k-means clustering algorithm, and comparing cosine similarity between each cluster and the entity word vectors of the identified data set to realize entity classification of the biopharmaceutical production term text.
In this embodiment, referring to fig. 2, word vector training may be performed on the unlabeled corpus in biopharmaceutical production by using the continuous bag-of-words model (CBOW) in Word2vec, and the corpus selects the characters or words of terms commonly used in biopharmaceutical production;
here, the continuous bag-of-words model predicts a center word by using the information of the 2c words before and after it, and the model can be expressed as:

$$\frac{1}{T}\sum_{t=1}^{T}\log p\left(w_t \mid w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}\right)$$

wherein t represents the current position, T represents the total number of positions, $w_t$ is the center word in the current production term text, and $w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}$ are the 2c words before and after the center word; the center word $w_t$ is predicted from the known 2c words by the continuous bag-of-words model, and the probability of the center word occurring is related to the 2c words before and after it.
It should be noted that, in this embodiment, the continuous bag-of-words model first applies one-hot encoding to the center word $w_t$, and the 2c words before and after it form the corresponding word vectors $x_{t-c},\ldots,x_{t-1},x_{t+1},\ldots,x_{t+c}$; these 2c word vectors are sent to the input layer, multiplied by a shared weight matrix in the projection layer, and sent to the output layer, which finally yields the probability $p\left(w_t \mid w_{t-c},\ldots,w_{t+c}\right)$ of the center word given the 2c surrounding words; training by maximum likelihood estimation yields the final word vector $e_i$ of each word $w_i$. For convenience of presentation, the word vectors trained in this way are $\{e_1, e_2, \ldots, e_n\}$, where n represents the number of word vectors.
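As an illustration of this step, the CBOW word vectors could be trained with the CBOW variant of gensim's Word2Vec implementation. This is only a sketch under the assumption that the corpus has already been segmented into token lists; the toy sentences and the hyperparameters shown (vector_size, window = c, min_count, epochs) are illustrative assumptions, not values prescribed by the invention.

```python
from gensim.models import Word2Vec

# Assumed example corpus: each sentence is a list of segmented biopharmaceutical
# production terms/words (e.g. "recombinant", "human", "interferon", ...).
sentences = [
    ["重组", "人", "干扰素", "发酵", "培养"],
    ["疫苗", "原液", "纯化", "工艺", "验证"],
]

# sg=0 selects the continuous bag-of-words (CBOW) model; window=c words on each
# side of the center word are used as context (2c context words in total).
cbow = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1, epochs=50)

vector = cbow.wv["干扰素"]   # the trained word vector e_i of one term
print(vector.shape)          # (100,)
```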
In this embodiment, the manually labeling the unlabeled corpus in the biopharmaceutical production to construct the data set specifically includes:
preprocessing data of original corpus, including deleting irrelevant content, special symbols, removing stop words, etc.;
according to the differences between actual production lines, the entity categories to be recognized are preliminarily determined; the entity categories comprise preventive biological products, therapeutic biological products, and in vivo/in vitro diagnostic products, wherein the preventive biological products include, for example, various vaccines, immunoglobulins, interferons, human coagulation factors and the like; the therapeutic biological products include, for example, antitoxins, human blood proteins, human interferons, human insulin, growth hormone, human epidermal growth factor and the like; and the in vivo/in vitro diagnostic products include, for example, protein derivatives, surface antigen detection reagents and the like;
in the labeling process, according to the characteristics of the text of the production term, the entities are labeled by adopting a BIO labeling method, the beginning part of the entity is represented by B (Begin), the non-beginning part of the entity is represented by I (Inside), and the non-entity part is represented by O (Outside).
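The BIO scheme described above can be illustrated with a short sketch that converts a character sequence and a known entity span into B/I/O tags; the sample phrase and the category label "PRE" (standing for a preventive biological product) are assumptions made only for illustration.

```python
def bio_tags(tokens, entity_start, entity_end, category):
    """Label tokens[entity_start:entity_end] as one entity of the given category."""
    tags = ["O"] * len(tokens)
    tags[entity_start] = "B-" + category           # beginning part of the entity
    for i in range(entity_start + 1, entity_end):  # non-beginning part of the entity
        tags[i] = "I-" + category
    return tags

tokens = list("流感疫苗原液灌装")   # character-level tokens of an assumed production phrase
print(bio_tags(tokens, 0, 4, "PRE"))
# ['B-PRE', 'I-PRE', 'I-PRE', 'I-PRE', 'O', 'O', 'O', 'O']
```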
It should be noted that, referring to fig. 3, fig. 4 and fig. 5, the construction of the word vector+bilstm+crf neural network model may specifically include:
inputting the word vector obtained by training into a BiLSTM (Bi-directional Long Short-Term Memory) neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity identification model are updated by the SGD (stochastic gradient descent) method; the specific method of updating the parameters of the entity identification model by SGD is as follows: a training sample is randomly drawn, the gradient of the error on this sample with respect to the parameters is calculated, and the parameter values are then updated continually in the negative gradient direction until the objective function reaches its minimum value and the iteration stops.
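For illustration only, a compact PyTorch sketch of such a word vector + BiLSTM + CRF model and one SGD update step is given below; it assumes the third-party pytorch-crf package for the CRF layer, and the vocabulary size, dimensions, number of tags and learning rate are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: the pytorch-crf package


class BiLSTMCRF(nn.Module):
    def __init__(self, embeddings, hidden_dim, num_tags):
        super().__init__()
        # Embedding layer initialized from the pre-trained CBOW word vectors.
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)
        self.bilstm = nn.LSTM(embeddings.size(1), hidden_dim,
                              batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(2 * hidden_dim, num_tags)  # emission scores P
        self.crf = CRF(num_tags, batch_first=True)             # learned transition matrix A

    def loss(self, token_ids, tags, mask):
        feats, _ = self.bilstm(self.embedding(token_ids))
        emissions = self.hidden2tag(feats)
        # pytorch-crf returns the log-likelihood; negate it to get the loss.
        return -self.crf(emissions, tags, mask=mask)

    def decode(self, token_ids, mask):
        feats, _ = self.bilstm(self.embedding(token_ids))
        return self.crf.decode(self.hidden2tag(feats), mask=mask)  # Viterbi paths


# One SGD step on a single randomly drawn (here: randomly generated) sample.
model = BiLSTMCRF(torch.randn(5000, 100), hidden_dim=128, num_tags=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
token_ids = torch.randint(0, 5000, (1, 12))
tags = torch.randint(0, 7, (1, 12))
mask = torch.ones(1, 12, dtype=torch.bool)
optimizer.zero_grad()
loss = model.loss(token_ids, tags, mask)
loss.backward()      # gradient of the error on this sample w.r.t. the parameters
optimizer.step()     # step in the negative gradient direction
```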
LSTM is a special recurrent neural network, and the states of the neurons of the BiLSTM neural network model are calculated by the following formulas:

$$f_t = \sigma\left(W_f\cdot[h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i\cdot[h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(W_c\cdot[h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t\odot c_{t-1} + i_t\odot\tilde{c}_t$$
$$o_t = \sigma\left(W_o\cdot[h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t\odot\tanh(c_t)$$

wherein $\sigma$ is the Sigmoid function, $x_t$ is the input word vector at the current moment, $h_{t-1}$ is the hidden-layer state at the previous moment, $f_t$ is the forget gate, which decides which information is forgotten, $i_t$ is the input gate, which decides which information is retained, $\tilde{c}_t$ is the intermediate state obtained from the current input word vector $x_t$, $c_t$ is the memory cell, which controls the change of the cell state, $c_{t-1}$ is the state value at the previous moment, $o_t$ is the output gate, i.e. the output value of the memory cell, $h_t$ is the hidden-layer state at the current moment, $W_f$ is the feedback connection matrix of the forget gate, $W_i$ the feedback matrix of the input gate, $W_c$ the feedback matrix of the hidden unit, $W_o$ the feedback matrix of the output gate, and $b_f$, $b_i$, $b_c$, $b_o$ are the thresholds of the forget gate, the input gate, the hidden-layer unit and the output gate, respectively.
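For readers who want to trace the gate equations above numerically, a single LSTM cell step can be written directly in NumPy as follows; the parameter matrices W and thresholds b are randomly initialized here purely for illustration and do not correspond to trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate formulas above.

    W and b hold the parameters {W_f, W_i, W_c, W_o} and {b_f, b_i, b_c, b_o};
    [h_prev, x_t] is the concatenated previous hidden state and current word vector.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # intermediate state
    c_t = f_t * c_prev + i_t * c_tilde          # memory cell state
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                    # hidden state at the current moment
    return h_t, c_t

rng = np.random.default_rng(0)
dim_x, dim_h = 100, 128
W = {k: rng.normal(scale=0.1, size=(dim_h, dim_h + dim_x)) for k in "fico"}
b = {k: np.zeros(dim_h) for k in "fico"}
h, c = np.zeros(dim_h), np.zeros(dim_h)
h, c = lstm_step(rng.normal(size=dim_x), h, c, W, b)
```

A BiLSTM simply runs this cell over the word vector sequence in both directions and concatenates the forward and backward hidden states at each position to form the global feature vector.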
In the actual application process, inputting the obtained feature vector with the context information into the CRF, extracting the dependency features among labels, and calculating a loss function, wherein the method specifically comprises the following steps:
given the word vector sequence $X=(x_1, x_2, \ldots, x_n)$ corresponding to the input words of the production term text and the predicted label sequence $y=(y_1, y_2, \ldots, y_n)$ corresponding to each input word, the prediction score $s(X, y)$ of y is defined as:

$$s(X, y)=\sum_{i=0}^{n} A_{y_i, y_{i+1}}+\sum_{i=1}^{n} P_{i, y_i}$$

wherein $A$ is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; $P$ is the probability score matrix, transformed from the feature matrix carrying the context information; $P_{i,j}$ is the probability that the i-th word is labeled with label j; and t is the number of predicted labels;

the probability of y is calculated from the defined prediction score according to the Softmax function:

$$p(y \mid X)=\frac{\exp\left(s(X, y)\right)}{\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)}$$

the log-likelihood function of this probability is:

$$\log p(y \mid X)=s(X, y)-\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)$$

wherein $y$ represents the actual labeling sequence, $Y_X$ represents all possible labeling sequences, and $s(X, \tilde{y})$ represents the scores of the other paths;

the loss of the loss function is defined as:

$$\mathrm{loss}=-\log p(y \mid X)=\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)-s(X, y)$$

finally, the predicted sequence is decoded by the Viterbi algorithm to obtain the predicted labeling sequence $y^{*}$ with the maximum probability, expressed as:

$$y^{*}=\arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
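To make the scoring, loss and decoding above easier to trace, they can be sketched for a single sentence in NumPy as follows; the emission matrix P and transition matrix A are random placeholders standing in for the BiLSTM outputs and the learned CRF parameters, and start/stop transitions are omitted for brevity.

```python
import numpy as np

def path_score(P, A, y):
    """s(X, y): sum of emission scores P[i, y_i] and transitions A[y_i, y_{i+1}]."""
    emit = sum(P[i, tag] for i, tag in enumerate(y))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_loss(P, A, y):
    """-log p(y|X) via a log-sum-exp forward pass over all label paths."""
    alpha = P[0].copy()                          # log-space forward scores
    for i in range(1, len(P)):
        alpha = P[i] + np.logaddexp.reduce(alpha[:, None] + A, axis=0)
    log_Z = np.logaddexp.reduce(alpha)           # log of the partition sum
    return log_Z - path_score(P, A, y)

def viterbi(P, A):
    """y* = argmax_y s(X, y), decoded with back-pointers."""
    score, back = P[0].copy(), []
    for i in range(1, len(P)):
        total = score[:, None] + A
        back.append(total.argmax(axis=0))
        score = P[i] + total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(back):
        best.append(int(bp[best[-1]]))
    return best[::-1]

rng = np.random.default_rng(1)
P = rng.normal(size=(6, 5))   # 6 words, 5 labels: emission (probability score) matrix
A = rng.normal(size=(5, 5))   # label-to-label transition matrix
y = [0, 1, 1, 4, 2, 0]
print(crf_loss(P, A, y) >= 0, viterbi(P, A))   # the loss is always non-negative
```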
additionally, the entity recognition is performed on the text of the biological product production term to be recognized by using the second word vector model to obtain a recognition result, which specifically comprises the following steps:
reading a production term text which needs entity recognition, and inputting the production term text into a trained word vector +BiLSTM +CRF model;
the production term text data is converted into word vectors by the continuous bag-of-words model, the word vectors undergo feature extraction through the BiLSTM neural network to obtain feature vectors with global information, and finally the most probable labeling sequence of each sentence is obtained in the CRF by the Viterbi algorithm, which is the entity recognition result of the production terms.
In this embodiment, the improvement of the k-means clustering algorithm specifically refers to: after the word vectors are normalized, a cosine-similarity distance $d_{\cos}$ is redefined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm; the principle of the improvement is as follows:

for the defined production term entities, let the word vectors of the corresponding production term entities after word vectorization be $\{e_1, e_2, \ldots, e_n\}$, and take any two word vectors $e_i$ and $e_j$; after the word vectors are normalized so that $\lVert e_i\rVert=\lVert e_j\rVert=1$, it can be deduced that:

$$d_E(e_i, e_j)=\sqrt{\sum_{k}\left(e_{ik}-e_{jk}\right)^2}=\sqrt{2-2\cos(e_i, e_j)}$$

wherein $d_E(e_i, e_j)$ is the Euclidean distance between $e_i$ and $e_j$, and $\cos(e_i, e_j)$ is the cosine similarity between $e_i$ and $e_j$; from this distance equivalence, the improved cosine-similarity distance $d_{\cos}$ is defined as:

$$d_{\cos}(e_i, e_j)=1-\cos(e_i, e_j)$$

this gives:

$$d_E(e_i, e_j)=\sqrt{2\, d_{\cos}(e_i, e_j)}$$

according to the criterion that the sum-of-squared-error criterion function decreases, a local optimal solution is solved iteratively starting from the initial word vectors, so as to find the k partitions that minimize the squared-error function value; the formula for minimizing the squared error is:

$$E=\sum_{i=1}^{k}\sum_{e \in C_i}\lVert e-\mu_i\rVert^{2}, \qquad \mu_i=\frac{1}{\lvert C_i\rvert}\sum_{e \in C_i} e$$

wherein $\mu_i$ is the mean vector of cluster $C_i$, and $E$ characterizes how tightly the entities in a cluster gather around the cluster mean vector; the smaller its value, the higher the similarity of the entities within the cluster, and since the word vectors are normalized, minimizing $E$ is equivalent to minimizing the improved cosine-similarity distance within each cluster.
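A minimal NumPy sketch of this modified k-means on normalized word vectors follows; it uses the cosine-similarity distance d_cos = 1 − cos in place of the Euclidean distance, and a random toy matrix of word vectors stands in for the real production term embeddings.

```python
import numpy as np

def cosine_kmeans(vectors, k, iters=100, seed=0):
    """k-means on L2-normalized word vectors using d_cos = 1 - cosine similarity."""
    rng = np.random.default_rng(seed)
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)   # normalization step
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        sims = X @ centroids.T                   # cosine similarity to each centroid
        assign = sims.argmax(axis=1)             # nearest centroid = smallest d_cos
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members):
                mean = members.mean(axis=0)      # cluster mean vector mu_j
                new_centroids[j] = mean / np.linalg.norm(mean)  # keep unit length
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sims = X @ centroids.T
    assign = sims.argmax(axis=1)
    sse = float(np.sum(1.0 - sims[np.arange(len(X)), assign]))    # total within-cluster d_cos
    return assign, centroids, sse

entity_vectors = np.random.default_rng(2).normal(size=(200, 100))  # placeholder embeddings
labels, centroids, sse = cosine_kmeans(entity_vectors, k=20)
print(np.bincount(labels, minlength=20), round(sse, 3))
```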
Finally, the entity classification of the text of the biopharmaceutical production term is realized by comparing the cosine similarity between each cluster and the entity word vector of the identified dataset, and specifically comprises the following steps:
extracting, from each of the 20-50 clusters obtained on the data set, the 5-10 production term entities closest to the centroid; calculating the cosine similarity between the word vectors of these entities and the word vector of the production term entity to be classified in the test set, and taking the average of these cosine similarities as the cosine-similarity judgment value between the cluster and the entity to be classified; the production term entity to be classified is then assigned to the cluster with the largest cosine-similarity judgment value, thereby completing the classification task;
the cosine similarity calculation method comprises the following steps:
let the word vector of an entity in the training set be $e_a$ and the word vector of the production term entity to be classified be $e_b$; then the cosine similarity between $e_a$ and $e_b$ is calculated as:

$$\cos(e_a, e_b)=\frac{e_a\cdot e_b}{\lVert e_a\rVert\,\lVert e_b\rVert}$$

wherein $\cos(e_a, e_b)\in[-1, 1]$; the larger the value, the stronger the association between $e_a$ and $e_b$, i.e. the closer $\cos(e_a, e_b)$ is to 1, the more similar $e_a$ and $e_b$ are.
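The classification rule just described can be sketched as follows; the placeholder member vectors, labels and centroids stand in for the output of the clustering sketch above, the value top_m = 5 is taken from the lower end of the 5-10 range stated above, and all vectors are assumed to be unit length so that dot products equal cosine similarities.

```python
import numpy as np

def classify_entity(query_vec, member_vectors, member_labels, centroids, top_m=5):
    """Assign query_vec to the cluster whose top_m centroid-nearest members have the
    largest average cosine similarity to the query (all vectors unit length)."""
    q = query_vec / np.linalg.norm(query_vec)
    best_cluster, best_sim = -1, -np.inf
    for j, mu in enumerate(centroids):
        members = member_vectors[member_labels == j]
        if len(members) == 0:
            continue
        order = np.argsort(members @ mu)[::-1][:top_m]  # entities closest to the centroid
        sim = float(np.mean(members[order] @ q))        # average cosine similarity
        if sim > best_sim:
            best_cluster, best_sim = j, sim
    return best_cluster, best_sim

# Placeholder data standing in for the clustering result of the previous sketch.
rng = np.random.default_rng(3)
member_vectors = rng.normal(size=(200, 100))
member_vectors /= np.linalg.norm(member_vectors, axis=1, keepdims=True)
member_labels = np.repeat(np.arange(20), 10)            # 20 clusters, 10 members each
centroids = np.stack([member_vectors[member_labels == j].mean(axis=0) for j in range(20)])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

cluster_id, score = classify_entity(rng.normal(size=100), member_vectors, member_labels, centroids)
print(cluster_id, round(score, 3))
```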
Therefore, the CBOW model used by the invention can perform unsupervised training on large-scale unlabeled data to obtain word vectors with strong semantic expression capability as the input of the subsequent model; moreover, because the BiLSTM neural network is introduced, the global features carrying the context semantic information in the production term text sequence are extracted; meanwhile, on the basis of the global features extracted by the BiLSTM, the dependency relationships between labels are learned through the CRF, so that the accuracy of entity recognition in the production term text is improved; finally, by redefining the cosine-similarity distance $d_{\cos}$, the original Euclidean distance calculation in the K-means algorithm is improved, making the method suitable for entity classification of biopharmaceutical production terms; the whole process is simple to operate and highly portable.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method for identifying and classifying entities for the production of biological products, characterized in that it comprises the following steps:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
the entity word vectors in the data set are clustered into 20-50 clusters through a modified k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and the entity word vectors of the identified data set.
2. The method for entity identification and classification of biological product production terms according to claim 1, wherein word vector training is performed on the unlabeled corpus in biopharmaceutical production using the continuous bag-of-words model in Word2vec, and the corpus selects the characters or words of terms commonly used in biopharmaceutical production;

the continuous bag-of-words model predicts a center word by using the information of the 2c words before and after it, and the model is expressed as:

$$\frac{1}{T}\sum_{t=1}^{T}\log p\left(w_t \mid w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}\right)$$

wherein t represents the current position, T represents the total number of positions, $w_t$ is the center word in the current production term text, and $w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}$ are the 2c words before and after the center word; the task at this point is to predict the center word $w_t$ from the known 2c words by the continuous bag-of-words model, and the probability of the center word occurring is related to the 2c words before and after it.
3. The method for entity identification and classification of biological product production terms according to claim 2, wherein the continuous bag-of-words model first applies one-hot encoding to the center word $w_t$, and the 2c words before and after it form the corresponding word vectors $x_{t-c},\ldots,x_{t-1},x_{t+1},\ldots,x_{t+c}$; these 2c word vectors are sent to the input layer, multiplied by a shared weight matrix in the projection layer, and sent to the output layer, which finally yields the probability $p\left(w_t \mid w_{t-c},\ldots,w_{t+c}\right)$ of the center word given the 2c surrounding words; training by maximum likelihood estimation yields the final word vector $e_i$ of each word $w_i$, and the trained word vectors are written uniformly as $\{e_1, e_2, \ldots, e_n\}$, where n represents the number of word vectors.
4. The method for identifying and classifying entities for biological product production terms according to claim 1, wherein the manually labeling unlabeled corpus in biopharmaceutical production to construct a dataset comprises:
preprocessing data of original corpus, including deleting irrelevant content, special symbols and removing stop words;
according to the differences between actual production lines, the entity categories to be recognized are preliminarily determined; the entity categories comprise preventive biological products, therapeutic biological products, and in vivo/in vitro diagnostic products;
in the labeling process, according to the characteristics of the text of the production term, the entities are labeled by adopting a BIO labeling method, the beginning part of the entity is represented by B, the non-beginning part of the entity is represented by I, and the non-entity part is represented by O.
5. The method for identifying and classifying entities for biological production terms according to claim 1, wherein said constructing a word vector +bilstm +crf neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity identification model are updated by the SGD (stochastic gradient descent) method; the specific method of updating the parameters of the entity identification model by SGD is as follows: a training sample is randomly drawn, the gradient of the error on this sample with respect to the parameters is calculated, and the parameter values are then updated continually in the negative gradient direction until the objective function reaches its minimum value and the iteration stops.
6. The method of claim 5, wherein the states of the BiLSTM neural network model neurons are calculated by the following formulas:

$$f_t = \sigma\left(W_f\cdot[h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i\cdot[h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(W_c\cdot[h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t\odot c_{t-1} + i_t\odot\tilde{c}_t$$
$$o_t = \sigma\left(W_o\cdot[h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t\odot\tanh(c_t)$$

wherein $\sigma$ is the Sigmoid function, $x_t$ is the input word vector at the current moment, $h_{t-1}$ is the hidden-layer state at the previous moment, $f_t$ is the forget gate, which decides which information is forgotten, $i_t$ is the input gate, which decides which information is retained, $\tilde{c}_t$ is the intermediate state obtained from the current input word vector $x_t$, $c_t$ is the memory cell, which controls the change of the cell state, $c_{t-1}$ is the state value at the previous moment, $o_t$ is the output gate, i.e. the output value of the memory cell, $h_t$ is the hidden-layer state at the current moment, $W_f$ is the feedback connection matrix of the forget gate, $W_i$ the feedback matrix of the input gate, $W_c$ the feedback matrix of the hidden unit, $W_o$ the feedback matrix of the output gate, and $b_f$, $b_i$, $b_c$, $b_o$ are the thresholds of the forget gate, the input gate, the hidden-layer unit and the output gate, respectively.
7. The method for identifying and classifying entities for biological production terms according to claim 5, wherein said inputting the obtained feature vector with context information to CRF, extracting the dependency features between labels, and calculating the loss function, comprises:
given the word vector sequence $X=(x_1, x_2, \ldots, x_n)$ corresponding to the input words of the production term text and the predicted label sequence $y=(y_1, y_2, \ldots, y_n)$ corresponding to each input word, the prediction score $s(X, y)$ of y is defined as:

$$s(X, y)=\sum_{i=0}^{n} A_{y_i, y_{i+1}}+\sum_{i=1}^{n} P_{i, y_i}$$

wherein $A$ is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; $P$ is the probability score matrix, transformed from the feature matrix carrying the context information; $P_{i,j}$ is the probability that the i-th word is labeled with label j; and t is the number of predicted labels;

the probability of y is calculated from the defined prediction score according to the Softmax function:

$$p(y \mid X)=\frac{\exp\left(s(X, y)\right)}{\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)}$$

the log-likelihood function of this probability is:

$$\log p(y \mid X)=s(X, y)-\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)$$

wherein $y$ represents the actual labeling sequence, $Y_X$ represents all possible labeling sequences, and $s(X, \tilde{y})$ represents the scores of the other paths;

the loss of the loss function is defined as:

$$\mathrm{loss}=-\log p(y \mid X)=\log\sum_{\tilde{y} \in Y_X}\exp\left(s(X, \tilde{y})\right)-s(X, y)$$

finally, the predicted sequence is decoded by the Viterbi algorithm to obtain the predicted labeling sequence $y^{*}$ with the maximum probability, expressed as:

$$y^{*}=\arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
8. the method for identifying and classifying entities of biological product production terms according to claim 1, wherein the identifying the entities of the biological product production term text to be identified by using the second word vector model to obtain an identification result specifically comprises:
reading a production term text which needs entity recognition, and inputting the production term text into a trained word vector +BiLSTM +CRF model;
the text data of the production term is converted into word vectors after passing through a continuous word bag model, the word vectors are subjected to feature extraction through a BiLSTM neural network to obtain feature vectors with global information, and finally, the maximum possible labeling sequence of each sentence is obtained in the CRF by adopting a Viterbi algorithm, namely, the entity recognition result of the production term is produced.
9. The method for identifying and classifying entities for biological production terms according to claim 1, characterized in that the k-means clustering algorithm is modified, specifically: after the word vectors are normalized, a cosine-similarity distance $d_{\cos}$ is redefined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm; the principle of the improvement is as follows:

for the defined production term entities, let the word vectors of the corresponding production term entities after word vectorization be $\{e_1, e_2, \ldots, e_n\}$, and take any two word vectors $e_i$ and $e_j$; after the word vectors are normalized so that $\lVert e_i\rVert=\lVert e_j\rVert=1$, it can be deduced that:

$$d_E(e_i, e_j)=\sqrt{\sum_{k}\left(e_{ik}-e_{jk}\right)^2}=\sqrt{2-2\cos(e_i, e_j)}$$

wherein $d_E(e_i, e_j)$ is the Euclidean distance between $e_i$ and $e_j$, and $\cos(e_i, e_j)$ is the cosine similarity between $e_i$ and $e_j$; from this distance equivalence, the improved cosine-similarity distance $d_{\cos}$ is defined as:

$$d_{\cos}(e_i, e_j)=1-\cos(e_i, e_j)$$

this gives:

$$d_E(e_i, e_j)=\sqrt{2\, d_{\cos}(e_i, e_j)}$$

according to the criterion that the sum-of-squared-error criterion function decreases, a local optimal solution is solved iteratively starting from the initial word vectors, so as to find the k partitions that minimize the squared-error function value; the formula for minimizing the squared error is:

$$E=\sum_{i=1}^{k}\sum_{e \in C_i}\lVert e-\mu_i\rVert^{2}, \qquad \mu_i=\frac{1}{\lvert C_i\rvert}\sum_{e \in C_i} e$$
10. The method for entity identification and classification of biopharmaceutical production terms according to claim 1, wherein the entity classification of biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and entity word vectors of the identified dataset, in particular comprising:
extracting, from each of the 20-50 clusters obtained on the data set, the 5-10 production term entities closest to the centroid; calculating the cosine similarity between the word vectors of these entities and the word vector of the production term entity to be classified in the test set, and taking the average of these cosine similarities as the cosine-similarity judgment value between the cluster and the entity to be classified; the production term entity to be classified is then assigned to the cluster with the largest cosine-similarity judgment value, thereby completing the classification task;
the cosine similarity calculation method comprises the following steps:
let the word vector of an entity in the training set be $e_a$ and the word vector of the production term entity to be classified be $e_b$; then the cosine similarity between $e_a$ and $e_b$ is calculated as:

$$\cos(e_a, e_b)=\frac{e_a\cdot e_b}{\lVert e_a\rVert\,\lVert e_b\rVert}.$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310665618.9A CN116401369B (en) | 2023-06-07 | 2023-06-07 | Entity identification and classification method for biological product production terms |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310665618.9A CN116401369B (en) | 2023-06-07 | 2023-06-07 | Entity identification and classification method for biological product production terms |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116401369A true CN116401369A (en) | 2023-07-07 |
CN116401369B CN116401369B (en) | 2023-08-11 |
Family
ID=87018329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310665618.9A Active CN116401369B (en) | 2023-06-07 | 2023-06-07 | Entity identification and classification method for biological product production terms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116401369B (en) |
- 2023-06-07 CN CN202310665618.9A patent/CN116401369B/en active Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298042A (en) * | 2019-06-26 | 2019-10-01 | 四川长虹电器股份有限公司 | Based on Bilstm-crf and knowledge mapping video display entity recognition method |
CN113191148A (en) * | 2021-04-30 | 2021-07-30 | 西安理工大学 | Rail transit entity identification method based on semi-supervised learning and clustering |
CN114091460A (en) * | 2021-11-24 | 2022-02-25 | 长沙理工大学 | Multitask Chinese entity naming identification method |
CN114996287A (en) * | 2022-06-20 | 2022-09-02 | 上海电器科学研究所(集团)有限公司 | Automatic equipment identification and capacity expansion method based on feature library |
CN115859164A (en) * | 2022-09-09 | 2023-03-28 | 第三维度(河南)软件科技有限公司 | Method and system for identifying and classifying building entities based on prompt |
CN115510864A (en) * | 2022-10-14 | 2022-12-23 | 昆明理工大学 | Chinese crop disease and pest named entity recognition method fused with domain dictionary |
CN116127084A (en) * | 2022-10-21 | 2023-05-16 | 中国农业大学 | Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method |
CN115859980A (en) * | 2022-11-24 | 2023-03-28 | 山东鲁软数字科技有限公司 | Semi-supervised named entity identification method, system and electronic equipment |
CN116187444A (en) * | 2023-03-01 | 2023-05-30 | 中国人民解放军国防科技大学 | K-means++ based professional field sensitive entity knowledge base construction method |
CN116186266A (en) * | 2023-03-06 | 2023-05-30 | 欧冶工业品股份有限公司 | BERT (binary image analysis) and NER (New image analysis) entity extraction and knowledge graph material classification optimization method and system |
Non-Patent Citations (3)
Title |
---|
CHUANHAI DONG ET AL.: "Character-based LSTM-CRF with Radical-level Features for Chinese Named Entity Recognition", Natural Language Understanding and Intelligent Applications 2016, pages 239-250 *
LIU YUE: "Improvement of the K-means Clustering Algorithm", China Master's Theses Full-text Database (Information Science and Technology), no. 2, pages 138-2336 *
LI HANGYU: "Research on Cross-domain and Cross-style Characteristics of Adaptive Features in Named Entity Recognition", Software, vol. 35, no. 10, pages 100-106 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117454892A (en) * | 2023-12-20 | 2024-01-26 | 深圳市智慧城市科技发展集团有限公司 | Metadata management method, device, terminal equipment and storage medium |
CN117454892B (en) * | 2023-12-20 | 2024-04-02 | 深圳市智慧城市科技发展集团有限公司 | Metadata management method, device, terminal equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116401369B (en) | 2023-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hasani et al. | Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN112614538A (en) | Antibacterial peptide prediction method and device based on protein pre-training characterization learning | |
CN113033249A (en) | Character recognition method, device, terminal and computer storage medium thereof | |
CN110110324B (en) | Biomedical entity linking method based on knowledge representation | |
Ju et al. | Fish species recognition using an improved AlexNet model | |
CN111476024A (en) | Text word segmentation method and device and model training method | |
CN116401369B (en) | Entity identification and classification method for biological product production terms | |
WO2010062268A1 (en) | A method for updating a 2 dimensional linear discriminant analysis (2dlda) classifier engine | |
Thomas et al. | A deep HMM model for multiple keywords spotting in handwritten documents | |
Kumar et al. | Future of machine learning (ML) and deep learning (DL) in healthcare monitoring system | |
CN113705238A (en) | Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN111581974A (en) | Biomedical entity identification method based on deep learning | |
WO2020108808A1 (en) | Method and system for classification of data | |
Rahman et al. | IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data | |
Chen et al. | DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data | |
CN113191150B (en) | Multi-feature fusion Chinese medical text named entity identification method | |
Wayahdi et al. | KNN and XGBoost Algorithms for Lung Cancer Prediction | |
CN114722798A (en) | Ironic recognition model based on convolutional neural network and attention system | |
CN118013038A (en) | Text increment relation extraction method based on prototype clustering | |
Missaoui et al. | Multi-stream continuous hidden Markov models with application to landmine detection | |
CN113312907A (en) | Remote supervision relation extraction method and device based on hybrid neural network | |
CN114692615B (en) | Small sample intention recognition method for small languages | |
CN116757195A (en) | Implicit emotion recognition method based on prompt learning | |
CN116386733A (en) | Protein function prediction method based on multi-view multi-scale multi-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |