CN116186266A - BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system - Google Patents

BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Info

Publication number
CN116186266A
Authority
CN
China
Prior art keywords
bert
model
training
classification
ner
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310215293.4A
Other languages
Chinese (zh)
Inventor
夏竟翔
沈达峰
朱俊
姚泽坤
闫晨光
李燕北
孙志强
戴智鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ouye Industrial Products Co ltd
Original Assignee
Ouye Industrial Products Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ouye Industrial Products Co ltd filed Critical Ouye Industrial Products Co ltd
Priority to CN202310215293.4A
Publication of CN116186266A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Abstract

The invention provides a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph, comprising the following steps: step S1: processing the basic text and cleaning the material data; step S2: extracting entity information from the material data using the NER model and marking the corresponding labels; step S3: training a BERT model based on the material data with the entity tags; step S4: embedding the material information with the trained BERT model, correcting and merging the material vector clusters using kmeans, and training the BERT classifier again until the classification accuracy of all categories is higher than a preset value. The invention uses the NER entity extraction model to extract the key entities in the material information in a structured way, which enriches the original material data and solves the problem that the BERT model struggles to focus on the important text information during training.

Description

BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system
Technical Field
The invention relates to the field of deep learning, and in particular to a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph.
Background
Numerous current material classification methods use BERT+LSTM, BERT+CNN, BERT+CRF, or other BERT variants, attempting to achieve better classification by making BERT more complex; however, the improvement from this approach is limited, typically below 5%, while the training cost rises significantly.
Patent document CN110413785B (application number: CN201910675003.8) discloses an automatic text classification method based on BERT and feature fusion, comprising: first cleaning the text data and converting the text into dynamic word vectors through BERT; extracting text features by passing the word vector sequences output by BERT to a CNN network and a BiLSTM network respectively; then concatenating the outputs of the CNN and BiLSTM networks for feature fusion; and finally outputting the final prediction probability vector through a fully connected layer and a softmax layer. Although this BERT+CNN+BiLSTM classification method is used, it greatly increases the model training cost while delivering only a limited improvement.
Patent document CN110334210A (application number: CN201910462751.8) discloses a Chinese sentiment analysis method based on the fusion of BERT with LSTM and CNN. The method comprises: preprocessing the Chinese corpora in a Chinese corpus dataset to obtain the corresponding sequences; extracting word embeddings for each sequence using a BERT model; extracting features of each sequence with BERT, LSTM and CNN to obtain the deep semantic features of the corresponding text; and classifying the deep semantic features with a softmax classifier to train and test the model, thereby realizing sentiment polarity prediction analysis. However, this invention likewise increases the model training cost with only a limited improvement.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph.
The invention provides a material classification optimization method based on BERT and NER entity extraction and knowledge graph, which comprises the following steps:
step S1: processing the basic text and cleaning the material data;
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
step S3: training a BERT model based on the material data with the entity tags;
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
Preferably, in said step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
Preferably, in said step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
Preferably, in said step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
Preferably, in said step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
According to the invention, the material classification optimization system based on BERT and NER entity extraction and knowledge graph comprises:
module M1: processing the basic text and cleaning the material data;
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
module M3: training a BERT model based on the material data with the entity tags;
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
Preferably, in said module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
Preferably, in said module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
Preferably, in said module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
Preferably, in said module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention uses the NER entity extraction model to extract the key entities in the material information in a structured way, which enriches the original material data and solves the problem that the BERT model struggles to focus on the important text information during training;
2. the invention merges or corrects leaf classes with kmeans clustering and iterates BERT again until the model performance can no longer be improved, which resolves the classification noise caused by an unreasonable material taxonomy and optimizes the BERT classification performance;
3. the method removes leaf classes with too few samples, reducing the risk of potential bias.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 shows the entity extraction tags and their meanings;
FIG. 3 is an example of NER extraction results.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1:
Materials are classified by extracting the key entities from the material information through NER (Named Entity Recognition) entity extraction in the industrial field, optimizing the BERT (Bidirectional Encoder Representations from Transformers) classification performance; clustering is carried out with kmeans (the k-means clustering algorithm) to merge and correct unreasonable leaf classes in the current material taxonomy, and optimizing the leaf-class system reduces the noise produced by misclassification and improves the BERT classification performance.
The material classification optimization method based on BERT and NER entity extraction and knowledge graph provided by the invention, as shown in FIGS. 1-3, comprises the following steps:
step S1: processing the basic text and cleaning the material data;
specifically, in the step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
specifically, in the step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
step S3: training a BERT model based on the material data with the entity tags;
specifically, in the step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
specifically, in the step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Example 2:
example 2 is a preferable example of example 1 to more specifically explain the present invention.
The invention also provides a material classification optimization system based on BERT and NER entity extraction and knowledge graph, which can be realized by executing the flow of the material classification optimization method described above; that is, those skilled in the art can understand the method as a preferred embodiment of the system.
According to the invention, the material classification optimization system based on BERT and NER entity extraction and knowledge graph comprises:
module M1: processing the basic text and cleaning the material data;
specifically, in the module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
specifically, in the module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
module M3: training a BERT model based on the material data with the entity tags;
specifically, in the module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
specifically, in the module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Example 3:
Example 3 is a preferable example of example 1 to more specifically explain the present invention.
Step 1: basic text processing
The model training sample data come from the commodity data in the company commodity database and include the commodity name, commodity description, and the major, middle and leaf class fields (the leaf class is the narrowest commodity category; for example, for the commodity "ZWZ single-row deep groove ball bearing 61964M-EG", the major class is "bearings", the middle class is "rolling bearings", and the leaf class is "single-row deep groove ball bearings"). The material names and descriptions are segmented using a manually annotated industrial-product word stock, and words with unclear semantics are removed. Leaf classes with too few samples are removed to reduce the risk of potential bias.
Step 2: entity extraction
And extracting entity information in the material data by using the NER model, and marking corresponding labels.
Step 3: BERT classification training
Based on the material data with the entity tags, the BERT classifier is trained.
Step 4: kmeans optimized classification system
Embedding the material information with the trained BERT model, correcting and merging the material vector clusters using kmeans, and training the BERT classifier again until the classification F1-score of every class is above the threshold or the classification performance stops improving.
kmeans is an unsupervised clustering method: given a fixed number of centers, it iterates by relocating the center positions according to the distribution of the points in the vector space until an optimal set of center points is found, and clusters the nearby points. In the material classification scenario, kmeans can re-partition the materials at the semantic level, ensuring that semantically similar materials fall into the same category and avoiding the redundant categories and uneven granularity caused by fully manual classification.
After kmeans clustering produces a result, the clusters are learned and predicted by BERT and the final prediction accuracy is output; when the accuracy does not reach the standard, the kmeans parameters are adjusted and the clustering is repeated.
Here the role of kmeans is to search for a semantically reasonable classification system, while BERT verifies whether the semantic features of that system are explicit enough to be predicted accurately. kmeans alone cannot serve as a classifier, and BERT, as a supervised model, cannot modify and iterate the classification system itself the way kmeans can; the two complement each other, making it possible to train a reliable classifier while optimizing the classification system.
The detailed steps are as follows:
The step 1 comprises the following steps:
Step 1.1: firstly, based on a long-accumulated word stock of the industrial field, the jieba word segmentation tool is used to segment the text in the training samples; meaningless stop words are removed after segmentation.
Step 1.2: counting the quantity of materials contained in each leaf class and deleting leaf classes with fewer than 20 samples, to prevent classification errors from worsening when BERT cannot learn the features of such leaf classes.
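The following Python sketch illustrates this cleaning step under stated assumptions: the dictionary and stop-word file names, the record layout, and the function names are hypothetical; only the jieba calls and the 20-sample cutoff come from the description above.

```python
# Minimal sketch of step 1; "industrial_dict.txt", "stopwords.txt" and the
# record layout are assumptions for illustration.
from collections import Counter

import jieba

jieba.load_userdict("industrial_dict.txt")  # domain word stock for segmentation

with open("stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = {line.strip() for line in f if line.strip()}

def clean_text(text: str) -> list[str]:
    """Segment text with jieba and drop stop words (step 1.1)."""
    return [tok for tok in jieba.lcut(text) if tok.strip() and tok not in STOPWORDS]

def filter_small_leaves(records: list[dict], min_samples: int = 20) -> list[dict]:
    """Delete leaf classes with fewer than min_samples materials (step 1.2)."""
    counts = Counter(r["leaf"] for r in records)
    return [r for r in records if counts[r["leaf"]] >= min_samples]
```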
The step 2 comprises the following steps:
Step 2.1: entity extraction is performed using the NER model; the entity extraction labels and their meanings are shown in FIG. 2:
The labeling data are used to train entity extraction based on the BERT. For example, for "oxygen stop valve DN80, model specification DN80JY41W-16T PN1.6", "oxygen stop valve" is labeled as a material and "DN80" and "JY41W-16T PN1.6" are labeled as model specifications. The labeling uses the BIOE method, and the labeled data are delivered to an NER model based on the BERT-CRF structure for training. The BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data the possible labels of each token in a specific context and position, and extracts the entities in the text by predicting these labels.
For example, when "oxygen stop valve DN80 model specification DN80JY41W-16T PN1.6" is input, the NER model predicts the first character of "oxygen stop valve" as a B-COM label, where B means beginning and COM means material; that is, the model recognizes it as the opening character of a material entity. The three middle characters are each predicted as I-COM labels, i.e. interior characters of the material entity, and the final character ("valve") is predicted as an E-COM label, i.e. the closing character of the material entity. Irrelevant characters are predicted as O labels.
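As an illustration of how such BIOE label sequences can be turned back into entities, here is a minimal, self-contained decoding sketch; the patent leaves this decoding step implicit (the BERT-CRF only outputs one label per character), and the function name and toy example are hypothetical.

```python
# Decode a per-character BIOE label sequence into (entity_text, entity_type)
# pairs, following the B-COM / I-COM / E-COM / O convention described above.
def decode_bioe(chars: list[str], labels: list[str]) -> list[tuple[str, str]]:
    entities, buf, etype = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):            # beginning character of an entity
            buf, etype = [ch], lab[2:]
        elif lab.startswith("I-") and buf:  # interior character
            buf.append(ch)
        elif lab.startswith("E-") and buf:  # closing character: emit the entity
            buf.append(ch)
            entities.append(("".join(buf), etype))
            buf, etype = [], None
        else:                               # "O" or a malformed sequence: reset
            buf, etype = [], None
    return entities

# Toy usage with a 5-character "material" entity:
# decode_bioe(list("VALVE"), ["B-COM", "I-COM", "I-COM", "I-COM", "E-COM"])
# -> [("VALVE", "COM")]
```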
Step 2.2: the results of the entity extraction are inserted into the raw material text data.
Taking the input "oxygen stop valve DN80; model specification DN80JY41W-16T PN1.6" as an example, NER yields the result shown in FIG. 3: 15 labels in total, whose meanings include low point, material, operation, brand and the like. The entities and their labels are written, interleaved in an entity-label format, into a txt file for the subsequent BERT model training.
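A minimal sketch of this tagging step might look as follows; the exact tag syntax and file layout are assumptions, since the original format string is garbled in the machine translation, and both function names are hypothetical.

```python
# Interleave extracted entities with their labels in the raw text (step 2.2).
def tag_material_text(text: str, entities: list[tuple[str, str]]) -> str:
    """Append '<label>' after every occurrence of each extracted entity."""
    for ent_text, ent_type in entities:
        text = text.replace(ent_text, f"{ent_text}<{ent_type}>")
    return text

def write_training_file(rows: list[tuple[str, str]], path: str = "bert_train.txt") -> None:
    """Write one 'tagged_text <TAB> leaf_class' line per material for BERT training."""
    with open(path, "w", encoding="utf-8") as f:
        for tagged_text, leaf in rows:
            f.write(f"{tagged_text}\t{leaf}\n")
```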
The step 3 comprises the following steps:
Step 3.1: a BERT classification model is trained on the dataset produced by the entity extraction in step 2.
BERT is a pre-trained language model; adding a softmax layer on top of BERT forms a BERT-based classifier.
By inputting the industrial material information and material classification data, BERT learns the language habits of the industrial materials field and the material classification method; the semantic embedding weight network is fine-tuned on the basis of the pre-trained model and the weight network of the classifier is trained, which on the one hand lets the model better understand industrial material information and on the other hand builds a classifier for the materials.
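A hedged fine-tuning sketch follows, using the Hugging Face transformers library as one possible implementation (the patent does not name a library); the checkpoint name, hyper-parameters, and dataset layout are assumptions.

```python
# Fine-tune a BERT + softmax classifier on (text, leaf_label_id) pairs;
# "bert-base-chinese", lr, batch size and epochs are illustrative choices.
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast

def train_bert_classifier(train_set, num_leaf_classes: int, epochs: int = 3):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=num_leaf_classes)  # one logit per leaf class
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(epochs):
        for texts, labels in DataLoader(train_set, batch_size=32, shuffle=True):
            batch = tokenizer(list(texts), padding=True, truncation=True,
                              max_length=128, return_tensors="pt")
            loss = model(**batch, labels=labels).loss  # cross-entropy over softmax logits
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```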
The step 4 comprises the following steps:
Step 4.1: after the BERT training finishes, the classification accuracy of BERT on every category can be observed. The F1-score is the harmonic mean of precision and recall; a threshold is set by examining the distribution of the F1-scores. The leaf classes whose classification performance is below the threshold are taken out, together with the misclassified leaf classes involved in those results; these leaf classes are considered poorly classified in this iteration and are clustered again.
These leaf classes are re-clustered with kmeans; the optimal number of clusters is found iteratively with reference to the elbow method and serves as the new partition of these leaf classes in the next iteration.
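A minimal sketch of this selection-and-reclustering logic, assuming the per-class F1-scores have already been computed and the BERT embeddings of the affected materials are available as a NumPy array; the 10% inertia-gain cutoff is one simple reading of the elbow heuristic, not the patent's prescription.

```python
import numpy as np
from sklearn.cluster import KMeans

def weak_classes(f1_per_class: dict[str, float], threshold: float) -> set[str]:
    """Leaf classes whose F1-score falls below the chosen threshold (step 4.1)."""
    return {cls for cls, f1 in f1_per_class.items() if f1 < threshold}

def elbow_kmeans(embeddings: np.ndarray, k_min: int = 2, k_max: int = 10) -> KMeans:
    """Grow k until the inertia gain becomes marginal (a simple elbow heuristic)."""
    best = KMeans(n_clusters=k_min, n_init=10, random_state=0).fit(embeddings)
    for k in range(k_min + 1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
        if best.inertia_ - km.inertia_ < 0.1 * best.inertia_:
            break  # less than 10% improvement: keep the previous fit as the elbow
        best = km
    return best
```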
Step 4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are considered for merging.
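A sketch of the center-distance computation, assuming each leaf class's cluster center is available; SciPy's cityblock metric is the Manhattan distance named above, and the function name is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def closest_leaf_pair(centers: dict[str, np.ndarray]) -> tuple[str, str]:
    """Return the two leaf classes whose centers are closest in Manhattan distance."""
    names = list(centers)
    mat = np.stack([centers[n] for n in names])
    dist = cdist(mat, mat, metric="cityblock")  # Manhattan (cityblock) distance
    np.fill_diagonal(dist, np.inf)              # ignore each center's self-distance
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return names[i], names[j]                   # candidate pair to merge
```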
Step 4.3: steps 2 and 3 are repeated for NER extraction and BERT training; the process iterates until either the F1-score of every class is above the threshold α or no further class correction can improve the model performance.
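Tying the pieces together, the outer loop might be sketched as below; run_ner_and_tag, evaluate_f1_per_class, recluster_and_merge and count_leaves are hypothetical glue functions standing in for steps 2 to 4.2, and alpha = 0.9 is an arbitrary illustrative threshold.

```python
def optimize_material_classification(records, alpha: float = 0.9, max_rounds: int = 10):
    """Iterate NER tagging, BERT training and kmeans correction until all F1 >= alpha."""
    model = None
    for _ in range(max_rounds):
        tagged = run_ner_and_tag(records)                 # step 2: NER extraction + tagging
        model = train_bert_classifier(tagged, count_leaves(records))  # step 3
        f1 = evaluate_f1_per_class(model, tagged)         # held-out per-class F1-scores
        weak = weak_classes(f1, alpha)
        if not weak:
            break                                         # every class is above alpha
        records = recluster_and_merge(records, weak)      # steps 4.1-4.2: kmeans correction
    return model
```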
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A material classification optimization method based on BERT and NER entity extraction and knowledge graph, characterized by comprising the following steps:
step S1: processing the basic text and cleaning the material data;
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
step S3: training a BERT model based on the material data with the entity tags;
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
2. The material classification optimization method according to claim 1, wherein in step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
3. The material classification optimization method according to claim 1, wherein in step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
4. The material classification optimization method according to claim 1, wherein in step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
5. The material classification optimization method according to claim 1, wherein in step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
6. A material classification optimization system based on BERT and NER entity extraction and knowledge graph, characterized by comprising:
module M1: processing the basic text and cleaning the material data;
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
module M3: training a BERT model based on the material data with the entity tags;
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
7. The material classification optimization system according to claim 6, wherein in the module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
8. The material classification optimization system according to claim 6, wherein in the module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
9. The material classification optimization system according to claim 6, wherein in the module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
10. The material classification optimization system according to claim 6, wherein in the module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
CN202310215293.4A 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system Pending CN116186266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215293.4A CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310215293.4A CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Publications (1)

Publication Number Publication Date
CN116186266A (en) 2023-05-30

Family

ID=86438347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215293.4A Pending CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Country Status (1)

Country Link
CN (1) CN116186266A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms
CN116401369B (en) * 2023-06-07 2023-08-11 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination