CN116186266A - BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system - Google Patents

BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Info

Publication number
CN116186266A
Authority
CN
China
Prior art keywords
bert
model
training
classification
ner
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310215293.4A
Other languages
Chinese (zh)
Inventor
夏竟翔
沈达峰
朱俊
姚泽坤
闫晨光
李燕北
孙志强
戴智鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ouye Industrial Products Co ltd
Original Assignee
Ouye Industrial Products Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ouye Industrial Products Co ltd filed Critical Ouye Industrial Products Co ltd
Priority to CN202310215293.4A
Publication of CN116186266A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Abstract

The invention provides a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph, comprising the following steps: step S1: processing the basic text and cleaning the material data; step S2: extracting entity information from the material data using the NER model and marking the corresponding labels; step S3: training a BERT model based on the material data with the entity tags; step S4: embedding the material information with the trained BERT model, correcting and merging the material vector clusters using kmeans, and training the BERT classifier again until the classification accuracy of all categories is higher than a preset value. The invention uses the NER entity extraction model to extract the key entities in the material information in a structured way, which enriches the original material data and solves the problem that the BERT model struggles to focus on the important text information during training.

Description

BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system
Technical Field
The invention relates to the field of deep learning, and in particular to a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph.
Background
Numerous current material classification methods use BERT+LSTM, BERT+CNN, BERT+CRF, or other BERT variants, attempting to achieve better classification by making BERT more complex; however, the improvement from this approach is limited, typically below 5%, while the training cost rises significantly.
Patent document CN110413785B (application number: CN201910675003.8) discloses an automatic text classification method based on BERT and feature fusion, comprising: first cleaning the text data and converting the text into dynamic word vectors through BERT; extracting text features by passing the word vector sequences output by BERT to a CNN network and a BiLSTM network respectively; then concatenating the outputs of the CNN and BiLSTM networks for feature fusion; and finally outputting the final prediction probability vector through a fully connected layer and a softmax layer. Although this BERT+CNN+BiLSTM classification method is used, it greatly increases the model training cost while delivering only a limited improvement.
Patent document CN110334210A (application number: CN201910462751.8) discloses a Chinese sentiment analysis method based on the fusion of BERT with LSTM and CNN. The method comprises: preprocessing the Chinese corpora in a Chinese corpus dataset to obtain the corresponding sequences; extracting word embeddings for each sequence using a BERT model; extracting features of each sequence with BERT, LSTM and CNN to obtain the deep semantic features of the corresponding text; and classifying the deep semantic features with a softmax classifier to train and test the model, thereby realizing sentiment polarity prediction analysis. However, this invention likewise increases the model training cost with only a limited improvement.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a material classification optimization method and system based on BERT and NER entity extraction and knowledge graph.
The invention provides a material classification optimization method based on BERT and NER entity extraction and knowledge graph, which comprises the following steps:
step S1: processing the basic text and cleaning the material data;
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
step S3: training a BERT model based on the material data with the entity tags;
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
Preferably, in said step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
Preferably, in said step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
Preferably, in said step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
Preferably, in said step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
According to the invention, the material classification optimization system based on BERT and NER entity extraction and knowledge graph comprises:
module M1: processing the basic text and cleaning the material data;
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
module M3: training a BERT model based on the material data with the entity tags;
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
Preferably, in said module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
Preferably, in said module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
Preferably, in said module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
Preferably, in said module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention uses the NER entity extraction model to extract the key entities in the material information in a structured way, which enriches the original material data and solves the problem that the BERT model struggles to focus on the important text information during training;
2. the invention merges or corrects leaf classes with kmeans clustering and iterates BERT again until the model performance can no longer be improved, which resolves the classification noise caused by an unreasonable material taxonomy and optimizes the BERT classification performance;
3. the method removes leaf classes with too few samples, reducing the risk of potential bias.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 shows the entity extraction tags and their meanings;
FIG. 3 is an example of NER extraction results.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1:
Materials are classified by extracting the key entities from the material information through NER (Named Entity Recognition) entity extraction in the industrial field, optimizing the BERT (Bidirectional Encoder Representations from Transformers) classification performance; clustering is carried out with kmeans (the k-means clustering algorithm) to merge and correct unreasonable leaf classes in the current material taxonomy, and optimizing the leaf-class system reduces the noise produced by misclassification and improves the BERT classification performance.
The material classification optimization method based on BERT and NER entity extraction and knowledge graph provided by the invention, as shown in FIGS. 1-3, comprises the following steps:
step S1: processing the basic text and cleaning the material data;
specifically, in the step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
specifically, in the step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
step S3: training a BERT model based on the material data with the entity tags;
specifically, in the step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
specifically, in the step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Example 2:
example 2 is a preferable example of example 1 to more specifically explain the present invention.
The invention also provides a material classification optimization system based on BERT and NER entity extraction and knowledge graph, which can be realized by executing the flow of the material classification optimization method described above; that is, those skilled in the art can understand the method as a preferred embodiment of the system.
According to the invention, the material classification optimization system based on BERT and NER entity extraction and knowledge graph comprises:
module M1: processing the basic text and cleaning the material data;
specifically, in the module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
specifically, in the module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
module M3: training a BERT model based on the material data with the entity tags;
specifically, in the module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
specifically, in the module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
Example 3:
Example 3 is a preferable example of example 1 to more specifically explain the present invention.
Step 1: basic text processing
The model training sample data come from the commodity data in the company commodity database and include the commodity name, commodity description, and the major, middle and leaf class fields (the leaf class is the narrowest commodity category; for example, for the commodity "ZWZ single-row deep groove ball bearing 61964M-EG", the major class is "bearings", the middle class is "rolling bearings", and the leaf class is "single-row deep groove ball bearings"). The material names and descriptions are segmented using a manually annotated industrial-product word stock, and words with unclear semantics are removed. Leaf classes with too few samples are removed to reduce the risk of potential bias.
Step 2: entity extraction
And extracting entity information in the material data by using the NER model, and marking corresponding labels.
Step 3: BERT classification training
Based on the material data with the entity tags, the BERT classifier is trained.
Step 4: kmeans optimized classification system
Embedding the material information with the trained BERT model, correcting and merging the material vector clusters using kmeans, and training the BERT classifier again until the classification F1-score of every class is above the threshold or the classification performance stops improving.
kmeans is an unsupervised clustering method: given a fixed number of centers, it iterates by relocating the center positions according to the distribution of the points in the vector space until an optimal set of center points is found, and clusters the nearby points. In the material classification scenario, kmeans can re-partition the materials at the semantic level, ensuring that semantically similar materials fall into the same category and avoiding the redundant categories and uneven granularity caused by fully manual classification.
After kmeans clustering produces a result, the clusters are learned and predicted by BERT and the final prediction accuracy is output; when the accuracy does not reach the standard, the kmeans parameters are adjusted and the clustering is repeated.
Here the role of kmeans is to search for a semantically reasonable classification system, while BERT verifies whether the semantic features of that system are explicit enough to be predicted accurately. kmeans alone cannot serve as a classifier, and BERT, as a supervised model, cannot modify and iterate the classification system itself the way kmeans can; the two complement each other, making it possible to train a reliable classifier while optimizing the classification system.
The detailed steps are as follows:
The step 1 comprises the following steps:
Step 1.1: firstly, based on a long-accumulated word stock of the industrial field, the jieba word segmentation tool is used to segment the text in the training samples; meaningless stop words are removed after segmentation.
Step 1.2: counting the quantity of materials contained in each leaf class and deleting leaf classes with fewer than 20 samples, to prevent classification errors from worsening when BERT cannot learn the features of such leaf classes.
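The following Python sketch illustrates this cleaning step under stated assumptions: the dictionary and stop-word file names, the record layout, and the function names are hypothetical; only the jieba calls and the 20-sample cutoff come from the description above.

```python
# Minimal sketch of step 1; "industrial_dict.txt", "stopwords.txt" and the
# record layout are assumptions for illustration.
from collections import Counter

import jieba

jieba.load_userdict("industrial_dict.txt")  # domain word stock for segmentation

with open("stopwords.txt", encoding="utf-8") as f:
    STOPWORDS = {line.strip() for line in f if line.strip()}

def clean_text(text: str) -> list[str]:
    """Segment text with jieba and drop stop words (step 1.1)."""
    return [tok for tok in jieba.lcut(text) if tok.strip() and tok not in STOPWORDS]

def filter_small_leaves(records: list[dict], min_samples: int = 20) -> list[dict]:
    """Delete leaf classes with fewer than min_samples materials (step 1.2)."""
    counts = Counter(r["leaf"] for r in records)
    return [r for r in records if counts[r["leaf"]] >= min_samples]
```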
The step 2 comprises the following steps:
Step 2.1: entity extraction is performed using the NER model; the entity extraction labels and their meanings are shown in FIG. 2:
The labeling data are used to train entity extraction based on the BERT. For example, for "oxygen stop valve DN80, model specification DN80JY41W-16T PN1.6", "oxygen stop valve" is labeled as a material and "DN80" and "JY41W-16T PN1.6" are labeled as model specifications. The labeling uses the BIOE method, and the labeled data are delivered to an NER model based on the BERT-CRF structure for training. The BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data the possible labels of each token in a specific context and position, and extracts the entities in the text by predicting these labels.
For example, when "oxygen stop valve DN80 model specification DN80JY41W-16T PN1.6" is input, the NER model predicts the first character of "oxygen stop valve" as a B-COM label, where B means beginning and COM means material; that is, the model recognizes it as the opening character of a material entity. The three middle characters are each predicted as I-COM labels, i.e. interior characters of the material entity, and the final character ("valve") is predicted as an E-COM label, i.e. the closing character of the material entity. Irrelevant characters are predicted as O labels.
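As an illustration of how such BIOE label sequences can be turned back into entities, here is a minimal, self-contained decoding sketch; the patent leaves this decoding step implicit (the BERT-CRF only outputs one label per character), and the function name and toy example are hypothetical.

```python
# Decode a per-character BIOE label sequence into (entity_text, entity_type)
# pairs, following the B-COM / I-COM / E-COM / O convention described above.
def decode_bioe(chars: list[str], labels: list[str]) -> list[tuple[str, str]]:
    entities, buf, etype = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):            # beginning character of an entity
            buf, etype = [ch], lab[2:]
        elif lab.startswith("I-") and buf:  # interior character
            buf.append(ch)
        elif lab.startswith("E-") and buf:  # closing character: emit the entity
            buf.append(ch)
            entities.append(("".join(buf), etype))
            buf, etype = [], None
        else:                               # "O" or a malformed sequence: reset
            buf, etype = [], None
    return entities

# Toy usage with a 5-character "material" entity:
# decode_bioe(list("VALVE"), ["B-COM", "I-COM", "I-COM", "I-COM", "E-COM"])
# -> [("VALVE", "COM")]
```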
Step 2.2: the results of the entity extraction are inserted into the raw material text data.
Taking the input "oxygen stop valve DN80; model specification DN80JY41W-16T PN1.6" as an example, NER yields the result shown in FIG. 3: 15 labels in total, whose meanings include low point, material, operation, brand and the like. The entities and their labels are written, interleaved in an entity-label format, into a txt file for the subsequent BERT model training.
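A minimal sketch of this tagging step might look as follows; the exact tag syntax and file layout are assumptions, since the original format string is garbled in the machine translation, and both function names are hypothetical.

```python
# Interleave extracted entities with their labels in the raw text (step 2.2).
def tag_material_text(text: str, entities: list[tuple[str, str]]) -> str:
    """Append '<label>' after every occurrence of each extracted entity."""
    for ent_text, ent_type in entities:
        text = text.replace(ent_text, f"{ent_text}<{ent_type}>")
    return text

def write_training_file(rows: list[tuple[str, str]], path: str = "bert_train.txt") -> None:
    """Write one 'tagged_text <TAB> leaf_class' line per material for BERT training."""
    with open(path, "w", encoding="utf-8") as f:
        for tagged_text, leaf in rows:
            f.write(f"{tagged_text}\t{leaf}\n")
```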
The step 3 comprises the following steps:
Step 3.1: a BERT classification model is trained on the dataset produced by the entity extraction in step 2.
BERT is a pre-trained language model; adding a softmax layer on top of BERT forms a BERT-based classifier.
By inputting the industrial material information and material classification data, BERT learns the language habits of the industrial materials field and the material classification method; the semantic embedding weight network is fine-tuned on the basis of the pre-trained model and the weight network of the classifier is trained, which on the one hand lets the model better understand industrial material information and on the other hand builds a classifier for the materials.
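A hedged fine-tuning sketch follows, using the Hugging Face transformers library as one possible implementation (the patent does not name a library); the checkpoint name, hyper-parameters, and dataset layout are assumptions.

```python
# Fine-tune a BERT + softmax classifier on (text, leaf_label_id) pairs;
# "bert-base-chinese", lr, batch size and epochs are illustrative choices.
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast

def train_bert_classifier(train_set, num_leaf_classes: int, epochs: int = 3):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=num_leaf_classes)  # one logit per leaf class
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(epochs):
        for texts, labels in DataLoader(train_set, batch_size=32, shuffle=True):
            batch = tokenizer(list(texts), padding=True, truncation=True,
                              max_length=128, return_tensors="pt")
            loss = model(**batch, labels=labels).loss  # cross-entropy over softmax logits
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```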
The step 4 comprises the following steps:
Step 4.1: after the BERT training finishes, the classification accuracy of BERT on every category can be observed. The F1-score is the harmonic mean of precision and recall; a threshold is set by examining the distribution of the F1-scores. The leaf classes whose classification performance is below the threshold are taken out, together with the misclassified leaf classes involved in those results; these leaf classes are considered poorly classified in this iteration and are clustered again.
These leaf classes are re-clustered with kmeans; the optimal number of clusters is found iteratively with reference to the elbow method and serves as the new partition of these leaf classes in the next iteration.
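A minimal sketch of this selection-and-reclustering logic, assuming the per-class F1-scores have already been computed and the BERT embeddings of the affected materials are available as a NumPy array; the 10% inertia-gain cutoff is one simple reading of the elbow heuristic, not the patent's prescription.

```python
import numpy as np
from sklearn.cluster import KMeans

def weak_classes(f1_per_class: dict[str, float], threshold: float) -> set[str]:
    """Leaf classes whose F1-score falls below the chosen threshold (step 4.1)."""
    return {cls for cls, f1 in f1_per_class.items() if f1 < threshold}

def elbow_kmeans(embeddings: np.ndarray, k_min: int = 2, k_max: int = 10) -> KMeans:
    """Grow k until the inertia gain becomes marginal (a simple elbow heuristic)."""
    best = KMeans(n_clusters=k_min, n_init=10, random_state=0).fit(embeddings)
    for k in range(k_min + 1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
        if best.inertia_ - km.inertia_ < 0.1 * best.inertia_:
            break  # less than 10% improvement: keep the previous fit as the elbow
        best = km
    return best
```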
Step 4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are considered for merging.
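A sketch of the center-distance computation, assuming each leaf class's cluster center is available; SciPy's cityblock metric is the Manhattan distance named above, and the function name is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def closest_leaf_pair(centers: dict[str, np.ndarray]) -> tuple[str, str]:
    """Return the two leaf classes whose centers are closest in Manhattan distance."""
    names = list(centers)
    mat = np.stack([centers[n] for n in names])
    dist = cdist(mat, mat, metric="cityblock")  # Manhattan (cityblock) distance
    np.fill_diagonal(dist, np.inf)              # ignore each center's self-distance
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return names[i], names[j]                   # candidate pair to merge
```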
Step 4.3: steps 2 and 3 are repeated for NER extraction and BERT training; the process iterates until either the F1-score of every class is above the threshold α or no further class correction can improve the model performance.
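Tying the pieces together, the outer loop might be sketched as below; run_ner_and_tag, evaluate_f1_per_class, recluster_and_merge and count_leaves are hypothetical glue functions standing in for steps 2 to 4.2, and alpha = 0.9 is an arbitrary illustrative threshold.

```python
def optimize_material_classification(records, alpha: float = 0.9, max_rounds: int = 10):
    """Iterate NER tagging, BERT training and kmeans correction until all F1 >= alpha."""
    model = None
    for _ in range(max_rounds):
        tagged = run_ner_and_tag(records)                 # step 2: NER extraction + tagging
        model = train_bert_classifier(tagged, count_leaves(records))  # step 3
        f1 = evaluate_f1_per_class(model, tagged)         # held-out per-class F1-scores
        weak = weak_classes(f1, alpha)
        if not weak:
            break                                         # every class is above alpha
        records = recluster_and_merge(records, weak)      # steps 4.1-4.2: kmeans correction
    return model
```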
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A material classification optimization method based on BERT and NER entity extraction and knowledge graph, characterized by comprising the following steps:
step S1: processing the basic text and cleaning the material data;
step S2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
step S3: training a BERT model based on the material data with the entity tags;
step S4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
2. The material classification optimization method according to claim 1, wherein in step S1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
step S1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
step S1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
3. The material classification optimization method according to claim 1, wherein in step S2:
step S2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
step S2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
4. The material classification optimization method according to claim 1, wherein in step S3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
5. The material classification optimization method according to claim 1, wherein in step S4:
step S4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
step S4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
step S4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
6. A material classification optimization system based on BERT and NER entity extraction and knowledge graph, characterized by comprising:
module M1: processing the basic text and cleaning the material data;
module M2: extracting entity information from the material data using the NER model and marking the corresponding entity labels;
module M3: training a BERT model based on the material data with the entity tags;
module M4: embedding the material information with the trained BERT model, correcting and merging the material clusters using kmeans, and training the BERT model again until the classification accuracy of all categories is higher than a preset value.
7. The material classification optimization system according to claim 6, wherein in the module M1:
model training sample data include the name, description, and major, middle and leaf class fields; the name and description data are segmented using a manually annotated word stock, words whose semantics do not meet the preset standard are removed, and leaf classes with fewer samples than the preset standard are removed:
module M1.1: based on the word stock, segmenting the text in the training samples with the jieba word segmentation tool, and removing stop words that do not meet the preset standard after segmentation;
module M1.2: counting the quantity of materials contained in each leaf class, and deleting leaf classes whose sample quantity is smaller than a preset value.
8. The material classification optimization system according to claim 6, wherein in the module M2:
module M2.1: entity extraction using the NER model:
training BERT-based entity extraction with annotated data labeled using the BIOE method, and feeding the labeled data to an NER model based on the BERT-CRF structure for training; the BERT-CRF embeds each token together with its position information in the sentence, learns from the labeled data which labels each token can take in a specific context and position, and extracts the entities in the text by predicting these labels;
module M2.2: inserting the entity extraction results into the original material text data; the labels obtained by NER, whose meanings include low point, material, operation and brand, are written into txt files for the subsequent BERT model training.
9. The material classification optimization system according to claim 6, wherein in the module M3:
training a BERT classification model on the dataset produced by the entity extraction;
the BERT-based classifier is formed by BERT plus a softmax layer; by inputting industrial material information and material classification data, BERT learns the language habits and the material classification method, the semantic embedding weight network is fine-tuned on the basis of the pre-trained model, and the weight network of the classifier is trained.
10. The material classification optimization system according to claim 6, wherein in the module M4:
module M4.1: after BERT training finishes, observing the classification accuracy of BERT on every category; a threshold is set on the F1-score, the harmonic mean of precision and recall; the leaf classes whose classification performance is below the threshold are taken out together with the misclassified leaf classes involved in those results, clustered again with kmeans, and the optimal number of clusters is found iteratively and serves as the new classification in the next iteration;
module M4.2: kmeans clusters by searching for cluster centers, and each leaf class yields a corresponding cluster center; the inter-class center distance is computed with the Manhattan distance, and the two leaf classes with the smallest center distance are selected for merging;
module M4.3: NER extraction and BERT training are performed again, iterating until the F1-score of every class is above a threshold alpha or the model performance reaches the preset standard.
CN202310215293.4A 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system Pending CN116186266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215293.4A CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310215293.4A CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Publications (1)

Publication Number Publication Date
CN116186266A (en) 2023-05-30

Family

ID=86438347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215293.4A Pending CN116186266A (en) 2023-03-06 2023-03-06 BERT (Bidirectional Encoder Representations from Transformers) and NER (Named Entity Recognition) entity extraction and knowledge graph material classification optimization method and system

Country Status (1)

Country Link
CN (1) CN116186266A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401369A (en) * 2023-06-07 2023-07-07 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms
CN116401369B (en) * 2023-06-07 2023-08-11 佰墨思(成都)数字技术有限公司 Entity identification and classification method for biological product production terms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination