CN110619363A - Classification method for subclass names corresponding to long description of material data - Google Patents
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/08—Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
- G06Q10/087—Inventory or stock management, e.g. order filling, procurement or balancing against orders
Abstract
The invention discloses a classification method for the subclass names corresponding to long descriptions of material data. The classification of material-data subclasses can accurately analyze the problems in the data, such as inconsistent letter case, mixed full-width and half-width characters, connectors, non-uniform units, and similar pronunciations; apply a reasonable data preprocessing process to normalize and standardize the data; convert the data into feature-vector form; and classify it using logistic regression with L2 regularization and L-BFGS optimization.
Description
Technical Field
The invention relates to the technical field of material data classification, in particular to a classification method for subclasses corresponding to long description of material data.
Background
Material master data contain descriptions of all the materials an enterprise purchases, produces, and keeps in inventory. They form a material database that relates material records to material information (e.g., inventory levels) in the enterprise. Integrating all material data into a single database eliminates data redundancy and allows the purchasing department, as well as other departments (e.g., inventory management, material planning and control, invoice verification), to share the data. Material classification means grouping materials with the same natural attributes according to a certain ordering and combination. The classification process should follow the basic standard of classifying by natural attributes as far as possible; existing material classification is inefficient and prone to classification errors.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. The invention therefore aims to provide a classification method for the subclass names corresponding to long descriptions of material data.
The classification method for the subclass names corresponding to long descriptions of material data comprises the following steps:
S1: raw material data: read in the raw material data;
S2: data preprocessing: preprocess the read-in raw material data and standardize it;
S3: category-to-number: encode the category column of the raw material data into numbers;
S4: sample-set division: divide the sample set into a training set and a test set;
S5: feature vectorization: convert the long material descriptions into feature vectors;
S6: classification: learn an objective function that maps each feature vector to a predefined class label;
S7: classification-result evaluation: evaluate the classification results by accuracy, recall, and F1 value.
S2 comprises the following steps:
S21: unify the units and connectors in the raw material data;
S22: remove brackets and slashes;
S23: perform Chinese word segmentation, then convert the text to pinyin;
S24: convert upper case to lower case and full-width characters to half-width.
The raw material data in S3 consist of the long material descriptions and the subclass names.
In S4, the sample set is divided so that the ratio of training-set samples to test-set samples is 7:3.
The feature-vectorization method in S5 is the tf-idf algorithm.
The long material description in S5 is material text data.
The classification methods in S6 include logistic regression, naive Bayes, decision trees, support vector machines, K-nearest neighbors, random forests, GBDT, XGBoost, neural networks, and the like.
The metrics for evaluating the classification results in S7 include accuracy, recall, and the F1 value.
The beneficial effects of the invention are as follows: the classification of material-data subclasses can accurately analyze the problems existing in the data, such as inconsistent letter case, mixed full-width and half-width characters, connectors, non-uniform units, and similar pronunciations; apply a reasonable data preprocessing process to normalize and standardize the data; convert the data into feature-vector form; and classify it with the logistic regression + L2 regularization + L-BFGS optimization method.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of the classification method for subclass names corresponding to long descriptions of material data provided by the present invention;
FIG. 2 is a flow chart of the data preprocessing in the classification method for subclass names corresponding to long descriptions of material data provided by the present invention;
FIG. 3 is a flow chart of an example of the data preprocessing in the classification method for subclass names corresponding to long descriptions of material data provided by the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views, and merely illustrate the basic structure of the present invention, and therefore, they show only the components related to the present invention.
Referring to FIGS. 1-2, the method for classifying the subclass names corresponding to long descriptions of material data comprises the following steps:
S1: raw material data: read in the raw material data;
S2: data preprocessing: preprocess the read-in raw material data and standardize it;
S3: category-to-number: encode the category column of the raw material data into numbers;
S4: sample-set division: divide the sample set into a training set and a test set;
S5: feature vectorization: convert the long material descriptions into feature vectors;
S6: classification: learn an objective function that maps each feature vector to a predefined class label;
S7: classification-result evaluation: evaluate the classification results by accuracy, recall, and F1 value.
S2 comprises the following steps:
S21: unify the units and connectors in the raw material data;
S22: remove brackets and slashes;
S23: perform Chinese word segmentation, then convert the text to pinyin;
S24: convert upper case to lower case and full-width characters to half-width.
The raw material data in S3 are the long material descriptions and the subclass names.
In S4, the sample set is divided so that the ratio of training-set samples to test-set samples is 7:3.
The feature-vectorization method in S5 is the tf-idf algorithm.
The long material description in S5 is material text data.
The classification methods in S6 include logistic regression, naive Bayes, decision trees, support vector machines, K-nearest neighbors, random forests, GBDT, XGBoost, and neural networks.
The metrics for evaluating the classification results in S7 are accuracy, recall, and the F1 value.
Data preprocessing:
Because the material data suffer from inconsistent letter case, mixed full-width and half-width characters, non-uniform multiplier signs, spaces, underscores and dashes, non-uniform measurement units, varying input word order, similar pronunciations, and other problems, the data are preprocessed, normalized, and standardized before being converted into feature vectors.
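The normalization steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the regular expressions and the NFKC full-width-to-half-width trick are my assumptions, and the unit table as well as the Chinese-word-segmentation and pinyin steps (which would need libraries such as jieba and pypinyin) are omitted.

```python
import re
import unicodedata

def preprocess(desc: str) -> str:
    """Normalize one long material description (illustrative only)."""
    s = unicodedata.normalize("NFKC", desc)  # full-width -> half-width
    s = s.lower()                            # upper case -> lower case
    s = s.replace("×", "*")                  # unify the multiplier sign
    s = s.replace("_", " ")                  # unify underscores
    s = re.sub(r"[()\[\]{}（）/\\]", " ", s)  # remove brackets and slashes
    s = re.sub(r"\s+", " ", s).strip()       # collapse repeated spaces
    return s
```

For example, `preprocess("Bearing（N40/50）")` yields `"bearing n40 50"` under these assumptions.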
Example 2.1:
The long material description is "radial bearing \ N40/50/20T6540 tilting pad"; the results of the preprocessing process are as follows:
example 2.2:
The long descriptions and subclass names of the original material data are as follows:
The preprocessed long material descriptions are as follows:
kebian danhuang zhijia df07kfa116 2327n 2747n 9↑q 321002jda zuhe jian
shimian xiangjiaodian pian cl300 dn25 xb350 gaf sh3401
wufeng santong dn50*dn50 sch120 sch120 sh t3408 15crmo gb9948
shourong redianou redianou wrp–131 0–1600s xing l=900
shourong ruhua beng yeya guan 32*5m
Category-to-number:
To facilitate the classification task, the category column is encoded into numbers.
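A minimal sketch of this encoding step. The actual code table used by the method is not given in the text, so the category names below are invented examples:

```python
def encode_labels(labels):
    """Map each distinct category string to a stable integer code."""
    mapping = {}
    for name in labels:
        if name not in mapping:
            mapping[name] = len(mapping)  # assign codes in order of first appearance
    codes = [mapping[name] for name in labels]
    return codes, mapping

# Hypothetical subclass names, for illustration only
codes, mapping = encode_labels(["gasket", "bearing", "gasket", "valve"])
```

Here `codes` becomes `[0, 1, 0, 2]`, and `mapping` can later be inverted to recover the predicted subclass name from a numeric label.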
Example 3.1:
The subclass names of the raw material data are encoded into numbers:
Sample-set division:
A test sample set is typically required to estimate the generalization error of a classifier. The sample set is therefore divided into a training set and a test set; after the classifier is trained on the training set, the test error on the test set serves as an approximation of the generalization error. In the invention, the ratio of training-set samples to test-set samples is 7:3.
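The 7:3 split can be sketched as follows. The fixed seed is an arbitrary choice for reproducibility, not part of the described method:

```python
import random

def split_samples(samples, train_ratio=0.7, seed=42):
    """Shuffle the sample indices, then split train_ratio : (1 - train_ratio)."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)         # local RNG so the split is repeatable
    cut = int(len(samples) * train_ratio)
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

train, test = split_samples(list(range(100)))
```

With 100 samples this produces a 70-sample training set and a 30-sample test set, with no sample lost or duplicated.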
Feature vectorization:
The input of the classification task must be a real-valued vector, so the long material description (text data) is converted into a feature vector. The main text-vectorization methods are the bag-of-words model and the tf-idf algorithm. Considering the characteristics of material data, the invention adopts the tf-idf algorithm for feature vectorization.
The tf-idf algorithm is a statistical method for assessing the importance of a word to a document in a document set or corpus. Its main idea is: if a word occurs with high frequency (tf) in one document but rarely occurs in other documents, the word is considered to have good class-distinguishing ability and is suitable for classification. The tf-idf algorithm is widely applied in search engines, keyword extraction, text similarity, text summarization, and the like.
(1) The term frequency (tf) is the frequency with which a word occurs in a document:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of word i in document D_j, and Σ_k n_{k,j} is the total number of word occurrences in D_j.
(2) The inverse document frequency (idf) is the logarithm of the total number of documents divided by the number of documents containing the word:

idf_i = log( |D| / (1 + |{ j : w_i ∈ D_j }|) )

where |D| is the total number of documents in the corpus and |{ j : w_i ∈ D_j }| is the number of documents containing the word w_i. The 1 in the denominator prevents division by zero when the word does not occur in the corpus.
The fewer documents contain the word w, the larger its idf value, and the better its category-distinguishing ability.
(3) tf-idf=tf×idf
A term that occurs with high frequency in a particular document but has a low document frequency across the collection receives a high tf-idf weight. tf-idf therefore tends to filter out common words and preserve important ones.
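The two formulas can be transcribed directly into Python. This is an illustrative pure-Python sketch (the invention's actual implementation is not shown); it uses the +1-smoothed idf denominator from the text, which makes the idf of a word present in every document slightly negative:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {word: tf-idf weight} dicts."""
    N = len(docs)
    df = Counter()                        # number of documents containing each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())      # total word occurrences in D_j
        vec = {w: (n / total) * math.log(N / (1 + df[w]))
               for w, n in counts.items()}
        vectors.append(vec)
    return vectors

# Toy tokens loosely modelled on the preprocessed descriptions above
vecs = tfidf([["dn50", "sch120", "sch120"], ["dn50", "xb350"]])
```

A production system would more likely use a library vectorizer (e.g. scikit-learn's `TfidfVectorizer`, whose smoothing differs slightly), but the sketch matches the formulas as stated.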
Example 5.1:
Preprocessed material data:
kebian danhuang zhijia df07kfa116 2327n 2747n 9↑q 321002jda zuhe jian
shimian xiangjiaodian pian cl300 dn25 xb350 gaf sh3401
wufeng santong dn50*dn50 sch120 sch120 sh t3408 15crmo gb9948
shourong redianou redianou wrp–1310–1600s xing l=900
shourong ruhua beng yeya guan 32*5m
Expressed in the form of a feature vector:
[0 0 0 0 0 0 0 0 0 0 0 0.35355339 0 0 0.35355339 0 0.35355339 0 0 0 0 0 0.35355339 0 0 0 0 0 0 0 0.35355339 0.35355339 0 0 0 0 0.35355339 0.35355339 0 0 0 0]
[0.2811506 0.2811506 0 0.2811506 0 0 0 0 0 0 0.2811506 0 0 0 0 0 0 0 0 0 0 0 0.2811506 0 0 0.5623012 0 0.2811506 0 0 0 0 0 0.22683053 0 0.2811506 0 0 0 0.2811506 0 0 0]
[0 0 0 0 0 0 0.38775666 0 0.38775666 0 0 0.38775666 0 0 0 0 0 0 0 0.38775666 0 0 0 0 0 0 0.38775666 0 0 0 0 0 0 0.31283963 0 0 0 0 0 0 0.38775666 0 0]
[0 0 0.26726124 0 0 0 0 0 0 0 0 0 0 0 0 0 0.53452248 0 0.26726124 0 0 0 0 0 0 0 0 0 0.26726124 0.53452248 0.26726124 0 0 0 0.26726124 0 0.26726124 0 0 0 0 0 0 0]
[0 0 0 0 0.30151134 0.30151134 0 0.30151134 0 0.30151134 0 0 0 0.30151134 0.30151134 0 0 0 0 0 0.30151134 0.30151134 0 0 0.30151134 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.30151134 0.30151134]
Classification:
The classification task is to learn an objective function that maps each feature vector x to a predefined class label y_i.
The current mainstream classification methods include logistic regression, naive Bayes, decision trees, support vector machines, K-nearest neighbors, random forests, GBDT, XGBoost, neural networks, and the like. After fully considering the characteristics of the material data, the invention adopts logistic regression with an added L2 regularization term, solved iteratively with the L-BFGS algorithm.
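A compact way to reproduce this setup is to minimize the L2-regularized negative log-likelihood with SciPy's L-BFGS implementation. The sketch below is an illustration on a tiny synthetic data set, not the invention's code; the name `lam` stands in for the regularization weight λ, and for brevity the bias is penalized along with the weights:

```python
import numpy as np
from scipy.optimize import minimize

def fit_logreg(X, y, lam=1.0):
    """Binary logistic regression with an L2 penalty, solved by L-BFGS."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])       # append a bias column

    def loss_grad(w):
        z = Xb @ w
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid probabilities
        eps = 1e-12                            # guard the logs
        nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        loss = nll + 0.5 * lam * (w @ w)       # NLL + (lam/2) ||w||^2
        grad = Xb.T @ (p - y) + lam * w
        return loss, grad

    res = minimize(loss_grad, np.zeros(d + 1), jac=True, method="L-BFGS-B")
    return res.x

# Tiny separable example: the label is 1 when the single feature exceeds 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logreg(X, y, lam=0.1)
preds = (1 / (1 + np.exp(-(np.hstack([X, np.ones((4, 1))]) @ w))) > 0.5)
```

scikit-learn's `LogisticRegression(penalty="l2", solver="lbfgs")` packages the same combination, extended to multi-class problems via one-vs-rest or multinomial loss.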
Classification-result evaluation:
The main metrics for evaluating the classification result are accuracy, recall, and the F1 value.
(1) Accuracy
Accuracy is, as the name implies, the proportion of correctly classified samples among all samples:

accuracy = (TP + TN) / (TP + TN + FP + FN)

(2) Recall
Recall (also called the recall rate) is the proportion of positive samples that are correctly classified:

recall = TP / (TP + FN)

where TP is the number of correctly classified positive samples and FN is the number of positive samples incorrectly classified as negative.
(3) F1 value
The F1 value is the harmonic mean of precision and recall:

F1 = 2 · precision · recall / (precision + recall), with precision = TP / (TP + FP)

where FP is the number of negative samples incorrectly classified as positive.
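The three metrics follow directly from the confusion counts TP, FP, FN, and TN. The pure-Python sketch below covers the binary case; multi-class evaluation would average these quantities per class, which the text does not detail:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# One TP, one FN, one TN, one FP -> every metric equals 0.5
acc, prec, rec, f1 = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```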
Example 7.1:
To evaluate and compare the classification effect of the methods on material data sets, logistic regression, naive Bayes, decision tree, support vector machine, K-nearest neighbors, random forest, and XGBoost classifiers were applied to a set of 50000 material data records (1995 subclass categories in total) and a set of 20564 material data records (1213 subclass categories in total); the classification metrics on the test sets are shown in the following tables.
| Method | Accuracy | Recall | F1 value |
|---|---|---|---|
| Logistic regression | 0.88 | 0.90 | 0.89 |
| Naive Bayes | 0.60 | 0.65 | 0.57 |
| Decision tree | 0.84 | 0.82 | 0.82 |
| Support vector machine | 0.06 | 0.13 | 0.17 |
| K-nearest neighbors | 0.84 | 0.82 | 0.82 |
| Random forest | 0.89 | 0.89 | 0.88 |
| XGBoost | 0.67 | 0.73 | 0.69 |

The table above compares the results of the different classification methods on the 50000-record material data set.
| Method | Accuracy | Recall | F1 value |
|---|---|---|---|
| Logistic regression | 0.88 | 0.90 | 0.89 |
| Naive Bayes | 0.64 | 0.73 | 0.65 |
| Decision tree | 0.87 | 0.89 | 0.87 |
| Support vector machine | 0.18 | 0.22 | 0.18 |
| K-nearest neighbors | 0.82 | 0.82 | 0.80 |
| Random forest | 0.86 | 0.84 | 0.84 |
| XGBoost | 0.69 | 0.73 | 0.71 |

The table above compares the results of the different classification methods on the 20564-record material data set.
The two tables show that the logistic regression + L2 regularization + L-BFGS method adopted by the invention outperforms the other classification methods on average.
The logistic regression model classifies by estimating probabilities. A latent variable y is assumed to represent the likelihood that the event under study occurs; it ranges over the whole real line, and the larger its value, the more likely the event. Logistic regression models are widely applied in economic forecasting, severe-weather prediction, and medical diagnosis support.
For the material data classification problem, the event under study is that a long material description belongs to a certain subclass. Logistic regression is used to analyze the association between the features of the material data (i.e., the words in the long description) and the subclass categories, so as to predict the subclass to which a record belongs.
For binary classification, let the independent variable x denote the features of the long description and let y_i indicate whether the description belongs to subclass i: y_i = 1 means the description belongs to the category and y_i = 0 means it does not.
Assuming the prediction is a linear combination of the features, the linear model relates the real-valued prediction z to the independent variable x by

z = w^T x + b

To convert the real-valued z into a value in (0, 1), z is passed through the logistic (sigmoid) function

y = 1 / (1 + e^{-z})

so the probability that the long description belongs to the category is

p(y = 1 | x) = 1 / (1 + e^{-(w^T x + b)})

The above formula can be rewritten as

ln( y / (1 - y) ) = w^T x + b

and obviously

p(y = 1 | x) = e^{w^T x + b} / (1 + e^{w^T x + b}),  p(y = 0 | x) = 1 / (1 + e^{w^T x + b})

The parameters w and b can be estimated by maximum likelihood. The likelihood function of logistic regression is

L(w, b) = Π_{i=1}^{N} p(y = 1 | x_i)^{y_i} · p(y = 0 | x_i)^{1 - y_i}

Taking the logarithm for convenience of calculation,

ℓ(w, b) = Σ_{i=1}^{N} [ y_i (w^T x_i + b) - ln(1 + e^{w^T x_i + b}) ]

Maximizing the likelihood is equivalent to minimizing the objective function

J(w, b) = Σ_{i=1}^{N} [ ln(1 + e^{w^T x_i + b}) - y_i (w^T x_i + b) ]

The maximum-likelihood estimate easily overfits, so a regularization term can be added to the objective function; the common choices are L1 and L2 regularization. Given the prior characteristics of the material data, an L2 regular term is added:

J_reg(w, b) = J(w, b) + (λ / 2) ||w||²
This is an unconstrained convex optimization problem; by convex-optimization theory it can be solved with the Newton-Raphson method. As the formula shows, the solution involves all training samples, and each Newton-Raphson iteration requires a matrix inversion. Because text features are high-dimensional, approximate algorithms such as L-BFGS or Newton-CG can be used instead to reduce the computation; the invention adopts the L-BFGS algorithm.
The L-BFGS algorithm is among the most common methods for unconstrained nonlinear optimization; it converges quickly and has low memory overhead, making it suitable for large-scale computation.
Assume the unconstrained problem is

min f(x), x ∈ R^n

The second-order Taylor expansion of f(x) at x^(k) is

f(x) ≈ f(x^(k)) + ∇f(x^(k))^T (x - x^(k)) + (1/2)(x - x^(k))^T H_k (x - x^(k))

where H_k = ∇²f(x^(k)) is the Hessian matrix. Since an extreme point of f(x) satisfies ∇f(x) = 0, neglecting the remainder and setting the gradient of the expansion to zero gives

∇f(x^(k)) + H_k (x - x^(k)) = 0

so the iterative formula of Newton's method is

x^(k+1) = x^(k) - H_k^{-1} ∇f(x^(k))
As the above equation shows, every iteration of Newton's method must compute the inverse of the Hessian matrix at x^(k), and the Hessian is not guaranteed to be positive definite. Quasi-Newton methods therefore approximate the inverse of the Hessian with a matrix that contains no second derivatives; different ways of constructing the approximation matrix yield different quasi-Newton methods.
The BFGS algorithm uses a matrix B_{k+1} to approximate the Hessian H_{k+1}. With

s_k = x^(k+1) - x^(k),  y_k = ∇f(x^(k+1)) - ∇f(x^(k))

the BFGS update formula is

B_{k+1} = B_k - (B_k s_k s_k^T B_k) / (s_k^T B_k s_k) + (y_k y_k^T) / (y_k^T s_k)

Let D_k denote the approximation of the inverse Hessian B_k^{-1} and ρ_k = 1 / (y_k^T s_k); applying the Sherman-Morrison formula, the update can be rewritten as

D_{k+1} = (I - ρ_k s_k y_k^T) D_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T

Rather than storing the dense matrix D_k, L-BFGS keeps only the most recent m pairs (s_k, y_k) and reconstructs the product D_k ∇f(x^(k)) from them at each iteration.
The pseudo-code for the L-BFGS algorithm is as follows:
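The original pseudo-code listing is not reproduced on this page. As a stand-in, the following is an illustrative Python sketch of the standard L-BFGS two-loop recursion with a simple backtracking (Armijo) line search; it is an assumption-laden teaching version, not the patent's exact listing:

```python
import numpy as np

def lbfgs(f_grad, x0, m=10, max_iter=100, tol=1e-8):
    """Minimize f via L-BFGS. f_grad(x) must return (f(x), grad f(x))."""
    x = np.asarray(x0, dtype=float)
    s_hist, y_hist = [], []                  # last m (s_k, y_k) pairs
    f, g = f_grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Two-loop recursion: build d = -(approx. inverse Hessian) @ g
        q = g.copy()
        alphas = []
        for s, yv in zip(reversed(s_hist), reversed(y_hist)):
            rho = 1.0 / (yv @ s)
            a = rho * (s @ q)
            q = q - a * yv
            alphas.append((rho, a))
        if s_hist:                           # initial scaling of H_0
            q = q * ((s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1]))
        for (s, yv), (rho, a) in zip(zip(s_hist, y_hist), reversed(alphas)):
            b = rho * (yv @ q)
            q = q + (a - b) * s
        d = -q
        # Backtracking line search on the Armijo condition
        t = 1.0
        while True:
            f_new, g_new = f_grad(x + t * d)
            if f_new <= f + 1e-4 * t * (g @ d) or t < 1e-12:
                break
            t *= 0.5
        s_new, y_new = t * d, g_new - g
        if s_new @ y_new > 1e-12:            # keep only curvature-positive pairs
            s_hist.append(s_new)
            y_hist.append(y_new)
            if len(s_hist) > m:              # discard the oldest pair
                s_hist.pop(0)
                y_hist.pop(0)
        x, f, g = x + t * d, f_new, g_new
    return x

# Quadratic test problem with its minimum at (1, 2)
fg = lambda x: (((x - np.array([1.0, 2.0])) ** 2).sum(),
                2 * (x - np.array([1.0, 2.0])))
xmin = lbfgs(fg, np.zeros(2))
```

In practice one would call a vetted implementation such as `scipy.optimize.minimize(..., method="L-BFGS-B")` rather than hand-rolling the recursion.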
The classification of material-data subclasses can accurately analyze the problems existing in the data, such as inconsistent letter case, mixed full-width and half-width characters, connectors, non-uniform units, and similar pronunciations; apply a reasonable data preprocessing process to normalize and standardize the data; convert the data into feature-vector form; and classify it with the logistic regression + L2 regularization + L-BFGS optimization method. The above description covers only a preferred embodiment of the present invention, but the scope of the invention is not limited thereto; equivalent alternatives or modifications of the technical solutions and the inventive concept by any person skilled in the art fall within the scope of the present invention.
Claims (7)
1. A method for classifying subclass names corresponding to long descriptions of material data, comprising the following steps:
S1: raw material data: read in the raw material data;
S2: data preprocessing: preprocess the read-in raw material data and standardize it;
S3: category-to-number: encode the category column of the raw material data into numbers;
S4: sample-set division: divide the sample set into a training set and a test set;
S5: feature vectorization: convert the long material descriptions into feature vectors;
S6: classification: learn an objective function that maps each feature vector to a predefined class label;
S7: classification-result evaluation: evaluate the classification results by the classification-result metrics.
2. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein said S2 comprises the following steps:
S21: unify the units and connectors in the raw material data;
S22: remove brackets and slashes;
S23: perform Chinese word segmentation, then convert the text to pinyin;
S24: convert upper case to lower case and full-width characters to half-width.
3. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein the raw material data in S3 are the long material descriptions and the subclass names.
4. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein in S4 the sample set is divided so that the ratio of training-set samples to test-set samples is 7:3.
5. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein the long material descriptions in S5 are material text data and the feature-vectorization method is the tf-idf algorithm.
6. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein the classification methods in S6 include logistic regression, naive Bayes, decision trees, support vector machines, K-nearest neighbors, random forests, GBDT, XGBoost, neural networks, and the like.
7. The method for classifying subclass names corresponding to long descriptions of material data according to claim 1, wherein the metrics for evaluating the classification results in S7 include accuracy, recall, and the F1 value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910877234.7A CN110619363A (en) | 2019-09-17 | 2019-09-17 | Classification method for subclass names corresponding to long description of material data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910877234.7A CN110619363A (en) | 2019-09-17 | 2019-09-17 | Classification method for subclass names corresponding to long description of material data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110619363A true CN110619363A (en) | 2019-12-27 |
Family
ID=68923609
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910877234.7A Pending CN110619363A (en) | 2019-09-17 | 2019-09-17 | Classification method for subclass names corresponding to long description of material data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110619363A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11790033B2 (en) | 2020-09-16 | 2023-10-17 | International Business Machines Corporation | Accelerated Quasi-Newton methods on analog crossbar hardware |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491390A (en) * | 2018-03-28 | 2018-09-04 | 江苏满运软件科技有限公司 | A kind of main line logistics goods title automatic recognition classification method |
CN108777674A (en) * | 2018-04-24 | 2018-11-09 | 东南大学 | A kind of detection method for phishing site based on multi-feature fusion |
CN109165294A (en) * | 2018-08-21 | 2019-01-08 | 安徽讯飞智能科技有限公司 | Short text classification method based on Bayesian classification |
CN109271517A (en) * | 2018-09-29 | 2019-01-25 | 东北大学 | IG TF-IDF Text eigenvector generates and file classification method |
CN109308485A (en) * | 2018-08-02 | 2019-02-05 | 中国矿业大学 | A kind of migration sparse coding image classification method adapted to based on dictionary domain |
-
2019
- 2019-09-17 CN CN201910877234.7A patent/CN110619363A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491390A (en) * | 2018-03-28 | 2018-09-04 | 江苏满运软件科技有限公司 | A kind of main line logistics goods title automatic recognition classification method |
CN108777674A (en) * | 2018-04-24 | 2018-11-09 | 东南大学 | A kind of detection method for phishing site based on multi-feature fusion |
CN109308485A (en) * | 2018-08-02 | 2019-02-05 | 中国矿业大学 | A kind of migration sparse coding image classification method adapted to based on dictionary domain |
CN109165294A (en) * | 2018-08-21 | 2019-01-08 | 安徽讯飞智能科技有限公司 | Short text classification method based on Bayesian classification |
CN109271517A (en) * | 2018-09-29 | 2019-01-25 | 东北大学 | IG TF-IDF Text eigenvector generates and file classification method |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11790033B2 (en) | 2020-09-16 | 2023-10-17 | International Business Machines Corporation | Accelerated Quasi-Newton methods on analog crossbar hardware |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10685044B2 (en) | Identification and management system for log entries | |
CN107577785B (en) | Hierarchical multi-label classification method suitable for legal identification | |
CN108320171B (en) | Hot-sold commodity prediction method, system and device | |
WO2020199591A1 (en) | Text categorization model training method, apparatus, computer device, and storage medium | |
US9646262B2 (en) | Data intelligence using machine learning | |
JP2020115346A (en) | AI driven transaction management system | |
Park et al. | Explainability of machine learning models for bankruptcy prediction | |
CN107622326B (en) | User classification and available resource prediction method, device and equipment | |
US20170024662A1 (en) | Data driven classification and troubleshooting system and method | |
CN112527970B (en) | Data dictionary standardization processing method, device, equipment and storage medium | |
US20230334119A1 (en) | Systems and techniques to monitor text data quality | |
Abakarim et al. | Towards an efficient real-time approach to loan credit approval using deep learning | |
CN110619363A (en) | Classification method for subclass names corresponding to long description of material data | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN117290404A (en) | Method and system for rapidly searching and practical main distribution network fault processing method | |
Sana et al. | Data transformation based optimized customer churn prediction model for the telecommunication industry | |
Zhang et al. | Can sentiment analysis help mimic decision-making process of loan granting? A novel credit risk evaluation approach using GMKL model | |
GUMUS et al. | Stock market prediction by combining stock price information and sentiment analysis | |
CN114443840A (en) | Text classification method, device and equipment | |
Anastasopoulos et al. | Computational text analysis for public management research: An annotated application to county budgets | |
CN112100370B (en) | Picture-trial expert combination recommendation method based on text volume and similarity algorithm | |
Hepburn et al. | Proper losses for learning with example-dependent costs | |
AU2020104034A4 (en) | IML-Cloud Data Performance: Cloud Data Performance Improved using Machine Learning. | |
CN116932487B (en) | Quantized data analysis method and system based on data paragraph division | |
US11816427B1 (en) | Automated data classification error correction through spatial analysis using machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191227 |
|
RJ01 | Rejection of invention patent application after publication |