CN116611001B - Near infrared spectrum data classification method based on multidimensional self-adaptive incremental graph

Info

Publication number: CN116611001B
Application number: CN202310883092.1A
Authority: CN
Inventors: 信晓伟; 宫会丽; 贾俊华; 高小燕; 胡若彤; 丁香乾
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2023-07-19
Filing date: 2023-07-19
Publication date: 2023-10-03
Anticipated expiration: 2043-07-19
Also published as: CN116611001A

Abstract

The invention discloses a near infrared spectrum data classification method based on a multidimensional self-adaptive incremental graph. Near infrared spectrum data of a sample to be detected are preprocessed and subjected to feature dimension reduction; a multidimensional self-adaptive similarity measurement is carried out on the dimension-reduced feature data and the preprocessed spectrum data; a spectrum incremental graph is constructed from the measured similarity values; label information of the labeled spectrum samples is propagated to the unlabeled spectrum samples along the spectrum incremental graph; and finally the spectrum samples are classified.

Description

Near infrared spectrum data classification method based on multidimensional self-adaptive incremental graph

Technical Field

The invention belongs to the technical field of industrial information and data processing, and particularly relates to a near infrared spectrum data classification method based on a multidimensional self-adaptive incremental graph.

Background

As artificial intelligence (AI) technology penetrates every industry, using AI to provide new solutions has become a trend in modern industry, which is evolving toward automation, informatization, and intelligence. In certain industrial fields, classifying the raw and auxiliary materials required or produced in the production process by brand, place of origin, part, and grade, and classifying the final products by brand, is important for the innovation research and quality evaluation of subsequent industrial products.

Near infrared spectroscopy (NIRS) is the most popular technique for such classification because it is rapid, non-destructive, and environmentally friendly. With the development of the Internet of Things and sensing technology, industrial data sensing has made breakthrough progress, and enterprises can easily collect large amounts of unlabeled industrial near infrared spectrum data.

A good data-driven classification model requires a large amount of label information, but obtaining expert labels consumes considerable manpower, material, and financial resources, so a large amount of label information is difficult to obtain in practice. If only a small number of labeled spectrum samples are used for model training, the model often suffers from low precision and poor generalization, and the large stock of unlabeled industrial spectrum data is wasted. Therefore, how to combine unlabeled and labeled spectrum samples, fully mine the feature information of the unlabeled samples, and improve model performance has become a concern in both academia and industry.

Disclosure of Invention

The invention provides a near infrared spectrum data classification method based on a multidimensional self-adaptive incremental graph. The method makes full use of large amounts of unlabeled industrial near infrared spectrum data to achieve the modeling target when most of the spectrum data are unlabeled, meets the classification requirements for the raw materials, auxiliary materials, and final products of industrial production, reduces the dependence on domain experts, and effectively saves cost and time.

The invention is realized by adopting the following technical scheme:

The near infrared spectrum data classification method based on the multidimensional self-adaptive incremental graph comprises the following steps:

S1, collecting and preprocessing near infrared spectrum data of a sample to be detected, and labeling a part of the spectrum samples by category;

S2, performing feature dimension reduction on the preprocessed spectrum data;

S3, performing multidimensional self-adaptive similarity measurement based on the feature data after dimension reduction and the preprocessed spectrum data to obtain a similarity value for each pair of spectrum samples A and B; the similarity value combines the Euclidean distance of the two spectrum samples after dimension reduction, the cosine distance of the preprocessed spectrum samples, and the spectral angular distance of the preprocessed spectrum samples, using three adaptive parameters of similarity;

S4, constructing an adjacency matrix using the similarity values, and constructing a spectrum incremental graph model using the adjacency matrix;

S5, propagating the label information of the labeled spectrum samples to the unlabeled spectrum samples based on the spectrum incremental graph;

S6, classifying the spectrum samples processed in S5;

S7, when a new sample exists, preprocessing the new sample and reducing its feature dimension, calculating its similarity value with each existing spectrum sample based on S3, passing the label information of the spectrum sample with the largest similarity value to the new sample, and classifying the new sample.

In some embodiments of the present invention, step S5 specifically includes:

defining a $(l+u) \times (l+u)$ probability transition matrix $P$, wherein the element $P_{ij}$ is defined as a transfer coefficient representing the transition probability from node $i$ to node $j$: $P_{ij} = w_{ij} / \sum_{k=1}^{l+u} w_{ik}$, wherein $w_{ij}$ is the incremental graph weight from node $i$ to node $j$;

defining an $l \times C$ matrix $Y_L$, wherein $C$ is the number of classes, $l$ is the number of labeled samples, and the $i$-th row is the one-hot label vector of the $i$-th spectrum sample;

defining a $u \times C$ matrix $Y_U$, wherein $u$ is the number of unlabeled samples;

merging the matrices $Y_L$ and $Y_U$ to obtain a soft label matrix $F = [Y_L; Y_U]$;

performing the propagation $F = PF$, so that each spectrum sample passes its label information to the other spectrum samples without label information, the transition probability being determined by $P$;

resetting the labels of the labeled spectrum samples in $F$: $F_L = Y_L$, wherein $F_L$ denotes the rows of $F$ for the $l$ labeled spectrum samples;

refreshing the labels of all nodes in turn until a convergence condition is met, so that each spectrum sample has unique label information.

In some embodiments of the present invention, step S3 further includes: determining the adaptive parameters of similarity according to the label information of the labeled spectrum samples, and optimizing them to determine the final adaptive parameters, comprising:

for each category, calculating the cosine distances and the spectral angular distances of all spectrum samples with label information, and the Euclidean distances of the spectrum samples after dimension reduction, and averaging them to obtain the distance sets of all categories; the three adaptive parameters of similarity are then obtained from these distance sets, wherein $C$ is the number of categories.

In some embodiments of the present invention, the feature dimension reduction is performed on the preprocessed spectrum data in S2 by using a PCA or TSNE dimension reduction method.

In some embodiments of the present invention, the processed spectrum sample is classified in step S6 using an SVM, KNN, or RF classification algorithm.

Compared with the prior art, the invention has the following advantages and positive effects. In the near infrared spectrum data classification method based on the multidimensional self-adaptive incremental graph, near infrared spectrum data are collected from samples to be detected of different brands, places of origin, parts, and grades that are required or produced in the industrial production process, and the spectrum data are preprocessed. Feature dimension reduction is then performed on the preprocessed spectrum data, a multidimensional self-adaptive similarity measurement is carried out on the dimension-reduced feature data and the preprocessed spectrum data, and the adaptive parameters of similarity are determined from the labeled samples. An adjacency matrix is constructed from the similarity values, and a spectrum incremental graph model is constructed from the adjacency matrix. The label information of the labeled spectrum samples is propagated to the unlabeled spectrum samples based on the spectrum incremental graph, so that each spectrum sample has unique label information. Finally, all spectra are classified using a classification algorithm. When a new spectrum sample is added, the preprocessing and feature dimension reduction steps are repeated for it, its similarity value with each existing spectrum sample is calculated, the label information of the spectrum sample with the largest similarity value is passed to the new sample, and the new sample is classified using the classification method.
Based on the classification method provided by the invention, near infrared spectrum data of different brands, places of origin, parts, and grades that are required or produced during industrial production can be classified effectively even when only limited labels are available. The similarity of near infrared spectrum samples is measured comprehensively from multiple perspectives, and the three adaptive parameters of similarity are adjusted adaptively according to prior knowledge, so that an incremental graph model that fully represents the relationships among samples is constructed. The method has low time complexity, is easy to implement, supports an extensible incremental learning mode, and provides technical and data support for subsequent industrial data analysis and intelligent decision-making.

Other features and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention, which is to be read in connection with the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic illustration of the processing steps of a multi-dimensional adaptive incremental graph-based spectral data processing method according to the present invention;

FIG. 2 is a schematic flow chart of spectrum data processing based on a multidimensional adaptive incremental graph.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In recent years, semi-supervised learning has become a mainstream technique for learning from unlabeled data; it mines the information in unlabeled data in depth, guided by a small amount of labeled data. Semi-supervised learning can therefore address the shortage of labels for spectrum data in the industrial field, since machine learning models can be trained on a small amount of labeled spectrum data together with a large amount of unlabeled spectrum data.

Current semi-supervised classification methods include generative methods, semi-supervised support vector machines, graph-based methods, and disagreement-based methods. Graph-based methods have been widely studied by scholars because of their low computational complexity, flexibility, and scalability to large-scale data.

In some fields, such as protein structure prediction, an implicit underlying graph already exists, so graph-based methods can be applied directly. In most other fields, however, the sample data are not naturally represented as a graph, and a graph must be constructed beforehand so that the problem fits the graph-based semi-supervised framework. Current mainstream unsupervised graph construction methods ignore label information when learning the graph from samples, which wastes valuable prior knowledge. In addition, traditional graph construction methods (such as KNN) rely mainly on the Euclidean distance and cannot capture the global structure of spectrum data, and their models scale poorly.

Moreover, industrial spectrum data are high-dimensional, overlapping, nonlinear, and redundant, so graph construction may suffer from the curse of dimensionality, from difficulty in presenting information, and from the failure of traditional distance calculations.

In view of these problems, the invention provides a near infrared spectrum data classification method based on a multidimensional self-adaptive incremental graph, which makes full use of prior information and a multidimensional self-adaptive weight measure to construct an incremental graph model, and thereby addresses the shortage of labels for industrial near infrared spectrum data.

Specifically, the near infrared spectrum data classification method based on the multidimensional adaptive incremental graph provided by the invention, as shown in fig. 1 and 2, comprises the following steps:

S1: Collecting and preprocessing near infrared spectrum data of the sample to be detected, and labeling a part of the spectrum samples by category.

A set of industrial near infrared spectrum samples is constructed as the subsequent input sequence, and the set of corresponding labels is defined.

S2: Performing feature dimension reduction on the preprocessed spectrum data.

A feature set of feature vectors is constructed by performing feature dimension reduction on the preprocessed spectrum data using a PCA or TSNE dimension reduction method.

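S3: Carrying out multidimensional self-adaptive similarity measurement based on the feature data after dimension reduction and the preprocessed spectrum data.

For two spectrum samples A and B, the multidimensional adaptive similarity value combines the Euclidean distance of the two samples after dimension reduction with the cosine distance and the spectral angular distance of the preprocessed samples, using three adaptive parameters of similarity.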

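The Euclidean distance of the two spectrum samples after dimension reduction is calculated using the following formula, where $a'_k$ and $b'_k$ denote the $k$-th of the $m$ dimension-reduced features of samples $A$ and $B$:

$$d_E(A,B) = \sqrt{\sum_{k=1}^{m} \left(a'_k - b'_k\right)^2};$$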

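The cosine distance of the preprocessed spectrum samples is calculated using the following formulas, where $a$ and $b$ denote the preprocessed spectrum vectors of samples $A$ and $B$:

$$\cos(A,B) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},$$

$$d_{\cos}(A,B) = 1 - \cos(A,B);$$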

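The spectral angular distance of the preprocessed spectrum samples is calculated using the following formula, where $a$ and $b$ denote the preprocessed spectrum vectors of samples $A$ and $B$:

$$d_{SA}(A,B) = \arccos\left(\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}\right).$$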
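For illustration only, the three distances named in step S3 can be sketched in Python as follows. The function names are hypothetical, and the code assumes the standard definitions of the Euclidean distance, the cosine distance (one minus the cosine of the angle between the vectors), and the spectral angle (the arccosine of that cosine); non-zero spectrum vectors are assumed.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def spectral_angle(u, v):
    """Angle (in radians) between two spectra."""
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(c)

# Toy vectors: the first pair stands in for dimension-reduced features,
# the second pair for preprocessed spectra.
d_e = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_cos = cosine_distance([1.0, 0.0], [0.0, 1.0])
d_sa = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

In this sketch the Euclidean distance would be applied to the dimension-reduced features and the other two distances to the preprocessed spectra, matching the cross-space calculation described in the text.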

In the above, the three adaptive parameters of similarity are optimized according to the label information, and their final values are determined. Specifically, assuming that there are $C$ categories in total, the cosine distances and the spectral angular distances of all spectrum samples with label information, and the Euclidean distances of the spectrum samples after dimension reduction, are calculated for each category and averaged to obtain the distance sets of all categories, from which the three adaptive parameters are obtained.
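The per-category averaging described above can be illustrated with the following minimal Python sketch. It assumes that the average is taken over all pairs of labeled samples within a category, which the text does not state explicitly; the function name `class_mean_distance` and the scalar toy data are hypothetical. How the adaptive parameters are then derived from these averages is not shown here.

```python
from itertools import combinations

def class_mean_distance(samples, labels, distance):
    """Mean pairwise distance among the labeled samples of each category.

    Assumption: 'averaging' means the mean over all within-category pairs.
    Categories with fewer than two labeled samples are skipped.
    """
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    means = {}
    for label, members in by_class.items():
        pairs = list(combinations(members, 2))
        if pairs:
            means[label] = sum(distance(a, b) for a, b in pairs) / len(pairs)
    return means

# Toy example with scalar "samples" and absolute difference as the distance.
means = class_mean_distance(
    [0.0, 2.0, 10.0, 11.0, 15.0],
    ["a", "a", "b", "b", "b"],
    lambda x, y: abs(x - y),
)
```

The same helper could be called three times, once per distance (Euclidean, cosine, spectral angle), to obtain one distance set per measure.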

Compared with existing methods, which use only a single distance to calculate similarity during graph construction, such as formula (1), (3) or (4), the invention uses formula (5) to combine multiple distances and uses the label information to adjust the three adaptive parameters of similarity adaptively. The Euclidean distance is calculated in the feature space and the other distances are calculated in the original spectrum space; this cross-space calculation can significantly improve the representation capability of the sample graph.
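One possible way to combine the three distances is sketched below in Python. The weighted-sum form, the function name `multi_distance_value`, and the parameter names `alpha`, `beta`, and `gamma` are assumptions made for illustration; the exact combination used by the invention is given by its own formula and may differ.

```python
import math

def multi_distance_value(feat_a, feat_b, spec_a, spec_b, alpha, beta, gamma):
    """Combine three distances using three adaptive parameters.

    Assumption: a plain weighted sum; not necessarily the patented formula.
    feat_* are dimension-reduced features, spec_* are preprocessed spectra.
    """
    # Euclidean distance in the dimension-reduced feature space.
    d_e = math.sqrt(sum((x - y) ** 2 for x, y in zip(feat_a, feat_b)))
    # Cosine of the angle between the preprocessed spectra, clamped for acos.
    dot = sum(x * y for x, y in zip(spec_a, spec_b))
    norm = math.sqrt(sum(x * x for x in spec_a)) * math.sqrt(sum(y * y for y in spec_b))
    cos_ab = max(-1.0, min(1.0, dot / norm))
    d_cos = 1.0 - cos_ab          # cosine distance
    d_sa = math.acos(cos_ab)      # spectral angular distance
    return alpha * d_e + beta * d_cos + gamma * d_sa

value = multi_distance_value(
    [0.0, 0.0], [3.0, 4.0],   # reduced features
    [1.0, 0.0], [0.0, 1.0],   # preprocessed spectra
    alpha=0.5, beta=0.3, gamma=0.2,
)
```

Because the Euclidean term and the angular terms live on different scales, a real implementation would likely need the parameters (or a prior normalization) to balance them; that choice is left open in this sketch.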

S4: Constructing an adjacency matrix using the similarity values, and constructing a spectrum incremental graph model using the adjacency matrix.

The similarity values obtained in step S3 are used to construct an adjacency matrix of the spectrum samples, and the adjacency matrix is used to construct a spectrum incremental graph model $G = (V, E, W)$, where $V$ represents the set of spectrum samples, $E$ represents the set of edges, and $W$ represents the set of weights. In the incremental graph, a vertex represents a sample, an edge represents the relationship between two sample points, and the weight of an edge represents the similarity distance between the two samples; for spectrum sample $i$ and spectrum sample $j$, the weight $w_{ij}$ is the similarity distance between them calculated in step S3.
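The graph construction of step S4 can be sketched in Python as below. The helper names `build_adjacency` and `add_node` are hypothetical, `pair_value` stands for whatever function returns the step S3 similarity value, and the absolute difference used in the example is only a toy stand-in. The incremental extension shown is one plausible reading of "incremental graph": a new sample adds one row and one column without recomputing existing entries.

```python
def build_adjacency(samples, pair_value):
    """Symmetric adjacency matrix with a zero diagonal."""
    n = len(samples)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = pair_value(samples[i], samples[j])
            W[i][j] = w
            W[j][i] = w
    return W

def add_node(W, samples, new_sample, pair_value):
    """Extend W in place with one new sample; existing entries are untouched."""
    new_row = [pair_value(new_sample, s) for s in samples]
    for row, w in zip(W, new_row):
        row.append(w)              # new column entry for each existing node
    W.append(new_row + [0.0])      # new row, zero self-weight
    samples.append(new_sample)
    return W

toy_value = lambda a, b: abs(a - b)   # toy stand-in for the S3 value
samples = [1.0, 2.0, 4.0]
W = build_adjacency(samples, toy_value)
add_node(W, samples, 7.0, toy_value)
```

Adding a node costs one pass over the existing samples, whereas rebuilding the whole matrix would cost a pass over all pairs.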

S5: Propagating the label information of the labeled spectrum samples to the unlabeled spectrum samples based on the spectrum incremental graph.

The core idea of this step is that similar spectrum data should have the same label. Guided by the topology of the spectrum incremental graph, information is captured from the labeled spectrum samples and then propagated to the unlabeled spectrum samples on the graph; the greater the weight of an edge, the more similar the two samples it connects.

The method specifically comprises the following steps:

(1) Define a $(l+u) \times (l+u)$ probability transition matrix $P$, where the element $P_{ij}$ is a transfer coefficient representing the transition probability from node $i$ to node $j$:

$$P_{ij} = \frac{w_{ij}}{\sum_{k=1}^{l+u} w_{ik}},$$

where $w_{ij}$ is the incremental graph weight from node $i$ to node $j$.

(2) Define an $l \times C$ matrix $Y_L$, where $C$ is the number of classes, $l$ is the number of labeled samples, and the $i$-th row is the one-hot label vector of the $i$-th spectrum sample; that is, if the $i$-th sample belongs to class $c$, the $c$-th value of its label vector is 1 and the remaining values are 0.

(3) Define a $u \times C$ matrix $Y_U$, where $u$ is the number of unlabeled samples.

(4) Merge the matrices $Y_L$ and $Y_U$ to obtain a soft label matrix $F = [Y_L; Y_U]$.

(5) Perform the propagation $F = PF$, so that each spectrum sample passes its label information to the other spectrum samples without label information, with the transition probability determined by $P$; the more similar two nodes are, the more likely they are to have the same label information.

(6) Reset the labels of the labeled spectrum samples in $F$: $F_L = Y_L$, where $F_L$ denotes the rows of $F$ for the $l$ labeled spectrum samples. Because the labels of the labeled data are predetermined and cannot be changed, they must be returned to their original values after each propagation.

(7) Refresh the labels of all nodes in turn until a convergence condition is met, so that each spectrum sample has unique label information.
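Steps (1) to (7) can be sketched as the following minimal Python routine. It is an illustration under stated assumptions, not the patented implementation: the function name `propagate_labels` is hypothetical, the weights in `W` are assumed to be affinities in which a larger value means more similar, the transition matrix is row-normalized, and convergence is taken as the largest change in the soft label matrix falling below `tol`.

```python
def propagate_labels(W, labels, num_classes, max_iter=1000, tol=1e-9):
    """Label propagation on a weighted graph.

    W: square list-of-lists of non-negative edge weights (larger = more similar).
    labels: class index for labeled nodes, None for unlabeled nodes.
    Returns one hard class index per node.
    """
    n = len(W)
    # (1) Row-normalize the weights into transition probabilities.
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total if total > 0 else 0.0 for w in row])

    def one_hot(c):
        return [1.0 if k == c else 0.0 for k in range(num_classes)]

    # (2)-(4) Soft label matrix: one-hot rows for labeled nodes, zeros otherwise.
    F = [one_hot(c) if c is not None else [0.0] * num_classes for c in labels]

    for _ in range(max_iter):
        # (5) Propagate: F <- P F.
        new_F = [[sum(P[i][k] * F[k][c] for k in range(n))
                  for c in range(num_classes)] for i in range(n)]
        # (6) Reset the rows of the labeled nodes to their original labels.
        for i, c in enumerate(labels):
            if c is not None:
                new_F[i] = one_hot(c)
        # (7) Stop once the soft labels no longer change.
        delta = max(abs(new_F[i][c] - F[i][c])
                    for i in range(n) for c in range(num_classes))
        F = new_F
        if delta < tol:
            break
    return [max(range(num_classes), key=F[i].__getitem__) for i in range(n)]

# Four nodes: 0 and 1 are strongly linked, as are 2 and 3.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
predicted = propagate_labels(W, [0, None, None, 1], num_classes=2)
```

In the toy graph each unlabeled node takes the label of the labeled node it is most strongly connected to.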

Existing spectrum classification methods use only the limited labeled spectrum samples and waste a large number of unlabeled spectrum samples. By propagating the label information of the labeled spectrum samples to the unlabeled spectrum samples in the manner provided by the invention, the unlabeled spectrum samples can be fully utilized, and training efficiency is markedly improved.

S6: Classifying the spectrum samples processed in step S5.

All spectrum samples are classified using a classical machine learning classification algorithm, such as SVM, KNN, or RF, and classification performance metrics such as accuracy, precision, recall, and F1-score are calculated.
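The performance metrics named above can be computed as in the following Python sketch. The function name `classification_metrics` is hypothetical, and macro averaging over classes is an assumption; the text does not say which averaging scheme is used.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }

metrics = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```

For the toy labels above, three of four predictions are correct, so the accuracy is 0.75.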

S7: When a new sample exists, preprocessing the new sample and reducing its feature dimension, calculating its similarity value with each existing spectrum sample based on S3, passing the label information of the spectrum sample with the largest similarity value to the new sample, and classifying the new sample.

When a new sample is added, its spectrum data are first preprocessed, step S2 is repeated, and step S3 is used to calculate its similarity value with each existing sample; the label of the sample with the largest similarity value is then passed to the new sample, and the classification algorithm is used again for prediction. Existing spectrum classification methods do not support incremental learning or lifelong learning, so the model must be retrained whenever a new sample is added; the classification method provided by the invention, by contrast, makes full use of the extensibility of the incremental graph and does not require the graph to be rebuilt.
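The label hand-over of step S7 can be sketched in Python as follows. The function name `label_new_sample` is hypothetical, `pair_value` stands for the step S3 similarity value, and the toy value used in the example (larger when two scalars are closer) is only a stand-in chosen so that "largest value" means "most similar".

```python
def label_new_sample(new_sample, samples, labels, pair_value):
    """Give the new sample the label of the existing sample with the largest value."""
    values = [pair_value(new_sample, s) for s in samples]
    best = max(range(len(samples)), key=values.__getitem__)
    return labels[best], values

# Toy stand-in for the S3 similarity value on scalar "samples".
toy_value = lambda a, b: 1.0 / (1.0 + abs(a - b))

existing = [1.0, 5.0, 9.0]
existing_labels = ["A", "B", "C"]
new_label, values = label_new_sample(5.5, existing, existing_labels, toy_value)
```

Only one pass over the existing samples is needed for each new sample, which is consistent with the statement that the graph does not have to be rebuilt.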

Compared with the traditional near infrared spectrum classification method, the near infrared spectrum data classification method based on the multi-dimensional self-adaptive incremental graph provided by the invention has the advantages that a large number of unlabeled spectrum samples are fully utilized, so that spectrum characteristics can be better mined, and the precision is higher. Meanwhile, compared with other semi-supervised classification methods, the method provided by the invention is based on a multidimensional self-adaptive incremental graph, can mine data information from multiple angles, is more suitable for industrial near infrared spectrum scenes, and has low time complexity and expandability.

The spectrum data classification method based on the multidimensional self-adaptive incremental graph provided by the invention is compared with existing classification methods below.

Table 1 shows the results of existing classification using only the labeled spectrum data; the classification accuracy is relatively low.

Table 1

Table 2 compares the near infrared spectrum data classification method using the multidimensional self-adaptive incremental graph provided by the invention with other single-distance classification methods; the classification performance of the method of the invention is superior to that of the single-distance methods.

Table 2

* The near infrared spectrum data classification method of the multidimensional self-adaptive incremental graph provided by the invention.

Table 3 compares, on different metrics, the classification performance of the near infrared spectrum data classification method using the multidimensional self-adaptive incremental graph provided by the invention with that of other existing classification methods.

Table 3

It should be noted that, in a specific implementation, the above method may be implemented by a processor in hardware form executing computer-executable instructions in software form stored in a memory, which is not described in detail here. The program corresponding to the executed actions may be stored in software form in a computer readable storage medium of the system, so that the processor can invoke and execute the corresponding operations.

The computer readable storage medium above may include volatile memory, such as random access memory; but may also include non-volatile memory such as read-only memory, flash memory, hard disk, or solid state disk; combinations of the above types of memories may also be included.

The processor referred to above may be a general term for a plurality of processing elements. For example, the processor may be a central processing unit, or may be other general purpose processors, digital signal processors, application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or may be any conventional processor or the like, but may also be a special purpose processor.

It should be noted that the above description does not limit the invention, and the invention is not limited to the above examples; variations, modifications, additions, or substitutions made within the spirit and scope of the invention also fall within the scope of the invention.

Claims

1. The near infrared spectrum data classification method based on the multidimensional self-adaptive incremental graph is characterized by comprising the following steps of:

S3, performing multidimensional self-adaptive similarity measurement based on the feature data after dimension reduction and the preprocessed spectrum data to obtain a similarity value for each pair of spectrum samples A and B; the similarity value combines the Euclidean distance of the two spectrum samples after dimension reduction, the cosine distance of the preprocessed spectrum samples, and the spectral angular distance of the preprocessed spectrum samples, using three adaptive parameters of similarity;

S6, classifying the spectrum samples processed in S5;

S7, when a new sample exists, preprocessing the new sample and reducing its feature dimension, calculating its similarity value with each existing spectrum sample based on S3, passing the label information of the spectrum sample with the largest similarity value to the new sample, and classifying the new sample;

the step S5 specifically comprises the following steps:

defining a probability transition matrix $P$, wherein the element $P_{ij}$ is defined as a transfer coefficient representing the transition probability from node $i$ to node $j$: $P_{ij} = w_{ij} / \sum_{k} w_{ik}$, wherein $w_{ij}$ is the incremental graph weight from node $i$ to node $j$;

defining an $l \times C$ matrix $Y_L$, wherein $C$ is the number of classes, $l$ is the number of labeled samples, and the $i$-th row is the one-hot label vector of the $i$-th spectrum sample;

merging the matrices $Y_L$ and $Y_U$ to obtain a soft label matrix $F$;

refreshing the labels of all nodes in turn until a convergence condition is met, so that each spectrum sample has unique label information.

2. The method of classifying near infrared spectrum data based on a multi-dimensional adaptive incremental graph according to claim 1, wherein step S3 further comprises: determining the adaptive parameters of similarity according to the label information of the labeled spectrum samples, and optimizing them to determine the final adaptive parameters.

3. The near infrared spectrum data classification method based on the multi-dimensional self-adaptive incremental graph according to claim 1, wherein the feature dimension reduction is performed on the preprocessed spectrum data by using a PCA or TSNE dimension reduction method in S2.

4. The method for classifying near infrared spectrum data based on a multi-dimensional adaptive incremental map according to claim 1, wherein the processed spectrum sample is classified in step S6 by using SVM, KNN, or RF classification algorithm.