CN116359420A

CN116359420A - Chromatographic data impurity qualitative analysis method based on clustering algorithm and application

Info

Publication number: CN116359420A
Application number: CN202310378197.1A
Authority: CN
Inventors: 李奇文; 柳彦宏; 刘鹏飞; 张�浩
Original assignee: Yantai Guogong Intelligent Technology Co ltd
Current assignee: Yantai Guogong Intelligent Technology Co ltd
Priority date: 2023-04-11
Filing date: 2023-04-11
Publication date: 2023-06-30
Anticipated expiration: 2043-04-11
Also published as: CN116359420B

Abstract

The invention relates to the technical field of impurity analysis methods, and in particular to a clustering-algorithm-based method for the qualitative analysis of impurities in chromatographic data and its application. The method comprises: calculating relative retention times; preliminary clustering, in which random-sample clustering is performed on a preset number of chromatograms and the results form a cluster set; and cluster reclassification, in which the preliminary clustering results are clustered again, and the cluster set with the most clusters obtained after all chromatograms have been processed is taken as the impurity model of the substance. A chromatogram is then compared with the impurity model to detect an abnormal impurity count, new impurities, and abnormal single-impurity areas. The method clusters and re-clusters chromatographic data using K-Means and Euclidean distance and compares chromatographic data with the impurity model. It is mainly used in fine chemical production to detect new impurities and to control the number and area of impurities; it addresses the low efficiency and errors of manual discrimination and improves working efficiency.

Description

Chromatographic data impurity qualitative analysis method based on clustering algorithm and application

Technical Field

The invention relates to a chromatographic data impurity qualitative analysis method based on a clustering algorithm and application thereof, belonging to the technical field of impurity analysis methods.

Background

In the fine chemical industry, an important means of quality control in production is to run samples on a chromatograph and use the resulting chromatogram to judge the number of impurities and the area of each single impurity. Because the industry makes many products in small batches, a large number of samples must be inspected every day, and most are inspected by chromatography, for example to detect impurities in raw materials, to control the quality of the production process, and to control the quality of finished products. Staff must visually compare the peaks in chromatograms every day and classify and count the impurities.

At present, most quality control personnel tabulate large volumes of chromatogram peak data daily or weekly to classify and confirm impurities, and many unidentified peaks require looking up the chromatogram and identifying and classifying them by eye. During chromatographic testing, each peak of the current chromatogram must also be compared with other chromatograms to confirm whether the number of impurities or the area of a single impurity has increased. This adds to the workload and to the risk of quality control errors.

For these reasons, how to reduce the work of classifying and tallying chromatograms and of comparing chromatograms with one another, and how to detect new impurities, an increased impurity count, and abnormal single-impurity areas from chromatogram data, is a technical problem to be solved.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a clustering-algorithm-based method for the qualitative analysis of impurities in chromatographic data and its application. The method makes impurity comparison convenient and fast, removes the current reliance on staff experience for impurity comparison, and supports alerts and early warnings for new and abnormal impurities as well as impurity trend analysis, thereby improving working efficiency.

The technical scheme for solving the technical problems is as follows: a qualitative analysis method for chromatographic data impurities based on a clustering algorithm comprises the following steps:

S1. Calculating relative retention time

Calculating the relative retention time of impurity peaks in an existing chromatogram of a certain substance;

S2. Preliminary clustering

Taking the impurity-peak relative retention times calculated in step S1 as samples, a preset number of chromatograms are clustered with the K-Means algorithm to obtain a cluster set, and each cluster in the set is given a cluster identifier;

the output of the preliminary clustering comprises: the cluster identifiers of all clusters and the center retention time of each cluster;

S3. Cluster reclassification

For the cluster set obtained in step S2, the data with the most impurity peaks among the existing chromatographic data of each cluster are found, and K-Means clustering is carried out with the relative retention times of those impurity peaks as samples to form an impurity data model; during clustering, the average area of the impurity peaks gathered into each cluster is calculated, and the maximum and minimum relative retention times of all impurity peaks in each cluster are taken as the impurity boundaries;

the output of the cluster reclassification comprises: the cluster identifier of each impurity peak in the impurity data model, the center retention time of each cluster, the average impurity-peak area of each cluster, and the impurity boundaries;

S4. Comparing new chromatogram data with the impurity data model

For each peak to be judged in the new chromatogram, the cluster in the impurity model whose center retention time has the shortest Euclidean distance to the peak's relative retention time is found; whether the peak's relative retention time lies within the impurity boundaries of that cluster is judged; and the area of the peak is compared with the average impurity-peak area of that cluster, so as to confirm the impurity count of the new chromatogram, any new impurities, and any abnormal single-impurity area;

the output of step S4 is: the cluster identifier of each impurity peak in the new chromatogram, the relative retention time of each impurity peak, the area of each impurity peak, and the impurity count of the new chromatogram.

Further, in step S1, the relative retention time of the impurity peak is calculated according to the following formulas (1) - (2):

$$\mathrm{Rt} = t - t \times \mathrm{coef} \qquad (2)$$

where coef is the time offset coefficient of the impurity peak, t is the retention time of the impurity peak, T is the retention time of the main peak, and Rt is the relative retention time of the impurity peak;

further, the main peak is a peak with a peak area of more than or equal to 95% in the chromatogram; the impurity peak is a peak with a peak area of more than or equal to 10ppt after the main peak is removed from the chromatogram.
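The peak selection and formula (2) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than part of the disclosure: the `Peak` type and function names are hypothetical, the 95% criterion is read as the main peak's share of the total peak area, the 10 ppt threshold is treated as a plain number in the same unit as the stored areas, and `coef` is passed in as an input because formula (1) is not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Peak:
    retention_time: float  # t
    area: float            # assumed to be in the same unit as the 10 ppt threshold


def split_peaks(peaks: List[Peak], main_share: float = 0.95,
                min_area: float = 10.0) -> Tuple[Peak, List[Peak]]:
    """Return (main peak, impurity peaks) for one chromatogram."""
    total = sum(p.area for p in peaks)
    main = max(peaks, key=lambda p: p.area)
    # Assumption: ">= 95%" means the main peak's share of the total area.
    if main.area / total < main_share:
        raise ValueError("no peak reaches the main-peak area share")
    impurities = [p for p in peaks if p is not main and p.area >= min_area]
    return main, impurities


def relative_retention_time(t: float, coef: float) -> float:
    """Formula (2): Rt = t - t * coef."""
    return t - t * coef
```

For example, `split_peaks` applied to peaks with areas 12, 9900, 5 and 40 keeps the 9900 peak as the main peak and the 12 and 40 peaks as impurity peaks.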

Further, in step S4, when the relative retention time of the peak to be judged has the shortest Euclidean distance to the center retention time of a given cluster in the impurity model and falls within the impurity boundaries of that cluster, an alarm is raised that the single-impurity area of the peak is too large if the area of the peak exceeds 120% of the average impurity-peak area of that cluster.

Further, in step S4, when the relative retention time of the peak to be judged has the shortest Euclidean distance to the center retention time of a given cluster in the impurity model but lies outside the impurity boundaries of that cluster, the peak is regarded as a new impurity and an alarm is raised.

The invention also discloses an application of the clustering-algorithm-based qualitative analysis method for chromatographic data impurities, wherein the method is applied to the analysis of impurity peaks in chromatograms during chemical production or research and development to confirm whether the number of impurities or the area of a single impurity has increased.

Further, the method is applied to the quality control of production raw materials, the quality control of production processes and the quality control of finished products.

The beneficial effects of the invention are as follows:

In the conventional approach, quality inspectors confirm retention times from experience, whereas the present method calculates retention time with a mathematical formula, which markedly improves data accuracy. Likewise, quality control personnel usually classify the impurity peaks of the many kinds of samples from experience, whereas the mathematical approach of preliminary clustering followed by cluster reclassification makes the classification more systematic and better founded and reduces human misjudgment.

The method enables impurity control alarms in fine chemical production: peak identities are confirmed by comparing chromatogram data with the impurity model, and comparing the identified peaks between chromatograms then raises an alarm when the impurity count increases, so that new impurities and abnormal single-impurity areas are discovered in time.

The method addresses the low efficiency and errors of manual discrimination; controlling impurities through a computer program improves both working efficiency and discrimination accuracy.

Drawings

FIG. 1 is a flowchart of the clustering-algorithm-based qualitative analysis method for chromatographic data impurities described in the examples;

FIG. 2 is a schematic diagram of the preliminary clustering described in the examples;

FIG. 3 is a schematic diagram of cluster reclassification as described in the examples.

Detailed Description

The following describes the present invention in detail. The present invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the invention, so that the invention is not limited to the specific embodiments disclosed.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

In this application, impurity control is typically implemented by a computer program. Taking the production of an OLED intermediate chemical as an example, the overall process is: impurity model setup, impurity model training, quality control of production raw materials, quality control of the production process, and quality control of finished products on warehousing. The clustering-algorithm-based qualitative analysis method for chromatographic data impurities is shown in FIG. 1.

Step one, calculation of relative retention time

Impurity data models are preset for the OLED intermediate and for its raw materials, and each is trained on 200 chromatograms (liquid or gas chromatograms, chosen according to the detection requirements of the chemical). The main peak is the peak with a peak area of 95% or more in the chromatogram; apart from the main peak, peaks with an area of more than 10 ppt are taken as impurity peaks in accordance with the operating instructions.

Because of the chromatographic instrument, the farther a peak lies from the main peak, the larger its time offset. The time offset coefficient coef therefore has to be calculated for every impurity peak (every peak other than the main peak) in all chromatogram data using formula (1); once coef is obtained, the relative retention time Rt of the impurity peak is calculated using formula (2):

$$\mathrm{Rt} = t - t \times \mathrm{coef} \qquad (2)$$

where coef is the time offset coefficient of the impurity peak, t is the retention time of the impurity peak, T is the retention time of the main peak, and Rt is the relative retention time of the impurity peak.

Step two, preliminary clustering

When a sample undergoes chromatographic inspection, the 200 most recent chromatograms of the same sample are retrieved in chronological order, the main peak and peaks with an area below 10 ppt are removed, and the impurity peaks of these chromatograms are clustered with the K-Means algorithm as follows:

The relative retention times of all impurity peaks are calculated with the formulas of step one.

A chromatogram sample consists of several impurity peaks; a chromatogram A consisting of peaks A1, A2, A3, and so on is written A{A1, A2, A3, ...}.

Among all chromatograms of the sample, the chromatogram data sample with the most impurity peaks is found (say D{D1, D2, D3, ...}), and the relative retention times of a random number of its peaks are selected as the initial cluster center retention times.

The preliminary classification proceeds as shown in FIG. 2. Another chromatogram data sample is selected (say E{E1, E2, E3, ...}), and the Euclidean distance from the relative retention time of each of its impurity peaks to each cluster center retention time in D{D1, D2, D3, ...} is calculated; each peak is assigned to the cluster whose center retention time is nearest. For example, if E1 and D1 are judged to belong to the same cluster, identified as B1, and the average retention time of D1 and E1 is 4.4, then the center retention time of cluster B1 is reset to 4.4. The other impurity peaks of D and E are processed in the same way to form the cluster set B{B1, B2, B3, ...}.

The above operation is repeated between every remaining chromatogram and B{B1, B2, B3, ...} until all chromatogram data have been processed, yielding the final cluster set.

Output: the cluster identifiers of all clusters in the cluster set and the center retention time of each cluster.
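The incremental assignment described in step two can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name is hypothetical, every peak of the richest chromatogram is used as a seed (rather than a random subset), and the center is reset to the running mean of all peaks assigned so far, which reproduces the D1/E1 example (4.5 and 4.3 average to 4.4).

```python
from typing import Dict, List


def preliminary_clustering(chromatograms: List[List[float]]) -> Dict[str, float]:
    """Map cluster identifier -> center relative retention time.

    Each inner list holds the impurity-peak relative retention times
    of one chromatogram.
    """
    # Seed the centers from the chromatogram with the most impurity peaks.
    seed_index = max(range(len(chromatograms)),
                     key=lambda i: len(chromatograms[i]))
    centers = list(chromatograms[seed_index])
    counts = [1] * len(centers)

    for i, peaks in enumerate(chromatograms):
        if i == seed_index:
            continue
        for rt in peaks:
            # In one dimension the Euclidean distance is |center - rt|.
            k = min(range(len(centers)), key=lambda j: abs(centers[j] - rt))
            counts[k] += 1
            # Assumption: reset the center to the running mean of its members.
            centers[k] += (rt - centers[k]) / counts[k]

    return {f"B{j + 1}": c for j, c in enumerate(centers)}
```

With chromatograms `[[4.3, 7.0], [4.5, 7.2, 9.0], [4.4]]`, the second chromatogram seeds three clusters, and the center of B1 moves from 4.5 to about 4.4 once the 4.3 peak is assigned to it.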

Step three, cluster reclassifying

Based on the preliminary clustering result of step two, the data with the most peak retention times in each cluster are clustered again, as follows:

As shown in FIG. 3, for each cluster in the preliminary clustering result, the chromatogram data sample with the most impurity peaks in that cluster is found. Suppose the preliminary clustering yields the cluster set L{L1, L2, L3, ...}; cluster L1 contains H1, H2, H3 of the chromatogram data sample H{H1, H2, H3, ...}, which has the most impurity peaks in L1, and cluster L2 contains J3 and J4 of the chromatogram data sample J{J1, J2, J3, ...}, which has the most impurity peaks in L2. Proceeding in this way gives the set {H1, H2, H3, J3, J4, ...} of the largest impurity-peak groups over all preliminary clusters L.
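One possible reading of this selection step is sketched below in Python. It assumes that "most impurity peaks" means the chromatogram contributing the most peaks to a given preliminary cluster, which matches the H/J example; the tuple layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (chromatogram id, peak label, preliminary cluster id)
Assignment = Tuple[str, str, str]


def representative_peaks(assignments: List[Assignment]) -> List[str]:
    """For each preliminary cluster, keep the peaks of the chromatogram
    that contributes the most peaks to that cluster."""
    by_cluster: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list))
    for chrom, peak, cluster in assignments:
        by_cluster[cluster][chrom].append(peak)

    selected: List[str] = []
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        # Ties go to the chromatogram seen first.
        richest = max(members, key=lambda c: len(members[c]))
        selected.extend(members[richest])
    return selected
```

If H contributes H1, H2, H3 to L1 while J contributes only J1, J2, and J contributes J3, J4 to L2 while H contributes only H4, the function returns H1, H2, H3, J3, J4. The relative retention times of these peaks could then serve as starting centers for the K-Means run that follows.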

K-Means clustering is then carried out on the 200 chromatograms based on the relative retention times of the impurity peaks to form the impurity model. During clustering, the areas of all impurity peaks in a cluster are averaged to give that cluster's average impurity-peak area, and the maximum and minimum relative retention times of all impurity peaks in each new cluster are taken as its impurity boundaries.

Output: the cluster identifier of each impurity peak in the impurity data model, the center retention time of each cluster, the average impurity-peak area of each cluster, and the impurity boundaries.
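The model-building step can be sketched as a one-dimensional K-Means (Lloyd's algorithm) in plain Python. This is a sketch under assumptions: the `ImpurityCluster` type, the function name, and the choice to pass the starting centers in explicitly are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImpurityCluster:
    cluster_id: str
    center: float                  # center relative retention time
    mean_area: float               # average area of the member peaks
    boundary: Tuple[float, float]  # (min Rt, max Rt) of the member peaks


def build_impurity_model(peaks: List[Tuple[float, float]],
                         initial_centers: List[float],
                         max_iter: int = 100) -> List[ImpurityCluster]:
    """peaks: (relative retention time, area) pairs from the training set."""
    centers = list(initial_centers)
    groups: List[List[Tuple[float, float]]] = []
    for _ in range(max_iter):
        # Assignment step: nearest center by 1-D Euclidean distance.
        groups = [[] for _c in centers]
        for rt, area in peaks:
            k = min(range(len(centers)), key=lambda j: abs(centers[j] - rt))
            groups[k].append((rt, area))
        # Update step: each center moves to the mean Rt of its members.
        new_centers = [sum(p[0] for p in g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers

    model: List[ImpurityCluster] = []
    for g, c in zip(groups, centers):
        if not g:
            continue  # drop centers that attracted no peaks
        rts = [p[0] for p in g]
        model.append(ImpurityCluster(
            cluster_id=f"A{len(model) + 1}",
            center=c,
            mean_area=sum(p[1] for p in g) / len(g),
            boundary=(min(rts), max(rts)),
        ))
    return model
```

Keeping the minimum and maximum member Rt as the boundary means a later peak is only accepted into a cluster if it falls inside the range actually observed during training.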

Step four, comparison of chromatogram data with the impurity data model

The chromatogram data are compared with the impurity data model as follows:

Let A{A1, A2, A3, ...} be the impurity model, each element of which has the attributes cluster identifier, center retention time, average area, and impurity boundaries. Let B{B1, B2, B3, ...} be the chromatogram data to be compared, each peak of which has the attributes relative retention time and area.

The Euclidean distance between the relative retention time of every impurity peak in B{B1, B2, B3, ...} and the center retention time of every cluster in A{A1, A2, A3, ...} is calculated.

Suppose B1 has the shortest Euclidean distance to the center retention time of A1 and the retention time of B1 lies within the impurity boundaries of A1. Then B1 is regarded as impurity A1, and if the area of B1 exceeds 120% of the average area of A1, an alarm is raised that the single-impurity area of B1 is too large.

Suppose the relative retention time of B2 has the shortest Euclidean distance to the center retention time of A2 but lies outside the impurity boundaries of A2. Then B2 is regarded as a new impurity and an alarm is raised.

All elements of B are compared with all elements of A in this way and labeled with cluster identifiers.

Output: the cluster identifier, relative retention time, and area of each impurity peak, and the impurity count of the chromatogram.
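The decision rules of step four can be sketched in Python as below. The `Cluster` type, the function name, and the alarm strings are hypothetical; the 120% threshold and the boundary test follow the description above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cluster(NamedTuple):
    cluster_id: str
    center: float                  # center relative retention time
    mean_area: float               # average impurity-peak area
    boundary: Tuple[float, float]  # (min Rt, max Rt)


def classify_peak(rt: float, area: float, model: List[Cluster],
                  area_ratio: float = 1.2
                  ) -> Tuple[Optional[str], Optional[str]]:
    """Return (cluster id or None, alarm text or None) for one peak."""
    # Nearest cluster center by 1-D Euclidean distance.
    nearest = min(model, key=lambda c: abs(c.center - rt))
    low, high = nearest.boundary
    if not (low <= rt <= high):
        return None, "new impurity"
    if area > area_ratio * nearest.mean_area:
        return nearest.cluster_id, "single impurity area too large"
    return nearest.cluster_id, None
```

For a model with A1 (center 4.4, average area 110, boundary 4.3 to 4.5), a peak at Rt 4.35 with area 150 is labeled A1 and flagged because 150 exceeds 132, while a peak whose nearest cluster does not contain it within its boundary is reported as a new impurity.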

Specific application scenarios are as follows:

1. Quality control of production raw materials. Before raw materials are warehoused, a raw material sample undergoes chromatographic inspection to obtain chromatogram data, and the chromatogram data of the three batches inspected before the current sample are retrieved. Steps two and three are carried out on the chromatogram data of the 200 batches of samples inspected before these four batches, in chronological order, to obtain the impurity data model, and step four is carried out on each of the four batches against the impurity data model. An alarm is raised when a new impurity is found; when an impurity exceeds 120% of the average area of the corresponding impurity in the impurity model; when the impurity count of the current chromatogram is larger than that of any one of the previous three batches; or when an impurity of the current chromatogram is new relative to the impurities of the previous three batches.

2. Quality control of the production process. A sample from a given process step in production undergoes chromatographic inspection to obtain chromatogram data.

Horizontal comparison and early warning: the chromatogram data of the current sample and of the similar samples in the same order are retrieved, 200 sample chromatograms from other orders at the same process step are retrieved in chronological order, and steps two and three are executed to obtain the impurity data model. Step four is then executed for the current sample and the other samples of the same order at the same process step against the impurity data model. An alarm is raised when a new impurity is found; when an impurity exceeds 120% of the average area of the corresponding impurity in the impurity model; when the impurity count of the current chromatogram is larger than that of any one of the other similar samples of the same order at the same process step; or when an impurity of the current chromatogram is new relative to the impurities of those similar samples.

Longitudinal comparison and early warning: the comparison result against the impurity data model for the sample from the process step preceding the current one is retrieved, and an early warning is issued if the current sample has more impurities than the sample from the preceding step. In chemical production, the reaction is followed by several post-treatment steps that remove impurities, so the impurity count should decrease step by step. If a sample from a later step has more impurities than the sample from the preceding step, something may have gone wrong, and the early warning reminds staff to check the production process and the product.

3. Quality control of finished products follows the same procedure as quality control of production raw materials.
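The batch-to-batch and longitudinal checks in the application scenarios above can be sketched in Python as follows. The function names, the alarm wording, and the process step names in the usage example are hypothetical; each chromatogram is assumed to have already been labeled with cluster identifiers by step four, with `None` marking a peak that matched no cluster of the impurity model.

```python
from typing import List, Optional, Set, Tuple


def compare_with_previous_batches(
        current: List[Optional[str]],
        previous: List[List[Optional[str]]]) -> List[str]:
    """Alarms for the current chromatogram relative to earlier batches."""
    alarms: List[str] = []
    # More impurities than any one of the earlier batches.
    if any(len(current) > len(batch) for batch in previous):
        alarms.append("impurity count increased")
    # Impurities never seen in the earlier batches.
    seen: Set[str] = {cid for batch in previous
                      for cid in batch if cid is not None}
    for cid in current:
        if cid is not None and cid not in seen:
            alarms.append(f"{cid} is new relative to the previous batches")
    return alarms


def longitudinal_warnings(counts: List[Tuple[str, int]]) -> List[str]:
    """counts: (process step name, impurity count) in processing order."""
    warnings: List[str] = []
    for (prev_step, prev_n), (step, n) in zip(counts, counts[1:]):
        # Post-treatment should remove impurities, so a rise is suspicious.
        if n > prev_n:
            warnings.append(
                f"{step}: {n} impurities, up from {prev_n} after {prev_step}")
    return warnings
```

For example, `longitudinal_warnings([("reaction", 6), ("wash", 4), ("recrystallization", 5)])` yields a single warning for the last step, because its count rises from 4 to 5.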

The method was applied to a large volume of chromatographic data samples from a leading domestic enterprise producing OLED intermediates. An impurity data model trained on 12000 chromatograms of 60 samples agreed with the retention times in the enterprise's routinely compiled impurity classification reports with an accuracy of 98.3%; another 0.9% were also within the range specified by the enterprise; and 16 impurities that the customer had not counted were found, which exceeded the expectations of the quality control personnel. Comparing 200 chromatograms of one substance with the impurity data model gave results 94.5% similar to the enterprise's routine comparison records, and another 3% were mistakes by quality control personnel or peaks they had not recorded as impurities. Analysis of the results shows the method to be reliable and effective; the small remaining inaccuracy is attributed to excessive deviation of the enterprise's chromatographs and to differences in inspection practice among inspectors.

By classifying peaks and comparing chromatograms from the chromatogram data itself, the method addresses the low efficiency and errors of manual discrimination, improves working efficiency, and reduces labor.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined in the appended claims.

Claims

1. The qualitative analysis method for chromatographic data impurities based on the clustering algorithm is characterized by comprising the following steps of:

S1. Calculating relative retention time

S2. Preliminary clustering

S3. Cluster reclassification

S4. Comparing new chromatogram data with the impurity data model

2. The qualitative analysis method of chromatographic data according to claim 1, wherein in step S1, the relative retention time of the impurity peak is calculated according to the following formulas (1) - (2):

$$\mathrm{Rt} = t - t \times \mathrm{coef} \qquad (2)$$

3. The qualitative analysis method of chromatographic data impurities based on a clustering algorithm according to claim 2, wherein the main peak is a peak with a peak area of more than or equal to 95% in a chromatogram; the impurity peak is a peak with a peak area of more than or equal to 10ppt after the main peak is removed from the chromatogram.

4. The qualitative analysis method of chromatographic data impurities based on a clustering algorithm according to claim 1, wherein in step S4, when the relative retention time of the peak to be judged has the shortest Euclidean distance to the center retention time of a given cluster in the impurity model and falls within the impurity boundaries of that cluster, an alarm is raised that the single-impurity area of the peak is too large if the area of the peak exceeds 120% of the average impurity-peak area of that cluster.

5. The qualitative analysis method of chromatographic data impurities based on a clustering algorithm according to claim 1, wherein in step S4, when the relative retention time of the peak to be judged has the shortest Euclidean distance to the center retention time of a given cluster in the impurity model but lies outside the impurity boundaries of that cluster, the peak is regarded as a new impurity and an alarm is raised.

6. The application of the qualitative analysis method of chromatographic data impurities based on a clustering algorithm according to claims 1-5, wherein the method is applied to analysis of impurity peaks in chromatograms in chemical production or research and development processes, and is used for confirming whether the number of impurities is increased or the area of single impurities is increased.

7. The application of the qualitative analysis method for chromatographic data impurities based on a clustering algorithm according to claim 6, wherein the method is applied to the quality control of production raw materials, the quality control of production processes and the quality control of finished products.