CN117313899B - Method, apparatus and medium for data processing


Info

Publication number
CN117313899B
CN117313899B
Authority
CN
China
Prior art keywords
samples
sample
training
candidate
training sample
Prior art date
Legal status
Active
Application number
CN202311597853.3A
Other languages
Chinese (zh)
Other versions
CN117313899A
Inventor
Name withheld at the inventor's request
Current Assignee
Advanced Manufacturing EDA Co Ltd
Original Assignee
Advanced Manufacturing EDA Co Ltd
Priority date
Filing date
Publication date
Application filed by Advanced Manufacturing EDA Co Ltd filed Critical Advanced Manufacturing EDA Co Ltd
Priority to CN202311597853.3A priority Critical patent/CN117313899B/en
Publication of CN117313899A publication Critical patent/CN117313899A/en
Application granted granted Critical
Publication of CN117313899B publication Critical patent/CN117313899B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

Methods, apparatuses, and media for data processing are provided according to example embodiments of the present disclosure. In the method, a training sample set for a machine learning model is obtained. Each training sample is associated with at least one of a design or a measurement of an integrated circuit. The method further includes generating, based on the respective labels and respective features of the training samples, an abnormal sample detection result that indicates at least a plurality of candidate samples in the training sample set, each candidate sample being a candidate for an abnormal sample having a wrong label. The method further includes determining, based on the abnormal sample detection result, one or more target samples from the plurality of candidate samples for verifying whether they have wrong labels. The method further includes performing an update operation associated with the training sample set based on the respective verification results for the target samples. In this way, abnormal samples in the training sample set may be reduced, thereby improving the training effect of the machine learning model.

Description

Method, apparatus and medium for data processing
Technical Field
Embodiments of the present disclosure relate generally to the field of data processing and, more particularly, to methods, apparatuses, and media for data processing.
Background
Using a machine learning model for prediction and analysis can greatly improve the efficiency of problem solving. A machine learning model has many parameters that need to be configured, such as the underlying model parameters and the way the data is processed. After model training is completed, the results also need to be analyzed to achieve the desired model performance. Currently, machine learning models have been applied to a number of stages of integrated circuit design, fabrication, and so on. Before a machine learning model is applied to these integrated-circuit-related stages, or in order to update an already applied machine learning model, the model needs to be trained with data related to the integrated circuit. The training data affects the performance of the machine learning model.
Disclosure of Invention
In a first aspect of the present disclosure, a method for data processing is provided. The method comprises the following steps: obtaining a training sample set for a machine learning model, each training sample in the training sample set being associated with at least one of a design or a measurement of an integrated circuit; generating an abnormal sample detection result based on the respective labels and respective features of the training samples in the training sample set, the abnormal sample detection result indicating at least a plurality of candidate samples in the training sample set, each candidate sample being a candidate for an abnormal sample having a wrong label; determining, based on the abnormal sample detection result, one or more target samples from the plurality of candidate samples for verifying whether the one or more target samples have wrong labels; and performing an update operation associated with the training sample set based on respective verification results for the one or more target samples.
In a second aspect of the present disclosure, an electronic device is provided. The electronic device includes a processing unit, and a memory coupled to the processing unit. The memory has instructions stored therein which, when executed by the processing unit, cause the electronic device to perform a method for data processing according to the first aspect of the present disclosure.
In a third aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has a computer program stored thereon. The computer program, when executed by a processor, implements a method for data processing according to the first aspect of the present disclosure.
According to embodiments of the present disclosure, abnormal samples in the training sample set can be reduced, thereby improving the training effect of the machine learning model. Thus, embodiments of the present disclosure help reduce the difficulty and complexity of using machine learning models for users in the integrated circuit field.
It should be understood that this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, like or similar reference numerals designate like or similar elements, in which:
FIG. 1A illustrates an example flow using a machine learning model;
FIG. 1B illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a schematic diagram of one example of an architecture for data processing, according to some embodiments of the present disclosure;
FIG. 3 illustrates a flowchart of one example of a method for data processing according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of an example data processing process including abnormal sample detection and verification, according to some embodiments of the present disclosure;
FIG. 5 illustrates a schematic diagram of calculating anomaly scores according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic diagram of selecting a target sample for verification in accordance with some embodiments of the present disclosure;
FIG. 7 illustrates a flow chart of a method for data processing according to some embodiments of the present disclosure;
FIG. 8 illustrates a flow chart of another method for data processing according to some embodiments of the present disclosure; and
FIG. 9 illustrates a block diagram of an electronic device in which one or more embodiments of the present disclosure may be implemented.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and the like should be taken to be open-ended, i.e., "including, but not limited to." The term "based on" should be understood as "based at least in part on." The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment." The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As used herein, the term "model" refers to a construct that can learn the association between inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs by using multiple layers of processing units. A "model" may also be referred to herein as a "machine learning model," a "machine learning network," or a "network," and these terms are used interchangeably herein. A model may in turn comprise different types of processing units or networks.
As briefly mentioned above, machine learning models have been applied to a number of stages of integrated circuit design and manufacturing. In order to apply a machine learning model, it is necessary to configure parameters and to analyze the training effect, among other things. FIG. 1A illustrates an overall example flow of using a machine learning model. In the example of FIG. 1A, a machine learning model is used for defect detection in integrated circuit fabrication. In particular, the Defect Management System (DMS) 102 may acquire data from integrated circuit production and test lines, such as images of wafers that may or may not contain defects (such as scanning electron microscopy images). In this way, data 101 for training and validation of the machine learning model may be obtained. The data 101 is further divided into training data and validation data. The training data is used for the model training performed at block 103. At block 104, the trained model is validated using the validation data to obtain the performance of the model. At block 105, the performance is checked to determine whether it meets the requirements. If the check fails, i.e., the performance is not satisfactory, the model may be adjusted and/or the data 101 inspected. If the check passes, i.e., the performance meets the requirements, model deployment is performed at block 106.
During model deployment, wafer images may be processed with the model to detect whether there are defects therein, the types of the defects, and the like. During deployment, DMS 102 may monitor the performance of the model. The performance of integrated circuit manufacturing equipment may vary over time, causing the defects that occur to vary over time as well. Accumulation of such changes over time may cause the model performance to drop to an unacceptable level, thereby triggering retraining of the model, i.e., repetition of the process described above.
From the above description, it can be seen that there are a number of different stages when using machine learning models in the integrated circuit field. A user familiar with integrated circuits may not have the ability to configure the parameters of a machine learning model or analyze its results. For the different problems that occur at different stages, the user may not have a ready solution, which affects the performance of subsequent tasks. For example, training data for a machine learning model is typically provided by the user, but the user may not be able to determine whether the provided training data is suitable for training. After model training is completed, if the model performance does not reach the expected level, the user does not know the main cause of the poor training effect. On the other hand, the developer of the machine learning model generally does not have integrated circuit expertise, and thus it is also difficult for the developer to determine whether the training data provided by the user is suitable for training. This gap between the two domains of knowledge increases the difficulty of using machine learning models for users in the integrated circuit field and may even deter such users.
To this end, embodiments of the present disclosure propose a method for data processing. According to an embodiment of the present disclosure, a training sample set for a machine learning model is obtained. Each training sample in the training sample set is associated with a design and/or a measurement of an integrated circuit. Further, based on the respective labels and respective features of the training samples, an abnormal sample detection result is generated that indicates at least a plurality of candidate samples in the training sample set, each candidate sample being a candidate for an abnormal sample having a wrong label. Further, based on the abnormal sample detection result, one or more target samples are determined from the plurality of candidate samples for verifying whether they have wrong labels. Based on the respective verification results for the target samples, an update operation associated with the training sample set is performed. In this way, abnormal samples in the training sample set may be reduced, thereby improving the training effect of the machine learning model. Thus, embodiments of the present disclosure help reduce the difficulty and complexity of using machine learning models for users in the integrated circuit field.
Various example implementations of this scheme will be described in detail below with reference to the accompanying drawings.
Referring first to FIG. 1B, a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented is shown. The example environment 100 may generally include a data processing device 110. In some embodiments, data processing device 110 may be a computing-enabled device such as a personal computer, workstation, server, or the like. The scope of the present disclosure is not limited in this respect.
The data processing device 110 takes a training sample set 120 as input. The training sample set 120 includes a plurality of training samples for training a machine learning model 130 (which may also be referred to simply as a model). The training samples may include various suitable types of data or combinations thereof, such as numerical data and images. In embodiments of the present disclosure, a training sample is a sample associated with an integrated circuit, which may include data from any stage of integrated circuit design, manufacturing, and so on.
In some embodiments, each training sample in the training sample set 120 may be associated with a design scheme of an integrated circuit. Design schemes include, for example, layout design schemes, process design schemes, quality inspection design schemes, and the like. Reasonable training samples can effectively enhance the performance of the machine learning model 130. Thus, during the design phase of a scheme, it is necessary to examine the training samples associated with the design scheme to find whether they contain factors that would degrade the performance of the model.
In some embodiments, the training samples may include a circuit layout. A circuit layout (simply referred to as a layout) is a series of geometric figures converted from a designed and simulation-optimized circuit, and contains physical information data related to the devices, such as integrated circuit dimensions and the topology definitions of the various layers. Providing the training sample set 120 including layouts to the machine learning model 130 makes it possible to detect hotspots, such as pinching (broken lines) or bridging, in an integrated circuit layout using the machine learning model 130. For example, the machine learning model 130 may be trained with sample layouts that do not have hotspots (also referred to as positive sample layouts) and sample layouts that include hotspots (also referred to as negative sample layouts). The trained machine learning model 130 may be used to detect hotspots in a layout.
Additionally or alternatively, in some embodiments, each training sample in the training sample set 120 may be associated with a measurement of an integrated circuit. The measurements include, for example, captured images of manufactured chips or measured values. In such embodiments, the machine learning model 130 may be configured to process the measurements to achieve a desired task, such as deriving parameter values or classification. As an example, the machine learning model 130 may be used to detect whether there is a defect in a captured image or whether a parameter value satisfies a requirement, thereby guiding scheme design, process manufacturing, and the like.
In some embodiments, the training samples may include images of manufactured wafers, i.e., captured images. A wafer is the silicon substrate used in fabricating silicon semiconductor integrated circuits, and may also be referred to as a slice or silicon wafer. A finished chip is obtained from the wafer after a series of complex process steps such as photolithography, transistor fabrication, dicing, testing, and packaging. As an example, providing images of manufactured wafers to the machine learning model 130 makes it possible to detect defects in a wafer using the machine learning model 130. Such images include, but are not limited to, optical microscopy images, scanning electron microscopy images, and the like. For example, the machine learning model 130 may be trained with sample images that include defects of different shapes, sizes, colors, locations, densities, and so on. The machine learning model 130 so trained may be used to identify and locate defects in a wafer. As another example, during photolithography, there is some distortion and deviation between the photolithographic pattern on the wafer and the layout pattern due to the diffraction and interference of light. Such deviations directly affect the performance and production yield of the integrated circuit. The machine learning model 130 may be used to detect deviations between the lithography pattern and the layout pattern.
In some embodiments, the training samples may include manufacturing operation data or semiconductor quality data. For example, the training samples may include measurement data obtained from a wafer fabrication line. As another example, the training samples may include data obtained from a wafer fabrication line with labels indicating whether the measurements are acceptable.
It should be understood that the above are only examples of training samples and are not intended to be limiting in any way. In embodiments of the present disclosure, training samples may include data related to various stages of integrated circuit design, fabrication, testing, and so on.
The data processing device 110 is configured to examine the training sample set 120 and generate a detection result. This is described in further detail below in connection with FIGS. 2-4.
Generally, machine learning may include three phases: a training phase, a testing phase, and an inference phase. In the training phase, a given model may be trained using a large number of training samples, iterating until the model can consistently obtain, from the training samples, inferences that meet the desired goal. Through training, the model may be considered able to learn the input-to-output correlation from the training samples. In the testing phase, test inputs are applied to the trained model to test whether the model can provide the correct outputs, thereby determining the performance of the model. In the inference phase, the model may be used to process actual inputs based on the trained parameter values and determine the corresponding outputs. Embodiments of the present disclosure may examine the training sample set 120 of the machine learning model 130 at one or more stages of machine learning, for example automatically or as directed, to minimize factors that degrade model performance and to generate detection results that help cross-domain users utilize the model or locate problems affecting model performance.
FIG. 2 illustrates a schematic diagram of an example architecture 200 for data processing, according to some embodiments of the present disclosure. In some embodiments, the example architecture 200 may be included in or implemented by a data processing device 110 as shown in FIG. 1B. It should be understood that example architecture 200 may also include additional modules not shown and/or may omit certain module(s) shown, the scope of the present disclosure being not limited in this respect.
The example architecture 200 includes at least a factor detection module 210 and a detection result generation module 220. The factor detection module 210 is configured to detect factors in the training sample set 120 that reduce the performance of the machine learning model 130. The detected factors that reduce the performance of the machine learning model 130 are also referred to herein as latent factors or target factors. The latent factors may relate to individual training samples, groups of training samples, labels of training samples, and the like, i.e., various factors that may adversely impact the training effect of the machine learning model 130. The detection result generation module 220 generates a sample detection result based on the factors detected by the factor detection module 210. The sample detection result indicates at least recommended processing for the training sample set 120.
The inputs to the factor detection module 210 include the training sample set 120 for the machine learning model 130. The training samples in the training sample set 120 are as described above with reference to FIG. 1B. The factor detection module 210 may detect factors in the training sample set 120 that reduce the performance of the machine learning model 130 based on the respective labels and/or respective features of the training samples. Depending on the business scenario in which the machine learning model 130 is used, it may be determined whether the problem to be solved is a classification problem or a regression problem. For example, for a classification problem, the output of the machine learning model 130 is a discrete numerical type representing a class, so the labels of the corresponding training samples may be classes. As another example, for a regression problem, the output of the machine learning model 130 is a continuous numerical type representing a predicted value, so the labels of the corresponding training samples may be numerical values. Further, the features of the training samples may be vectorized representations of the training samples, which may be determined using, for example, any suitable feature extraction model or embedding scheme. The factor detection module 210 may determine whether there are factors such as imbalance, missing values, or outliers in the training sample set 120 based on the labels or features of the training samples.
Latent factors that may reduce the training effect of the machine learning model 130 may include a sample imbalance factor. A sample imbalance factor may refer to a low proportion of a group of training samples (e.g., samples of the same class) in the training sample set 120, such as a proportion below a threshold proportion. To detect such sample imbalance factors, in some embodiments, the factor detection module 210 may include a sample imbalance detection unit 212. As an example, the various categories to be determined by a machine learning model for a classification task may occur with roughly equal frequency in practice. In this case, using an unbalanced training sample set during the training phase would reduce the performance of the machine learning model. The sample imbalance detection unit 212 may be configured to detect whether the training sample set 120 has a class imbalance problem. For example, the distribution of the training samples may be counted by label. If the number of training samples differs significantly between different classes or between different ranges of values, a sample imbalance problem may exist.
In some embodiments, the sample imbalance detection unit 212 may divide the training samples into multiple groups of training samples based on their labels. Training samples in the same group have matching labels. In the case where the labels are of a discrete type, matching labels are identical labels. For example, if the labels are categories, the training samples in the same group are of the same category. In the case where the labels are of a continuous type, matching labels are labels having similar values (e.g., differing by less than a threshold). For example, if the labels are numerical values, the differences between the label values of training samples in the same group are small. The value used as a label may be regarded as a ground truth, which may be obtained in any suitable way, for example measured in practice or annotated manually in advance. Further, the sample imbalance detection unit 212 may determine the proportion of each group of training samples in the training sample set 120. If the proportion of a certain group of training samples is less than a threshold proportion, the sample imbalance detection unit 212 determines that a sample imbalance factor is detected. In such embodiments, each group of training samples has an explicit common or similar attribute, such as being of the same class or having similar values.
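Illustratively, this label-based imbalance check may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation; the function name and the 1/5 threshold proportion are illustrative choices.

```python
from collections import Counter

def detect_sample_imbalance(labels, threshold_proportion=0.2):
    """Group training samples by label and flag under-represented groups.

    labels: one discrete label per training sample.
    Returns each label's proportion in the training sample set and the set
    of labels whose proportion is below the threshold (imbalance factors).
    """
    counts = Counter(labels)
    total = len(labels)
    proportions = {label: count / total for label, count in counts.items()}
    under_represented = {label for label, p in proportions.items()
                         if p < threshold_proportion}
    return proportions, under_represented

# Example mirroring the group A / group B illustration below:
# 1 sample in 10 has label "A" (layouts with hotspots), 9 have label "B".
labels = ["A"] + ["B"] * 9
proportions, flagged = detect_sample_imbalance(labels, threshold_proportion=0.2)
print(proportions)  # {'A': 0.1, 'B': 0.9}
print(flagged)      # {'A'} -- a sample imbalance factor is detected
```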
In some embodiments, the sample imbalance detection unit 212 may cluster the training samples based on their respective features, and each cluster obtained by the clustering may be regarded as a group. Training samples in the same group have similar features. If the proportion of a group of training samples in the training sample set 120 is less than a threshold proportion, the sample imbalance detection unit 212 determines that a sample imbalance factor is detected. Compared with the label-based grouping embodiments described above, this is an implicit grouping scheme: each group of training samples may not have an explicit common or similar attribute.
If a sample imbalance factor is detected, the detection result generation module 220 may generate first recommendation information based on the detected sample imbalance factor. The first recommendation information may provide, as part of the sample detection result, operational advice on processing the training sample set 120, in particular how to process the groups with few samples. If the proportion of a group of training samples in the training sample set 120 is less than a threshold proportion, the first recommendation information may indicate that the number or proportion of samples of that group should be increased. In some embodiments, in addition to the first recommendation information, reference information regarding the sample imbalance may be generated as part of the sample detection result. For example, the reference information may include an indication of the class imbalance ratio and the degree of imbalance (such as high, medium, or low).
As one example, the training sample set 120 includes training samples having a class A and a class B. For example, a sample of class A may be a layout with hotspots, while a sample of class B may be a layout without hotspots. After grouping by class, group A includes the training samples with class A and group B includes the training samples with class B. The proportion of the training samples in group A in the training sample set 120 is 1/10, while the proportion of the training samples in group B is 9/10. If the threshold proportion is 1/5, the proportion of group A is less than the threshold proportion, and the sample imbalance detection unit 212 determines that a sample imbalance factor is detected. Accordingly, the detection result generation module 220 generates first recommendation information. Such first recommendation information may indicate that the number of training samples in group A should be increased to exceed the threshold proportion or until it equals the number of training samples in group B.
In some embodiments, the first recommendation information may also indicate specific operational recommendations that reduce or eliminate the sample imbalance factor. For example, data augmentation operations may be indicated based on the distribution of the training samples, such as random undersampling, random oversampling, cluster-based undersampling, synthetic data augmentation, and so forth.
Latent factors that may reduce the training effect of the machine learning model 130 may include abnormal samples. To this end, in some embodiments, the factor detection module 210 includes an abnormal sample detection unit 214. The abnormal samples in the training sample set 120 may be regarded as outliers or isolated points, which may have abnormal labels, abnormal values, or abnormal features, among others. For example, the labels or features of abnormal samples are inconsistent with those of most training samples and exhibit "abnormal" characteristics. The abnormal sample detection unit 214 may be configured to detect whether abnormal samples exist in the training sample set 120. For example, abnormal sample detection may be based on the distribution of the training samples, the differences between samples, clustering, and so on.
In some embodiments, the abnormal sample detection unit 214 may perform feature extraction on each training sample to obtain its corresponding features. As an example, if the training samples are design schemes such as circuit layouts, a neural network (e.g., a convolutional neural network or a recurrent neural network) may be utilized to generate feature representations of the training samples. In such examples, the neural network has been trained, using any suitable means, to extract features from a design scheme such as a circuit layout. As another example, if the training samples are numerical measurements, an encoder may be utilized to convert the measurements into vectorized representations (e.g., vectors). For example, the encoder may map the values in the measurements into a predefined feature space, thereby generating a vectorized representation of the measurements as their features. The individual training samples may then be clustered based on the features, and training samples that cannot be clustered may be determined to be abnormal samples.
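Illustratively, the clustering-based detection may be sketched as follows, assuming feature vectors have already been extracted. DBSCAN is one possible clustering choice, not one prescribed above, and its parameters here are illustrative; it labels points that cannot be assigned to any cluster as -1, which corresponds to "training samples that cannot be clustered."

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_outliers_by_clustering(features, eps=0.5, min_samples=5):
    """Cluster training-sample features; samples that cannot be clustered
    (DBSCAN noise points, labeled -1) are treated as abnormal samples."""
    features = np.asarray(features)
    clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(features)
    return np.flatnonzero(clustering.labels_ == -1)  # indices of outliers

# Tight cluster of "normal" feature vectors plus one far-away point.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
                      np.full((1, 4), 5.0)])
print(detect_outliers_by_clustering(features))  # [20]
```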
In some embodiments, abnormal sample detection may be performed using the labels of the training samples. The abnormal sample detection unit 214 may determine a set of training samples having matching labels based on the labels of the individual training samples in the training sample set 120. Further, the abnormal sample detection unit 214 may determine the feature difference between each pair of training samples in the set. For example, a feature extraction model may be utilized to generate features for each training sample, and the feature differences of each pair of training samples may then be calculated. Still further, the abnormal sample detection unit 214 may determine abnormal samples from the set of training samples based on feature differences exceeding a threshold difference. For example, a training sample may be detected as an abnormal sample if the feature differences between that training sample and a number of other training samples in the same set all exceed the threshold difference. In general, samples with matching labels have similar features, so abnormal samples can be determined more accurately using both the labels and the features of the training samples.
Illustratively, sample A has the same label as some other training samples and is divided into the same set of training samples. The abnormal sample detection unit 214 determines the feature difference between each pair of training samples in the set. If the feature differences between sample A and more than a threshold number of other samples exceed the threshold difference, sample A may be detected as an abnormal sample, and the abnormal sample detection unit 214 determines that an abnormal sample factor is detected.
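Illustratively, this within-group check may be sketched as follows. The Euclidean distance metric, both thresholds, and the majority default are illustrative assumptions.

```python
import numpy as np

def detect_abnormal_in_label_group(features, threshold_difference=1.0,
                                   threshold_count=None):
    """Flag samples whose feature distance to many same-label peers is large.

    features: feature vectors for one group of samples that all share a
    matching label. A sample is flagged when its distance to more than
    `threshold_count` other samples exceeds `threshold_difference`.
    """
    features = np.asarray(features)
    n = len(features)
    if threshold_count is None:
        threshold_count = n // 2  # assume "a number of other samples" = majority
    # Pairwise Euclidean distances between all samples in the group.
    diffs = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # For each sample, count peers farther away than the threshold difference.
    far_counts = (diffs > threshold_difference).sum(axis=1)
    return np.flatnonzero(far_counts > threshold_count)

# Group with one mislabeled sample whose features differ from the rest.
group = np.vstack([np.random.default_rng(1).normal(0, 0.1, size=(9, 3)),
                   [[3.0, 3.0, 3.0]]])  # "sample A" with a wrong label
print(detect_abnormal_in_label_group(group))  # [9]
```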
In some scenarios, the abnormal sample detection unit 214 is capable of detecting hotspot layouts, for example by means of multiple types of hotspot layouts provided in advance, by graph comparison, by neural network recognition, and the like. In some cases, the labels of one or more training samples may be wrong; for example, a layout without a hotspot is incorrectly labeled as having a hotspot. In this case, a training sample with a wrong label may differ greatly from the other samples in the same group and thus be detected as an abnormal sample. In this way, samples with label errors can be advantageously identified.
In some embodiments, the abnormal sample detection unit 214 may group the individual training samples based on their features, each group of training samples having matching or similar features. In general, most or all of the training samples divided into the same group have matching labels. If a training sample has a label that does not match those of the majority of training samples in the same group, that training sample may be detected as an abnormal sample. For example, the abnormal sample detection unit 214 may cluster the training samples based on their respective features. Theoretically, training samples with matching labels will be clustered into the same cluster. If one or some of the training samples have labels that do not match those of the other training samples in the same cluster, the abnormal sample detection unit 214 determines that they are abnormal samples.
If an abnormal sample is detected, the detection result generation module 220 may generate second recommendation information based on the detection. The second recommendation information may provide, as part of the sample detection result, an operational suggestion for the abnormal sample. For example, the second recommendation information may indicate that the abnormal sample should be removed from the training sample set 120 or corrected.
Additionally, in some embodiments, the degree of abnormality, or the likelihood that a sample detected as abnormal really is abnormal, may also be determined, such as the anomaly score described below, which indicates the probability that the sample has a wrong label. In this case, the second recommendation information may further depend on the degree or likelihood of abnormality of the sample. For example, for a training sample that is highly likely to be an abnormal sample (e.g., has a high anomaly score), the second recommendation information may indicate that the training sample should be removed. In contrast, for a training sample that is less likely to be an abnormal sample (e.g., has a lower anomaly score), the second recommendation information may indicate that the training sample should be corrected. In such embodiments, refined handling of abnormal samples is implemented to help the user better improve the training data. This is advantageous for further reducing the difficulty and complexity of use for the user.
In some embodiments, if an abnormal sample is detected, the sample detection result generated by the detection result generation module 220 may include descriptive information about the abnormal sample. The descriptive information may include information in the form of images, values, text, and the like. For example, a list of abnormal samples may be generated, which may include, for each abnormal sample, related information such as a sequence number and the likelihood of its being an abnormal sample. Alternatively or additionally, the sample detection result may include descriptive information about at least one "normal" training sample from the same group as the abnormal sample.
Illustratively, each training sample may be a circuit layout. The sample detection result may include the circuit layout detected as an abnormal sample and the likelihood of its being abnormal. The sample detection result may also include circuit layouts in the same group that are not detected as abnormal samples (i.e., normal samples). In this way, the differences between the abnormal sample and the normal samples can be presented to the user in an intuitive manner, making it easier for the user to correct the detected or potential abnormal samples, and so on.
Units that may be included in the factor detection module 210 are described above by way of a number of example embodiments. It should be understood that the types and numbers of units contained in the factor detection module 210 are merely exemplary and do not limit the present disclosure. For example, the factor detection module 210 may include the sample imbalance detection unit 212, the abnormal sample detection unit 214, and any other suitable detection units, thereby minimizing the factors that degrade model performance. As another example, the factor detection module 210 may further include a unit for missing value detection, so that the detection result generation module 220 may indicate that training samples should be completed. As yet another example, the factor detection module 210 may further include a plurality of abnormal sample detection units 214, each adopting a different detection algorithm, thereby improving the accuracy and efficiency of abnormal sample detection.
As shown in FIG. 2, in some embodiments, the architecture 200 may also include a verification module 230. The verification module 230 is configured to determine, from the detected factors, the factors for further verification, that is, for verifying whether the detection result is correct. Such verification may be performed manually, for example by an expert. In some embodiments, through such verification, processing recommendations for the detected factors may also be obtained, such as altering the labels of abnormal samples or augmenting the samples with a certain kind of label.
The verification module 230 is further configured to provide feedback to the detection result generation module 220 and/or the factor detection module 210 based on the verification result. For example, if the verification result indicates that the detection result is correct, the feedback provided is positive; if the verification result indicates that the detection result is incorrect, the feedback provided is negative. In this way, factors that reduce the effect of model training, such as abnormal samples, can be detected correctly.
In addition, detection can be performed in different modes at different stages of machine learning, so that the user obtains a relatively clear adjustment direction at the corresponding stage, improving model training efficiency. In some embodiments, the sample detection described above may be performed before the machine learning model 130 is trained using the training sample set 120. In some embodiments, sample detection may be performed after the machine learning model 130 is trained. For example, if the performance of the trained machine learning model 130 is below a predetermined performance (which means that the training effect is poor), sample detection may be performed. In particular, in some embodiments, sample detection may be performed in response to the performance of the trained machine learning model 130 being below a predetermined performance while no sample imbalance or abnormal samples have yet been detected. In order to understand the sample detection scheme of the embodiments of the present disclosure more clearly, the points in time at which detection tasks are performed are described below in conjunction with a specific example.
Fig. 3 illustrates a flowchart of an example method 300 for data processing, according to some embodiments of the present disclosure. For example, the method 300 may be performed by the data processing device 110 as shown in FIG. 1B. The method 300 is described below in connection with fig. 1B. It should be understood that method 300 may also include additional blocks not shown and/or that certain blocks shown may be omitted. The scope of the present disclosure is not limited in this respect.
At block 310, the training samples are examined before the machine learning model is trained. Such detection is performed before the machine learning model 130 is trained using the training sample set 120. For example, before training, it may be detected whether the training samples are balanced. If it is determined that a sample imbalance factor is detected, the under-represented categories of training samples may be listed and data augmentation suggested. As another example, before training, it may be detected whether abnormal samples are present in the training sample set 120. If abnormal samples are detected, a list of abnormal samples may be generated. The list may indicate the abnormal samples, and may further indicate the training samples suspected of being abnormal and their likelihoods. Further, the data processing device 110 may suggest processing for the abnormal samples or the training samples suspected of being abnormal, such as removing them from the training sample set 120 or correcting them.
At block 320, inference is performed using the trained machine learning model 130 to determine the training effect. For example, in the testing phase, the machine learning model 130 is utilized to perform inference on labeled test samples. Based on the inference results, the training effect, i.e., the performance of the trained machine learning model, can be analyzed.
At block 330, it is determined whether the training effect meets the requirements. The evaluation of the training effect may be an overall evaluation. For example, for a multi-classification task, the requirement for the training effect or model performance may be that the average accuracy across all classes is greater than a threshold. Alternatively or additionally, the evaluation of the training effect may be further refined. For example, for a multi-classification problem, the requirement may be that the accuracy of one or some particular classes is greater than a threshold; a sketch of such a per-class check follows. If it is determined at block 330 that the training effect meets the requirements, post-training sample detection may be skipped.
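Illustratively, the refined, per-class evaluation may be sketched as follows; the 0.9 accuracy threshold is an illustrative assumption.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Compute accuracy separately for each class label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}

def training_effect_meets_requirements(y_true, y_pred, threshold=0.9):
    """Refined check: every class must reach the accuracy threshold."""
    return all(acc >= threshold
               for acc in per_class_accuracy(y_true, y_pred).values())

y_true = ["A", "A", "A", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B"]
print(per_class_accuracy(y_true, y_pred))                  # {'A': 0.666..., 'B': 1.0}
print(training_effect_meets_requirements(y_true, y_pred))  # False
```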
If it is determined at block 330 that the training effect does not meet the requirements, the method 300 proceeds to block 340. At block 340, the training samples are examined and corresponding suggestions are made. Such detection is performed in response to the training effect not meeting the requirements, i.e., the performance of the machine learning model 130 being below a predetermined performance. The inference results can be analyzed to determine possible factors leading to the poor training effect. For example, after training, the data processing device 110 examines the training sample set 120 based on the user's instructions and determines whether the training samples are balanced based on their distribution. Upon detecting a sample imbalance factor, the data processing device 110 may further determine, based on the training parameters, whether data augmentation has been applied to the training samples. If no data augmentation has been performed, performing data augmentation is suggested. The data processing device 110 may also detect whether abnormal samples are present in the training sample set 120. If abnormal samples are present, removing them is suggested.
In summary, embodiments of the present disclosure may examine the training samples of a machine learning model at different stages, for example automatically or as directed, to minimize the factors that degrade model performance and to generate detection results, thereby helping cross-domain users utilize the model or locate problems affecting model performance.
Abnormal sample detection is described above with reference to FIG. 2. One type of abnormal sample is a sample with a wrong label. Example embodiments of detecting and verifying such abnormal samples with wrong labels are further described below with reference to FIGS. 4-6. FIG. 4 illustrates a schematic diagram of an example data processing process 400 including abnormal sample detection and verification, according to some embodiments of the present disclosure. As shown in FIG. 4, the training sample set 401 is obtained from a DMS, but it should be understood that this is merely exemplary and that the training sample set 401 may be obtained in any suitable manner. The training sample set 401 may be regarded as an example of the training sample set 120 described above with reference to FIG. 1B, so its description is not repeated.
At block 402, data analysis is performed on the training sample set 401 to detect sample imbalance factors. If a sample imbalance factor is detected, the training sample set 401 may be augmented accordingly, for example to increase the data for categories with fewer original samples. Any suitable data augmentation method may be employed; embodiments of the present disclosure are not limited in this respect.
At block 403, the machine learning model may be trained using the augmented training sample set 401. A base version of the machine learning model, also referred to as the base model, may thus be obtained. The performance of the base model may be validated using a set of validation samples. If the performance meets the requirements, deployment of the machine learning model may be performed at block 404. If the performance does not meet the requirements, the process 400 proceeds to an example process 450 of abnormal sample detection and verification.
In general, in the process 450, an abnormal sample detection result may be generated based on the respective labels and respective features of the training samples in the training sample set 401. The abnormal sample detection result indicates at least a plurality of candidate samples in the training sample set 401, each candidate sample being a candidate for an abnormal sample with a wrong label. That is, candidate samples that may be abnormal samples can be detected based on the respective labels and respective features of the training samples.
In some embodiments, the abnormal sample detection result may include respective anomaly scores for the candidate samples, the anomaly score of each candidate sample indicating the probability that the candidate sample has a wrong label. In other words, the anomaly score may indicate the likelihood that the corresponding candidate sample is an abnormal sample. In such embodiments, as shown in FIG. 4, the detection of candidate samples may include the sample scoring performed at block 451 and the sample screening performed at block 452.
The sample scoring may be based on the respective features and respective labels of the training samples. In particular, for any one of the training samples in the training sample set 401 (also referred to as a given training sample), a set of similar samples for the given training sample, and the respective similarities of those similar samples to the given training sample, may be determined from the training sample set 401 based on the respective features of the training samples. For example, the features of each training sample may be extracted using any suitable feature extraction model. Then, for any training sample under consideration, the K training samples closest to it in the feature space are found as its similar samples, where K is a positive integer greater than or equal to 1. For example, the K-nearest-neighbor (KNN) algorithm may be utilized to determine the similar samples; a sketch is given below. That is, the set of similar samples consists of the K training samples whose features are closest to those of the given training sample.
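Illustratively, the similar-sample lookup may be sketched as follows, assuming features have already been extracted; scikit-learn's NearestNeighbors is one possible implementation, not one prescribed above. Since each sample is its own nearest neighbor, K+1 neighbors are requested and the first is dropped.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_similar_samples(features, k=5):
    """For each training sample, find the K training samples whose
    features are closest, together with the distances to them."""
    features = np.asarray(features)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    distances, indices = nn.kneighbors(features)
    # Column 0 is each sample itself (distance 0); drop it.
    return distances[:, 1:], indices[:, 1:]

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))  # 100 samples, 8-dim features
distances, neighbor_idx = find_similar_samples(features, k=5)
print(neighbor_idx[0])  # indices of the 5 most similar samples to sample 0
```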
The anomaly score for a given training sample may then be determined based on the label of the given training sample, the respective labels of the set of similar samples, and the respective similarities of the similar samples to the given training sample. As mentioned previously, the anomaly score indicates the probability that the given training sample has a wrong label. The set of similar samples may contain both similar samples whose labels match that of the given training sample and similar samples whose labels do not. In calculating the anomaly score, the contributions of the label-matched and label-mismatched similar samples to the anomaly score may differ.
In some embodiments, both label-matched and label-mismatched similar samples may be considered in calculating the anomaly score, but with different weights. Specifically, for each similar sample, a weight may be determined based on whether the label of the given training sample matches the label of that similar sample. The anomaly score may then be determined based on the respective similarities and weights of the set of similar samples. It will be appreciated that the more label-matched similar samples there are, the less likely the given training sample is to be an abnormal sample. In view of this, the contribution of a label-matched similar sample to the anomaly score is negative compared with that of a label-mismatched similar sample.
An example of calculating the anomaly score is described with reference to FIG. 5. As shown in FIG. 5, a given training sample 501 has a label A. Using an algorithm such as KNN, the 5 similar samples 511, 512, 513, 514, and 515 closest in features to the given training sample 501 are found from the training sample set. Similar samples 511, 512, 513, and 514 have label A, while similar sample 515 has label B. By way of example, the samples may be images of a wafer, such as scanning electron microscope images, and the label may be the type of defect in the wafer.
From the features of the given training sample 501 and the corresponding features of the similar samples 511, 512, 513, 514, and 515, the respective similarities of these similar samples to the given training sample 501, i.e., similarity 1, similarity 2, similarity 3, similarity 4, and similarity 5 as shown in FIG. 5, may be calculated. The similar samples 511, 512, 513, and 514, which have the same label as the given training sample 501, have a weight W1, while the similar sample 515, which has a different label, has a weight W2. Considering the negative contribution of the label-matched samples, (1 - similarity 1), (1 - similarity 2), (1 - similarity 3), (1 - similarity 4), and similarity 5 may be weighted by the weights W1 and W2 to yield the anomaly score 520 for the given training sample 501.
In some embodiments, only label-mismatched similar samples, rather than label-matched ones, may be considered in calculating the anomaly score. In particular, one or more similar samples whose labels do not match that of the given training sample may be selected from the set of similar samples. The anomaly score may then be determined based on the respective similarities of those one or more label-mismatched similar samples to the given training sample. Illustratively, the anomaly score may be calculated by:
$$s = \frac{1}{k}\sum_{i=1}^{k} w_i \cdot \mathrm{sim}_i \qquad (1)$$

where $s$ represents the anomaly score of the given training sample currently under consideration; $k$ represents the number of similar samples; $i$ indexes the similar samples; $\mathrm{sim}_i$ represents the similarity of the $i$-th similar sample to the given training sample, which may be, for example, a normalized distance between the features of the $i$-th similar sample and the features of the given training sample; and $w_i$ represents the weight of the $i$-th similar sample. For example, $w_i$ may be 0 if the label of the $i$-th similar sample matches that of the given training sample, and 1 if it does not.
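Illustratively, the scoring of equation (1) may be sketched as follows; it is assumed that the neighbor similarities have been normalized into [0, 1], and the example weighting of 0 for matching labels and 1 otherwise is used.

```python
import numpy as np

def anomaly_score(sample_label, neighbor_labels, neighbor_sims):
    """Equation (1): s = (1/k) * sum_i w_i * sim_i, where w_i = 0 when the
    i-th similar sample's label matches the given sample's label, else 1."""
    weights = np.array([0.0 if lbl == sample_label else 1.0
                        for lbl in neighbor_labels])
    sims = np.asarray(neighbor_sims)
    return float(np.mean(weights * sims))

# Mirroring FIG. 5: four neighbors share label "A", one has label "B".
print(anomaly_score("A", ["A", "A", "A", "A", "B"],
                    [0.9, 0.8, 0.8, 0.7, 0.6]))  # only the "B" neighbor counts
```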
In the embodiments described above, similar samples are determined based on the features, and the anomaly score is then calculated based on the labels of the similar samples. Alternatively or additionally, in some embodiments, a set of training samples with matching labels may be determined based on the labels of the individual training samples in the training sample set 401. Further, the feature difference between each pair of training samples in the set may be determined, and abnormal samples may then be determined from the set based on feature differences exceeding a threshold difference. For example, a training sample may be detected as an abnormal sample if the feature differences between that training sample and a number of other training samples in the same set all exceed the threshold difference. Such embodiments are described above with reference to the abnormal sample detection unit 214 and are therefore not described again in detail.
With continued reference to FIG. 4, after the sample scoring at block 451, anomaly scores for each training sample in the training sample set 401 are obtained. At block 452, the training samples are screened based on the anomaly scores to determine a plurality of candidate samples, also referred to as a candidate sample set, that are candidates for abnormal samples. For example, training samples with anomaly scores exceeding a threshold score may be determined to be candidate samples. As another example, the training samples may be ranked by anomaly score, and a predetermined number or proportion of the top-ranked training samples determined to be candidate samples.
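Illustratively, the screening at block 452 may be sketched as follows; both the threshold score and the top fraction are illustrative parameters.

```python
import numpy as np

def screen_candidates(scores, threshold_score=None, top_fraction=None):
    """Select candidate samples either by an absolute anomaly-score
    threshold or by taking the top fraction of samples ranked by score."""
    scores = np.asarray(scores)
    if threshold_score is not None:
        return np.flatnonzero(scores > threshold_score)
    n_top = max(1, int(len(scores) * top_fraction))
    return np.argsort(scores)[::-1][:n_top]  # indices, highest scores first

scores = [0.05, 0.62, 0.10, 0.91, 0.33]
print(screen_candidates(scores, threshold_score=0.5))  # [1 3]
print(screen_candidates(scores, top_fraction=0.4))     # [3 1]
```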
At block 453, detection result verification is performed. For example, one or more target samples may be determined from the candidate samples based on the abnormal sample detection result, for verifying whether those target samples have wrong labels. Information about the one or more target samples may then be presented for manual (e.g., expert) verification, and verification results received accordingly. A verification result may indicate whether the label of a target sample is wrong, i.e., whether the target sample is an abnormal sample. In some embodiments, the verification result may also indicate a new label for the target sample.
Example embodiments of determining the target samples for verification from the candidate samples are described below. It will be appreciated that the training samples may be numerous while the capacity for manual verification is limited, so the samples to be verified need to be selected. In some embodiments, the target samples may be selected randomly from the candidate samples. In some embodiments, the target samples may be selected in descending order of anomaly score.
In some embodiments, to account for possible inaccuracy in the anomaly score calculation, the candidate samples may first be grouped before target samples are selected. For example, the candidate samples may be partitioned into multiple groups of candidate samples based on their respective anomaly scores. For each group of candidate samples, a selection probability is determined based on the anomaly scores of that group, yielding a corresponding selection probability for each group. Target samples may then be selected from the candidate samples according to the respective selection probabilities of the groups.
An example of selecting target samples for verification is described with reference to fig. 6. As shown in fig. 6, the candidate samples in the candidate sample set 601 are divided into h groups of candidate samples, denoted g1, g2, g3, …, gh, according to a grouping criterion 602. The grouping criterion 602 relates to the total number of candidate sample groups and the anomaly scores of the candidate samples. Illustratively, the grouping criterion 602 may be as follows:
$$\frac{i-1}{h} < \mathrm{score} \le \frac{i}{h} \qquad (2)$$
where $\mathrm{score}$ represents the anomaly score of a candidate sample (assumed here to be normalized to the range $(0, 1]$), $h$ represents the total number of candidate sample groups, and $i$ represents the $i$-th group of candidate samples. That is, candidate samples whose anomaly scores fall within the range $\left(\frac{i-1}{h}, \frac{i}{h}\right]$ are divided into the $i$-th group of candidate samples. It should be understood that the grouping criteria described herein are merely exemplary and are not intended to limit the scope of the present disclosure.
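Under the assumption that anomaly scores are normalized to (0, 1], the grouping criterion (2) reduces to a one-line bucket computation; the names are illustrative:

    import math

    def group_index(score: float, h: int) -> int:
        # A score in ((i-1)/h, i/h] maps to group i; the clamp handles boundary values.
        return min(max(math.ceil(score * h), 1), h)

For example, with h = 4 groups, a score of 0.30 lands in group 2 and a score of 0.95 in group 4.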
Next, the selection probability of each group of candidate samples may be calculated. As shown in fig. 6, the respective selection probabilities of the h groups of candidate samples are denoted by p1, p2, p3, …, ph. Illustratively, the selection probability $p_i$ of the $i$-th group of candidate samples may be calculated by equation (3) and satisfies equation (4):
$$p_i = \frac{\sum_{j=1}^{n_i} \mathrm{score}_{i,j}}{\sum_{m=1}^{N} \mathrm{score}_m} \qquad (3)$$
$$\sum_{i=1}^{h} p_i = 1 \qquad (4)$$
where $n_i$ represents the number of samples in the $i$-th group of candidate samples, $\mathrm{score}_{i,j}$ represents the anomaly score of the $j$-th candidate sample in the $i$-th group of candidate samples, $N$ represents the total number of candidate samples in the candidate sample set 601, and $\mathrm{score}_m$ represents the anomaly score of the $m$-th candidate sample in the candidate sample set 601. As equation (4) states, the selection probabilities of the h groups of candidate samples sum to 1.
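A sketch of equations (3) and (4), assuming group_scores is a list in which entry i holds the anomaly scores of the i-th group as an array (names illustrative):

    import numpy as np

    def selection_probabilities(group_scores: list[np.ndarray]) -> np.ndarray:
        group_sums = np.array([g.sum() for g in group_scores])
        probs = group_sums / group_sums.sum()  # equation (3)
        assert abs(probs.sum() - 1.0) < 1e-9   # equation (4): probabilities sum to 1
        return probs

The resulting vector can then drive a draw such as np.random.choice(h, p=probs) to pick the group from which the next target sample is taken.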
Next, a target sample 603 may be selected from the candidate samples according to the calculated selection probability. Information related to the target sample 603 may then be presented for manual verification. Thus, the verification result 604 may be received.
In some embodiments, a predetermined number or a predetermined proportion of target samples may be selected at a time according to the selection probabilities calculated above.
In some embodiments, multiple rounds of selection may be performed, with a predetermined number or predetermined proportion of target samples being selected in each round of selection. The selection probability calculated above may be taken as the probability used for the first round of selection. The remaining candidate samples may be determined in subsequent rounds of selection by removing the previously selected target samples. The probability of selection of each set of candidate samples may be updated based on the verification results of the target samples selected from the previous round.
For example, a feedback score for a candidate sample group to which a target sample belongs may be generated from a verification result of the target sample. The selection probability may then be updated with the feedback score. For example, the selection probability may be updated as follows:
$$p_i' = \frac{\sum_{j=1}^{n_i} \left(\mathrm{score}_{i,j} + \mathrm{reward}_{i,j}\right)}{\sum_{m=1}^{N} \left(\mathrm{score}_m + \mathrm{reward}_m\right)} \qquad (5)$$
where $p_i'$ represents the updated selection probability of the $i$-th group of candidate samples, and $\mathrm{reward}_{i,j}$ represents the feedback score of the $j$-th candidate sample in the $i$-th group if that candidate sample was selected as a target sample in the previous round (and may be taken as 0 for candidate samples that were not selected). Illustratively, the feedback score is related at least to whether the sample is an abnormal sample. The remaining symbols in equation (5) are the same as those in equation (3) and are therefore not described again.
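A sketch of the update, following the reconstruction of equation (5) above: rewards holds, per group, a feedback score for each candidate (zero for candidates never selected), so positive feedback raises and negative feedback lowers the group's probability. All names are illustrative assumptions.

    import numpy as np

    def updated_probabilities(group_scores: list[np.ndarray],
                              rewards: list[np.ndarray]) -> np.ndarray:
        # Fold the feedback scores into the per-group score sums, as in equation (5).
        adjusted = np.array([(g + r).sum() for g, r in zip(group_scores, rewards)])
        adjusted = np.clip(adjusted, 1e-12, None)  # keep probabilities non-negative
        return adjusted / adjusted.sum()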
In some embodiments, the probability of selection of the first set of candidate samples to which the first target sample belongs may be increased if the verification result of the first target sample selected in the previous round of selection indicates that the first target sample has an error tag. That is, if an existing sample in a set of candidate samples is verified as a label error, it is desirable to select a target sample from the set of candidate samples with a greater probability for verification.
In some embodiments, if the verification result of the second target sample selected in the previous round of selection indicates that the second target sample does not have an error tag, the probability of selection of the second set of candidate samples to which the second target sample belongs may be reduced. That is, if an existing sample in a set of candidate samples is verified as being a label that is correct, it is desirable to select a target sample from the set of candidate samples with a smaller probability for verification.
In a multi-round selection embodiment, any suitable condition may be used to terminate the selection iteration described above. For example, the termination condition may be that the number of samples that have been validated exceeds a threshold number. As another example, the termination condition may be that a threshold number of rounds of selection have been made. For another example, the termination condition may be that selected target samples in a certain round are all verified as not having an error tag.
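Putting the rounds together, a sketch of the full selection loop with the three example termination conditions might read as follows; verify stands in for the manual verification step, and the 1.2/0.8 multiplicative nudge is a simplification of the equation (5) update, used here only to keep the sketch short. All names and constants are assumptions.

    import random

    def multi_round_select(groups, probs, verify, max_rounds=10, per_round=8, budget=64):
        # groups: list of lists of candidate ids; probs: per-group selection probabilities.
        verified = 0
        for _ in range(max_rounds):              # stop after a threshold number of rounds
            picked, any_error = [], False
            for _ in range(per_round):
                gi = random.choices(range(len(probs)), weights=probs)[0]
                if groups[gi]:
                    picked.append((gi, groups[gi].pop()))  # remove from remaining candidates
            for gi, sample in picked:
                has_error = verify(sample)       # manual (e.g., expert) verification
                any_error = any_error or has_error
                probs[gi] *= 1.2 if has_error else 0.8  # raise or lower the group's probability
            total = sum(probs)
            probs = [p / total for p in probs]
            verified += len(picked)
            if verified >= budget or not any_error:  # budget reached, or a clean round
                break
        return probs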
With continued reference to fig. 4, after the detection result verification, feedback for the training sample set 401 may be determined based on the verification results 604, and an update operation associated with the training sample set 401 performed accordingly. In some embodiments, if a verification result 604 indicates the correct label for a sample with a label error, the feedback may be to update the label of that sample in the training sample set 401, so that the training sample set 401 is updated. In some embodiments, if a verification result 604 indicates that a sample's label is incorrect but does not provide a correct label, the feedback may be to remove that sample from the training sample set 401.
The capacity for manual verification is limited, and not every candidate sample can be verified. If a large number of samples in a group of candidate samples are verified as having label errors, how the other, unverified samples in that group should be processed requires further consideration. In view of this, in some embodiments where the candidate samples are grouped, for example as described above with reference to fig. 6, the number of target samples that are verified as having erroneous labels and that belong to a given group of candidate samples may be determined for each group based on the verification results. If the number meets a preset condition, the group of candidate samples may be removed from the training sample set 401 to update the training sample set 401, or the contribution of the group of candidate samples in the training of the machine learning model may be reduced. The preset condition may include, for example, the number exceeding a threshold number. In some embodiments, reducing the contribution of the group of candidate samples in the training of the machine learning model may include reducing the weight of the training loss terms corresponding to the group of candidate samples in the total training loss of the machine learning model.
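As one possible realization of the loss re-weighting, assuming a PyTorch-style training step with a per-sample classification loss; the down_weight value and the boolean group mask are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def weighted_training_loss(logits: torch.Tensor, targets: torch.Tensor,
                               in_suspect_group: torch.Tensor,
                               down_weight: float = 0.2) -> torch.Tensor:
        # Per-sample losses, so each training loss term can be weighted individually.
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        # Suspect-group samples contribute with reduced weight to the total training loss.
        weights = torch.where(in_suspect_group,
                              torch.full_like(per_sample, down_weight),
                              torch.ones_like(per_sample))
        return (weights * per_sample).sum() / weights.sum()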
After the update operation associated with training sample set 401 is determined, machine learning model adjustment is performed at block 455. For example, the machine learning model is trained using the updated training sample set 401. At block 456, sample verification is performed on the updated training sample set 401. For example, it may be determined whether the proportion of candidate samples in the updated training sample set 401 that are abnormal samples exceeds a threshold. If the verification passes (e.g., the proportion is below the threshold), the process may proceed to block 404 to deploy the machine learning model trained using the updated training sample set 401. If the verification does not pass (e.g., the proportion is above the threshold), process 450 may be repeated.
Fig. 7 illustrates a flowchart of an example method 700 for data processing according to an embodiment of the disclosure. For example, method 700 may be performed by data processing device 110 as shown in FIG. 1B. The method 700 is described below in conjunction with fig. 1B. It should be appreciated that method 700 may also include additional blocks not shown and/or that certain blocks shown may be omitted. The scope of the present disclosure is not limited in this respect.
At block 710, a set of training samples for a machine learning model is obtained, each training sample in the set of training samples being associated with at least one of a design or a measurement of an integrated circuit.
At block 720, factors in the training sample set that reduce performance of the machine learning model are detected based on at least one of the respective labels or the respective features of the training samples.
At block 730, sample detection results are generated based on the detected factors, the sample detection results indicating at least a recommended treatment for the training sample set.
In some embodiments, generating the sample detection result based on the detected factors comprises: in response to detecting a sample imbalance factor indicating that a proportion of a first set of training samples in the training sample set is less than a threshold proportion, generating first recommendation information as at least a portion of the sample detection result, the first recommendation information indicating that the number of samples in the first set of training samples should be increased.
In some embodiments, detecting factors in the training sample set that reduce the performance of the machine learning model includes: dividing the training samples in the training sample set into a plurality of groups of training samples based on the respective labels, the training samples in a same group having matching labels; determining, for each group of training samples in the plurality of groups, the proportion of that group of training samples in the training sample set; and in response to the proportion of the first set of training samples being less than the threshold proportion, determining that the sample imbalance factor is detected.
In some embodiments, generating the sample detection result based on the detected factors comprises: in response to detecting the presence of an abnormal sample in the training sample set, generating second recommendation information as at least a portion of the sample detection result, the second recommendation information indicating removal or correction of the abnormal sample, wherein a difference between a feature of the abnormal sample and features of a plurality of training samples in the training sample set is greater than a first threshold difference.
In some embodiments, detecting factors in the training sample set that reduce the performance of the machine learning model includes: determining, based on the respective labels, a second set of training samples in the training sample set having matching labels; for each pair of training samples in the second set of training samples, determining a feature difference between the pair of training samples based on respective features of the pair, thereby determining a plurality of feature differences; and in response to the plurality of feature differences including feature differences exceeding the first threshold difference, determining an abnormal sample from the second set of training samples based on the feature differences exceeding the first threshold difference.
In some embodiments, the sample detection result further comprises at least one of: descriptive information of an abnormal sample, or descriptive information of at least one training sample of the plurality of training samples.
In some embodiments, the detection of the factor is performed prior to training the machine learning model using the training sample set.
In some embodiments, the detection of the factor is performed in response to the performance of a machine learning model trained using the training sample set being below a predetermined performance.
In some embodiments, the machine learning model is configured to detect hot spots in the integrated circuit layout, and each training sample includes a sample layout.
In some embodiments, the machine learning model is configured to detect defects in the wafer, and each training sample includes an image of the manufactured wafer.
Fig. 8 illustrates a flowchart of an example method 800 for data processing according to an embodiment of the disclosure. For example, the method 800 may be performed by the data processing device 110 as shown in FIG. 1B. The method 800 is described below in connection with fig. 1B. It should be appreciated that method 800 may also include additional blocks not shown and/or that certain blocks shown may be omitted. The scope of the present disclosure is not limited in this respect.
At block 810, the data processing device 110 obtains a training sample set for a machine learning model. Each training sample in the training sample set is associated with at least one of a design or a measurement of the integrated circuit.
At block 820, the data processing device 110 generates an abnormal sample detection result based on the respective labels and the respective features of the training samples in the training sample set. The abnormal sample detection result indicates at least a plurality of candidate samples in the training sample set, each candidate sample being a candidate for an abnormal sample with a false label.
At block 830, the data processing device 110 determines one or more target samples from the plurality of candidate samples based on the abnormal sample detection result, for verifying whether the one or more target samples have erroneous labels.
At block 840, the data processing device 110 performs an update operation associated with the training sample set based on the respective verification results for the one or more target samples.
In some embodiments, generating the abnormal sample detection result includes: for a given training sample in the training sample set, determining a set of similar samples for the given training sample and respective similarities of the set of similar samples to the given training sample from the training sample set based on respective features of the training samples in the training sample set; determining an anomaly score for the given training sample based on the labels of the given training sample, the respective labels of the set of similar samples, and the respective similarities, the anomaly score indicating a probability that the given training sample has a false label; and in response to the anomaly score exceeding the threshold score, determining the given training sample as one of a plurality of candidate samples.
In some embodiments, determining the anomaly score for a given training sample comprises: for each similar sample in a set of similar samples, determining a weight for the similar sample based on whether the label of a given training sample matches the label of the similar sample; and determining an anomaly score based on the respective similarities and the respective weights of the set of similar samples.
In some embodiments, determining the anomaly score for a given training sample comprises: selecting, from the set of similar samples, one or more similar samples whose labels do not match that of the given training sample; and determining the anomaly score based on the respective similarities of the one or more similar samples to the given training sample.
In some embodiments, generating the abnormal sample detection result includes: determining, based on the respective labels, a set of training samples in the training sample set having matching labels; for each pair of training samples in the set of training samples, determining a feature difference between the pair of training samples based on their respective features, thereby determining a plurality of feature differences; and in response to the plurality of feature differences including feature differences exceeding a threshold difference, determining at least one candidate sample of the plurality of candidate samples from the set of training samples based on the feature differences exceeding the threshold difference.
In some embodiments, the abnormal sample detection result includes respective anomaly scores for the plurality of candidate samples, the anomaly score of each candidate sample indicating a probability that the candidate sample has an erroneous label, and determining one or more target samples from the plurality of candidate samples comprises: dividing the plurality of candidate samples into multiple groups of candidate samples based on the respective anomaly scores; for each group of candidate samples in the multiple groups, determining a selection probability for the group based on the anomaly scores of the group, to obtain corresponding selection probabilities for the multiple groups; and selecting the one or more target samples from the plurality of candidate samples based on the respective selection probabilities of the multiple groups of candidate samples.
In some embodiments, selecting one or more target samples from the plurality of candidate samples includes multiple rounds of selection, and in each round of selection after the first round, the following is performed: determining the remaining candidate samples by removing the target samples selected in the previous round; updating the respective selection probabilities of the multiple groups of candidate samples based on the verification results of the target samples selected in the previous round; and selecting the target samples of the current round from the remaining candidate samples based on the updated respective selection probabilities.
In some embodiments, updating the respective selection probabilities of the groups of candidate samples includes: in response to the verification result of a first target sample selected in the previous round indicating that the first target sample has an erroneous label, increasing the selection probability of the first group of candidate samples to which the first target sample belongs; or in response to the verification result of a second target sample selected in the previous round indicating that the second target sample does not have an erroneous label, reducing the selection probability of the second group of candidate samples to which the second target sample belongs.
In some embodiments, performing the update operation associated with the training sample set includes: determining, based on the respective verification results, the number of target samples verified as having erroneous labels and belonging to a third group of candidate samples; and in response to the number meeting a preset condition, removing the third group of candidate samples from the training sample set to update the training sample set, or reducing the contribution of the third group of candidate samples in the training of the machine learning model.
In some embodiments, reducing the contribution of the third group of candidate samples in the training of the machine learning model comprises: reducing the weight of the training loss terms corresponding to the third group of candidate samples in the total training loss of the machine learning model.
In some embodiments, the machine learning model is configured to detect hot spots in the integrated circuit layout, and each training sample includes a sample layout.
In some embodiments, the machine learning model is configured to detect defects in the wafer, and each training sample includes an image of the manufactured wafer.
Fig. 9 illustrates a block diagram of an electronic device 900 in which one or more embodiments of the disclosure may be implemented. The electronic device 900 may be used, for example, to implement the data processing device 110 shown in FIG. 1B. It should be understood that the electronic device 900 illustrated in fig. 9 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein.
As shown in fig. 9, the electronic device 900 is in the form of a general-purpose electronic device. Components of electronic device 900 may include, but are not limited to, one or more processors or processing units 910, memory 920, storage 930, one or more communication units 940, one or more input devices 950, and one or more output devices 960. The processing unit 910 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 920. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of electronic device 900.
Electronic device 900 typically includes multiple computer storage media. Such media may be any available media that are accessible by electronic device 900, including, but not limited to, volatile and non-volatile media, removable and non-removable media. The memory 920 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. Storage device 930 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other medium that is capable of storing information and/or data (e.g., training data) and that can be accessed within electronic device 900.
The electronic device 900 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 9, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 920 may include a computer program product 925 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 940 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 900 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communications connection. Thus, the electronic device 900 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 950 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 960 may be one or more output devices such as a display, speakers, printer, etc. The electronic device 900 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., with one or more devices that enable a user to interact with the electronic device 900, or with any device (e.g., network card, modem, etc.) that enables the electronic device 900 to communicate with one or more other electronic devices, as desired, via the communication unit 940. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which one or more computer instructions are stored, wherein the one or more computer instructions are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.

Claims (14)

1. A method of data processing, comprising:
obtaining a set of training samples for a machine learning model, each training sample in the set of training samples being associated with at least one of a design or a measurement of an integrated circuit;
generating an abnormal sample detection result based on the corresponding labels and the corresponding features of the training samples in the training sample set, wherein the abnormal sample detection result at least indicates a plurality of candidate samples in the training sample set, and each candidate sample is a candidate of an abnormal sample with an error label;
determining one or more target samples from the plurality of candidate samples based on the abnormal sample detection results for verifying whether the one or more target samples have error labels, wherein the abnormal sample detection results include respective abnormal scores for the plurality of candidate samples, the abnormal score for each candidate sample indicating a probability that the candidate sample has an error label, and the plurality of candidate samples being partitioned into multiple sets of candidate samples based on the respective abnormal scores;
determining a number of target samples verified as having an error label and belonging to a third set of candidate samples based on respective verification results for the one or more target samples; and
in response to the number meeting a preset condition, performing an update operation associated with the training sample set for the third set of candidate samples.
2. The data processing method according to claim 1, wherein generating an abnormal sample detection result includes:
for a given training sample in the training sample set, determining a set of similar samples for the given training sample and respective similarities for the set of similar samples to the given training sample from the training sample set based on respective features of training samples in the training sample set;
determining an anomaly score for the given training sample based on the labels of the given training sample, the respective labels of the set of similar samples, and the respective similarities, the anomaly score indicating a probability that the given training sample has a false label; and
in response to the anomaly score exceeding a threshold score, the given training sample is determined to be one of the plurality of candidate samples.
3. The data processing method of claim 2, wherein determining the anomaly score for the given training sample comprises:
for each similar sample in the set of similar samples, determining a weight for the similar sample based on whether the label of the given training sample matches the label of the similar sample; and
the anomaly score is determined based on the respective similarity and the respective weights of the set of similar samples.
4. The data processing method of claim 2, wherein determining the anomaly score for the given training sample comprises:
selecting one or more similar samples from the set of similar samples whose labels do not match that of the given training sample; and
the anomaly score is determined based on respective similarities of the one or more similar samples to the given training sample.
5. The data processing method according to claim 1, wherein generating an abnormal sample detection result includes:
determining a set of training samples in the training sample set having matching labels based on the respective labels;
for each pair of training samples in the set of training samples, determining a feature difference between the pair of training samples based on respective features of the pair of training samples to determine a plurality of feature differences; and
in response to the plurality of feature differences having feature differences exceeding a threshold difference, at least one candidate sample of the plurality of candidate samples is determined from the set of training samples based on the feature differences exceeding the threshold difference.
6. The data processing method of claim 1, wherein determining one or more target samples from the plurality of candidate samples comprises:
for each set of candidate samples in the plurality of sets of candidate samples, determining a selection probability for the set of candidate samples based on the anomaly scores for the set of candidate samples to obtain corresponding selection probabilities for the plurality of sets of candidate samples; and
the one or more target samples are selected from the plurality of candidate samples based on the respective selection probabilities for the plurality of sets of candidate samples.
7. The data processing method of claim 6, wherein selecting the one or more target samples from the plurality of candidate samples comprises multiple rounds of selection, and wherein in each round of selection after a first round of selection the following is performed:
determining remaining candidate samples by removing selected target samples in a previous selection of the round of selections;
updating the respective selection probabilities of the plurality of sets of candidate samples based on the verification results of the selected target samples in the previous round of selection; and
the target samples of the round are selected from the remaining candidate samples based on the updated respective selection probabilities.
8. The data processing method of claim 7, wherein updating the respective selection probabilities for the plurality of sets of candidate samples comprises:
increasing the selection probability of a first group of candidate samples to which the first target sample belongs in response to the verification result of the first target sample selected in the previous round of selection indicating that the first target sample has an error label; or
reducing the selection probability of a second group of candidate samples to which the second target sample belongs in response to the verification result of the second target sample selected in the previous round of selection indicating that the second target sample does not have an error label.
9. The data processing method of claim 1, wherein performing an update operation associated with the training sample set for the third set of candidate samples in response to the number meeting a preset condition comprises:
in response to the number satisfying the preset condition,
removing the third set of candidate samples from the training sample set to update the training sample set; or
reducing the contribution of the third set of candidate samples in the training of the machine learning model.
10. The data processing method of claim 9, wherein reducing the contribution of the third set of candidate samples in the training of the machine learning model comprises:
the weight of training loss terms corresponding to the third set of candidate samples in the total training loss of the machine learning model is reduced.
11. The data processing method of claim 1, wherein the machine learning model is configured to detect hotspots in an integrated circuit layout and each training sample comprises a sample layout.
12. The data processing method of claim 1, wherein the machine learning model is configured to detect defects in a wafer, and each training sample comprises an image of the manufactured wafer.
13. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, which when executed by the at least one processing unit, cause the electronic device to perform the method of any one of claims 1 to 12.
14. A computer readable storage medium having stored thereon a computer program, wherein the computer program is executable by a processor to implement the method of any of claims 1 to 12.
CN202311597853.3A 2023-11-23 2023-11-23 Method, apparatus and medium for data processing Active CN117313899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311597853.3A CN117313899B (en) 2023-11-23 2023-11-23 Method, apparatus and medium for data processing

Publications (2)

Publication Number Publication Date
CN117313899A CN117313899A (en) 2023-12-29
CN117313899B true CN117313899B (en) 2024-02-23

Family

ID=89288729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311597853.3A Active CN117313899B (en) 2023-11-23 2023-11-23 Method, apparatus and medium for data processing

Country Status (1)

Country Link
CN (1) CN117313899B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060092394A (en) * 2005-02-17 2006-08-23 삼성전자주식회사 Method and apparatus for eliminating outlier samples from database
CN102880875A (en) * 2012-10-12 2013-01-16 西安电子科技大学 Semi-supervised learning face recognition method based on low-rank representation (LRR) graph
JP2013080395A (en) * 2011-10-04 2013-05-02 Nippon Telegr & Teleph Corp <Ntt> Misclassification detecting device, method, and program
CN107346448A (en) * 2016-05-06 2017-11-14 富士通株式会社 Identification device, trainer and method based on deep neural network
CN109583297A (en) * 2018-10-25 2019-04-05 清华大学 Retina OCT volume data identification method and device
CN110991657A (en) * 2019-11-22 2020-04-10 深圳市魔数智擎人工智能有限公司 Abnormal sample detection method based on machine learning
CN111860674A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Sample class identification method and device, computer equipment and storage medium
CN112861962A (en) * 2021-02-03 2021-05-28 北京百度网讯科技有限公司 Sample processing method, sample processing device, electronic device and storage medium
CN114077859A (en) * 2020-08-17 2022-02-22 阿里巴巴集团控股有限公司 Abnormal sample detection method and device, electronic device and storage medium
CN115130535A (en) * 2022-04-08 2022-09-30 腾讯科技(深圳)有限公司 Sample noise identification method and device, electronic equipment and storage medium
CN115810135A (en) * 2021-09-14 2023-03-17 日本电气株式会社 Method, electronic device, storage medium, and program product for sample analysis

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3832281B2 (en) * 2001-06-27 2006-10-11 日本電気株式会社 Outlier rule generation device, outlier detection device, outlier rule generation method, outlier detection method, and program thereof
US10147049B2 (en) * 2015-08-31 2018-12-04 International Business Machines Corporation Automatic generation of training data for anomaly detection using other user's data samples
US10198576B2 (en) * 2015-12-10 2019-02-05 AVAST Software s.r.o. Identification of mislabeled samples via phantom nodes in label propagation
US11416757B2 (en) * 2019-11-04 2022-08-16 International Business Machines Corporation Classifier training using noisy samples
US11636389B2 (en) * 2020-02-19 2023-04-25 Microsoft Technology Licensing, Llc System and method for improving machine learning models by detecting and removing inaccurate training data
US20230096895A1 (en) * 2021-09-30 2023-03-30 Microsoft Technology Licensing, Llc Command classification using active learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xia Shuyin et al., "Complete random forest based class noise filtering learning for improving the generalizability of classifiers," IEEE Transactions on Knowledge and Data Engineering (full text). *
Xu Maolong, "Label Noise Filtering Framework Based on Anomaly Detection," Computer Science, pp. 1-19. *

Also Published As

Publication number Publication date
CN117313899A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US10223615B2 (en) Learning based defect classification
US20120016824A1 (en) Method for computer-assisted analyzing of a technical system
JP6584250B2 (en) Image classification method, classifier configuration method, and image classification apparatus
CN107577605A (en) A kind of feature clustering system of selection of software-oriented failure prediction
CN109670255B (en) Typical simulation condition recommendation method for time sequence parameter clustering
CN110909868A (en) Node representation method and device based on graph neural network model
Zhang et al. Diagnostic system based on support-vector machines for board-level functional diagnosis
US20230178399A1 (en) Systems and methods for systematic physical failure analysis (pfa) fault localization
Rahim et al. Software defect prediction with naïve Bayes classifier
CN112836735A (en) Optimized random forest processing unbalanced data set method
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
Pan et al. Unsupervised root-cause analysis for integrated systems
JP7150918B2 (en) Automatic selection of algorithm modules for specimen inspection
CN117313899B (en) Method, apparatus and medium for data processing
CN117313900B (en) Method, apparatus and medium for data processing
Liu et al. Knowledge transfer in board-level functional fault identification using domain adaptation
US20220230028A1 (en) Determination method, non-transitory computer-readable storage medium, and information processing device
CN113127342B (en) Defect prediction method and device based on power grid information system feature selection
WO2022059135A1 (en) Error cause estimation device and estimation method
Bolchini et al. Machine learning-based techniques for incremental functional diagnosis: A comparative analysis
CN117561502A (en) Method and device for determining failure reason
CN115185814B (en) Multi-defect positioning method, system and equipment based on two-dimensional program frequency spectrum
JP7348945B2 (en) Information processing method and information processing system
CN116996527B (en) Method for synchronizing data of converging current divider and storage medium
EP3772025A1 (en) Method for determining at least one defective node in an input graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant