WO2023278190A1 - Method for determining corrective film pattern to reduce semiconductor wafer bow - Google Patents
- Publication number
- WO2023278190A1 (PCT/US2022/034138)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- wafer
- model
- bow
- film pattern
- neural network
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0475—Generative networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06N3/09—Supervised learning
- G06N3/091—Active learning
- G06N3/094—Adversarial learning
- H—ELECTRICITY
- H01—ELECTRIC ELEMENTS
- H01L—SEMICONDUCTOR DEVICES NOT COVERED BY CLASS H10
- H01L21/00—Processes or apparatus adapted for the manufacture or treatment of semiconductor or solid state devices or of parts thereof
- H01L21/67—Apparatus specially adapted for handling semiconductor or electric solid state devices during manufacture or treatment thereof; Apparatus specially adapted for handling wafers during manufacture or treatment of semiconductor or electric solid state devices or components; Apparatus not specifically provided for elsewhere
- H01L21/67005—Apparatus not specifically provided for elsewhere
- H01L21/67242—Apparatus for monitoring, sorting or marking
- H01L21/67288—Monitoring of warpage, curvature, damage, defects or the like
Definitions
- The present invention relates generally to semiconductor wafer fabrication methods. More specifically, it relates to techniques for reducing wafer bow.
- In semiconductor manufacturing, complex structures are fabricated using sequences of thin film deposition, photolithography, and etching, repeating these steps many times for various materials using lithography and etching masks with small feature sizes.
- In each processing step, the lithography and etching masks must be precisely aligned with an absolute coordinate system for each device on a wafer. As device dimensions shrink, the tolerance for spatial deviations in mask alignment becomes stricter. If the spatial offset (or “overlay error”) between two or more processing steps is too great, the device will not work.
- One phenomenon that leads to overlay error is wafer bowing and warping during processing [1-3]. Wafer bow occurs when distinct material films with unequal thermal expansion coefficients undergo a temperature change during processing.
- The wafer bow signatures can be complex, and the in-plane distortion that arises from the bow signature is not easily determined [1,4].
- Applying a corrective film to reduce or eliminate wafer bow is one route to minimizing overlay error and increasing device yield.
- One technical challenge with this solution is deciding which corrective film pattern to apply so that it effectively removes the bow signature and adequately reduces overlay error.
- The wafer bow problem can be modeled as a linear elasticity problem and solved numerically (e.g., with the finite element method), yet the calculation time for this approach is too long to be practical in an on-line semiconductor fabrication environment.
- A “pixel sum” approach has been used, which determines the stress influence of a single pixel and then sums the stress corrections of all pixels to obtain the global stress correction (using some empirical correction factors) [9].
- They use the finite element method (FEM) to determine the equibiaxial stress of a single pixel, yet they do not employ a holistic FEM approach as described below. Further, there is no surrogate model or machine learning component in their invention.
- Other Tokyo Electron patents regarding similar applications include a substrate holding apparatus to improve bow metrology [11], location-specific tuning of stress to control overlay (general concept) [12], and a method to correct wafer bow and overlay using spatially patterned particle bombardment [13].
- The invention described below overcomes this technical challenge and solves the wafer bow correction problem using a machine learning surrogate model approach.
- Our surrogate model successfully suggests a corrective film pattern to reverse a generic wafer bow signature with computation time of about three orders of magnitude less than the finite element method approach.
- The invention provides a method for generating a corrective film pattern for reducing wafer bow in a semiconductor wafer fabrication process, the method comprising: inputting to a neural network a wafer bow signature for a predetermined semiconductor fabrication step; and generating by the neural network a corrective film pattern corresponding to the wafer bow signature; wherein the neural network is trained with a training dataset of wafer shape transformations and corresponding corrective film patterns.
- The training dataset may be generated using a simulation to compute the corrective film patterns from the wafer shape transformations for the predetermined semiconductor fabrication step.
- The training dataset may be generated by experimentally determining the corrective film patterns corresponding to the wafer shape transformations.
- The training dataset may be generated using a finite element method to solve a linear elasticity problem and an optimization framework to select the wafer shape transformations that minimize a cost function.
- The method may further include performing active learning feedback to refine the neural network.
- The neural network may be implemented as a convolutional U-Net, a Zernike convolutional neural network, a conditional variational autoencoder, or a conditional generative adversarial network.
- The conditional generative adversarial network may include a generator implemented as a U-Net with skip connections or a discriminator implemented as a convolutional classifier.
- Figs. 1A, 1B, 1C show processing flow diagrams of an embodiment of the invention.
- The different arrow styles indicate different information flows.
- Thick solid arrows indicate the movement of physical wafers in a semiconductor fab.
- Thin solid arrows indicate transfer of 2D array data, such as datasets of film patterns and wafer shape transformations.
- Dashed arrows indicate transfer of model parameter information, such as Zernike coefficients or machine learning model weights and biases.
- Fig. 2 illustrates the linear elasticity problem as solved using FEM according to an embodiment of the invention.
- Fig. 3 shows example results from film pattern optimization according to an embodiment of the invention.
- Fig. 4 shows the impact of training dataset size on the validation error according to an embodiment of the invention.
- Fig. 5 shows details of a surrogate model architecture according to an embodiment of the invention.
- Fig. 6 shows the U-Net architecture of the forward component of the surrogate model according to an embodiment of the invention.
- Fig. 7 shows a Zernike CNN inverse model architecture according to an embodiment of the invention.
- Fig. 8 shows an example of a film pattern and residual bow prediction, for the task of taking an input wafer bow signature and flattening the wafer, according to an embodiment of the invention.
- Wafer bow signature: The height (z) of a semiconductor wafer at each horizontal position on the wafer. A wafer is bowed due to various stresses accumulating during fabrication.
- The coordinate system used to define a wafer bow signature is standardized to give a unique description.
- The z data is fit using Zernike polynomials (as described below), then tilt is removed by subtracting the Z(1,1) and Z(1,-1) modes from the shape.
- The minimum z is subtracted so that all z values are positive, with a minimum height of 0.
- References to the wafer bow signature refer to the wafer bow prior to deposition of a corrective film pattern.
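The standardization steps above can be sketched in Python. `standardize_bow` is a hypothetical helper (not from the patent); for brevity it removes tilt with a best-fit plane, which plays the role of subtracting the Z(1,1) and Z(1,-1) modes, then shifts the minimum height to zero:

```python
import numpy as np

def standardize_bow(x, y, z):
    """Standardize a bow map: remove tilt (best-fit plane slopes), set min z to 0.

    x, y, z are flat arrays of wafer coordinates and heights.
    """
    # Least-squares fit of a plane z ~ c0 + c1*x + c2*y
    A = np.stack([np.ones_like(x), x, y], axis=1)
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    # Subtract only the tilt terms (the plane slopes), keeping the shape
    z_detilted = z - coef[1] * x - coef[2] * y
    # Shift so the minimum height is exactly 0
    return z_detilted - z_detilted.min()
```

With a symmetric measurement grid, a pure tilt component is removed exactly and any remaining bowl-like bow is preserved.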
- A corrective film pattern is a pattern of a corrective film applied onto a wafer in order to modify the wafer bow signature.
- A pattern is achieved by depositing a uniformly thick film, then selectively etching away many small areas of the film, leaving parts of the original film in place. Because the density of the etched-away areas can vary across the surface, the average percent area covered by the film in localized regions is a function of position on the surface of the film. For example, a 1 mm square region in one position on the surface may have 50% coverage, achieved by etching away 20 µm squares within the region to form a checkerboard where half of the 20 µm squares are etched away and half remain. Another 1 mm square region in another position on the surface may have 75% coverage, achieved by etching away a quarter of the squares.
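A minimal sketch of this percent-coverage idea, assuming a binary etch mask on a 20 µm grid so that a 1 mm region spans 50 × 50 cells (`local_coverage` is an illustrative helper, not from the patent):

```python
import numpy as np

def local_coverage(mask, region=50):
    """Average film coverage of each (region x region) block of a binary mask.

    mask[i, j] = 1 where film remains, 0 where it was etched away.
    """
    n = mask.shape[0] // region
    # Reshape into blocks and average within each block
    blocks = mask[:n * region, :n * region].reshape(n, region, n, region)
    return blocks.mean(axis=(1, 3))
```

A checkerboard mask therefore yields 50% coverage in every region, matching the example above.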
- Wafer shape transformation: The desired change in wafer bow signature due to deposition of a corrective film pattern.
- Typically, the wafer shape transformation is the transformation that will make the wafer flat, which is the negative of the wafer bow signature prior to deposition of a corrective film pattern.
- Alternatively, a specific wafer shape transformation that reduces higher-order bow and/or directly minimizes overlay error could be employed.
- Residual bow: The height (z) of a semiconductor wafer at each horizontal position on the wafer after deposition of a corrective film pattern to modulate the wafer bow.
- The data preprocessing to obtain a unique shape is analogous to that described for the wafer bow signature.
- Linear elasticity problem: A structural analysis problem where the linear elasticity mathematical model is assumed, i.e., the strain (deformation) of an elastic object is proportional to the applied stress.
- An elastic object is an object that would return to its original shape if the stress were removed (in contrast to yielding).
- Neural network: A machine learning model formed by an arrangement of nodes and activation functions that can learn a nonlinear function between inputs and outputs (see Section 3.3).
- Finite element method: A numerical approximation method for solving partial differential equations for 2D and 3D problems, used in many engineering domains (see Section 3.2).
- Optimization framework: A strategy to find a solution for the inverse of an FEM simulation by parameterizing a corrective film pattern with Zernike polynomials (see Section 3.4) and using an optimization algorithm to identify a suitable film pattern (see Section 4.3).
- Active learning feedback: Using a machine learning model to choose a batch of unlabeled data points that would give maximum improvement to the neural network if they were labeled, then labeling these data points (either using a simulator or experiments) and providing the results to the neural network for improvement (see Section 4.6).
- The test function v and the solution u belong to Hilbert spaces (infinite-dimensional function spaces), and an important component of the weak formulation is that it must hold for all test functions in the Hilbert space.
- The solution u belongs to the same Hilbert space as the test functions; we then look for an approximate solution u_h ≈ u in a finite-dimensional subspace of the Hilbert space.
- Our approximate solution can be expressed as a linear combination of a set of basis functions φ_i that span that subspace.
- Finite element analysis thus allows us to take a system governed by a partial differential equation (here, linear elasticity in three dimensions) and discretize the problem into elements to find an approximate solution by solving a linear set of equations. The finer the mesh, the greater the number of basis functions and the closer the approximate solution will be to the real solution.
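The discretization described above can be written compactly; here $a(\cdot,\cdot)$ and $L(\cdot)$ denote the bilinear and linear forms of the weak problem, and $\varphi_i$ are the basis functions:

```latex
u_h(x) = \sum_{i=1}^{N} c_i \, \varphi_i(x),
\qquad
\sum_{i=1}^{N} a(\varphi_i, \varphi_j)\, c_i = L(\varphi_j)
\quad \text{for } j = 1, \dots, N,
```

i.e., a linear system $K\mathbf{c} = \mathbf{f}$ with stiffness matrix $K_{ji} = a(\varphi_i, \varphi_j)$ and load vector $f_j = L(\varphi_j)$.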
- Neural networks are a general framework that arranges units, or nodes, in a pre-defined architecture to create a complex non-linear relationship between inputs and outputs.
- Each neural network has an input and an output layer, where the layer shape is dictated by the input and output type.
- The most common example of a neural network is a fully connected, feed-forward network with hidden layers in addition to the input and output layers (called a multi-layer perceptron). Values at each node are propagated to nodes in subsequent layers with an activation function that is parameterized with weights and biases.
- The hidden layers are not directly connected to inputs and outputs but are included to automatically extract features from the input layer which aid in output determination.
- In the process of training, a neural network is exposed to many labeled examples, i.e., input examples where the correct output is known. In a training iteration, gradient calculations and backpropagation modulate the weights and biases of each node to improve a predetermined loss function. After training, the weights and biases typically remain fixed, and the network can perform inference on unseen data using the non-linear function learned in training.
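The training iteration described above can be sketched with a tiny NumPy multi-layer perceptron (one hidden layer, tanh activation, squared-error loss); the layer sizes, seed, and learning rate are illustrative, not from the patent:

```python
import numpy as np

# Randomly initialized weights and biases for a 3-input, 4-hidden, 1-output MLP
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)) * 0.5, np.zeros(1)

def forward(x):
    """Propagate an input through the hidden layer to the output."""
    h = np.tanh(W1 @ x + b1)            # hidden layer with activation
    return W2 @ h + b2, h

def train_step(x, y, lr=0.05):
    """One training iteration: forward pass, backpropagation, gradient update."""
    global W1, b1, W2, b2
    y_hat, h = forward(x)
    err = y_hat - y                     # dLoss/dy_hat for 0.5*(y_hat - y)^2
    gW2, gb2 = np.outer(err, h), err    # output-layer gradients
    dh = (W2.T @ err) * (1 - h**2)      # backpropagate through tanh
    gW1, gb1 = np.outer(dh, x), dh      # hidden-layer gradients
    W2 -= lr * gW2; b2 -= lr * gb2      # gradient-descent weight update
    W1 -= lr * gW1; b1 -= lr * gb1
```

Repeated `train_step` calls on labeled examples reduce the loss, after which the fixed weights are used for inference.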
- Certain input and output types benefit from more sophisticated network architectures than the simple multi-layer perceptron.
- For 2D inputs such as images, convolutions are typically used to extract features.
- Filters are used to transform 2D input data into feature maps with various channels.
- Multiple convolution layers are employed to make feature maps from preceding feature maps, and often a fully connected output architecture is used to determine outputs from the final feature map layer.
- A convolutional neural network can be thought of as a regularized multi-layer perceptron; instead of each input pixel being fully connected to every node in the next layer, convolutions are used to extract features from an arrangement of the pixel with its neighboring pixels.
- The concepts from the multi-layer perceptron and convolutional neural networks provide the framework for the more sophisticated neural network architectures employed in this work.
- We use Zernike polynomials to parameterize a general wafer shape, describing wafer bow signatures, desired wafer shape transformations, or residual bow.
- The coefficient c_a represents the Zernike coefficient for each polynomial.
- As N becomes larger, the error between the true shape and the Zernike representation decreases.
- Thus a general wafer shape can be expressed in a parameterized way with N coefficients c_a.
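A sketch of this parameterization, assuming the standard low-order Zernike modes in polar coordinates; the six-mode list and the `fit_zernike` helper are illustrative, not from the patent:

```python
import numpy as np

def zernike_basis(rho, theta):
    """Evaluate six low-order Zernike modes at polar points (rho in [0, 1])."""
    return np.stack([
        np.ones_like(rho),             # Z(0,0)  piston
        rho * np.cos(theta),           # Z(1,1)  tilt x
        rho * np.sin(theta),           # Z(1,-1) tilt y
        2 * rho**2 - 1,                # Z(2,0)  defocus (bowl-like bow)
        rho**2 * np.cos(2 * theta),    # Z(2,2)  astigmatism
        rho**2 * np.sin(2 * theta),    # Z(2,-2) oblique astigmatism
    ], axis=-1)

def fit_zernike(rho, theta, z):
    """Least-squares fit of coefficients c_a so that z ~ sum_a c_a * Z_a."""
    A = zernike_basis(rho, theta)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs
```

A measured height map is thus reduced to a short coefficient vector c_a, and adding more modes (larger N) reduces the representation error.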
- Embodiments of the invention provide a process utilizing a finite element method (FEM) along with machine learning to optimize the corrective film pattern with a computation time relevant for on-line deployment.
- An FEM model solver 100 is built to solve the linear elasticity problem for the appropriate wafer and film geometry used in a particular semiconductor fabrication process step.
- A corrective film pattern optimization framework 102 is built on top of this FEM model solver 100 to optimize for the corrective film pattern that results in the greatest reduction in wafer bow.
- This FEM model and optimization framework are used to generate the corrective film patterns 106 from a dataset of corresponding desired wafer shape transformations 104.
- The dataset generation is typically accomplished by a simulation engineer using standard central processing units (CPUs). Dataset generation can be parallelized if desired.
- The dataset 106 is used together with the dataset of desired wafer shape transformations 104 to train a neural network in a machine learning surrogate model 108.
- This is typically performed by a machine learning engineer, who will also package the model for deployment in production. Training the surrogate model benefits from graphics processing unit (GPU) acceleration, so hardware that supports deep learning with GPUs is recommended.
- The surrogate model 108 defines the general model architecture, layer shapes, and hyperparameters, while the trained model 112 represents an instance of the surrogate model with the specific model weights that minimize the difference between predicted and actual wafer shape transformations (for the forward model) or minimize the residual wafer bow (for the compiled model).
- This trained model 112 is then delivered to and deployed in a semiconductor fabrication facility 110, where it is used to perform the same optimization task as the optimization framework, but in a small fraction of the computation time.
- A process engineer will use the packaged model in collaboration with the machine learning engineer.
- The computing system used for this last step should include hardware that can integrate with the tools in the fab and should have GPU capabilities for the retraining steps.
- Interested parties include companies that run semiconductor fabs, companies that provide equipment to semiconductor fabs, and companies that provide software to any of the equipment or fab companies.
- The trained model 112 may be retrained during deployment using a validation dataset of physical wafers (not simulated) to learn any data distribution shift between the simulated and actual wafer shape transformations.
- A specialized active learning approach can be used to optimally choose validation sample points for this retraining step.
- An on-line retraining scheme can be used to further reduce wafer bow or overlay error and account for any data drift.
- The linear elasticity partial differential system of equations is solved using FEM.
- The role of the FEM model solver 100 is to take a non-uniform film pattern 120, solve the linear elasticity partial differential equations 122 to determine the stresses that applying such a corrective film imposes on a wafer, and then determine the wafer shape transformation 124 undergone by the wafer due to those stresses.
- The wafer and corrective film are modeled as a disk with a non-uniform film, and the film stress is modeled by defining a temperature change and setting the coefficient of thermal expansion offset, calibrated to a known uniform film stress using Stoney’s equation.
- The maximum film thickness in the simulation is set using the full thickness of the printed corrective film, and the thickness pattern of corrective film across the wafer in the simulation is defined to replicate the percent coverage pattern of the printed corrective film.
- The film is discretized using a matrix, and a smooth cubic interpolation function is used on this discretized pattern to determine the precise thickness at each node in the FEM simulation.
- The matrix dimensions can be chosen based on the desired spatial resolution; dimensions on the order of 100-1000 will most likely be appropriate.
- The FEA system is specified using the partial differential equation system described in Section 3.1.
- A nonlinear solver employing the Gauss-Newton iteration scheme solves the FEM system to give the approximate result. Note that specifying only a point Dirichlet condition leads to an infinite solution set, because the solution could be tilted in an arbitrary direction.
- The FEM model incorporates knowledge of the specific silicon wafer used and information about the corrective film.
- The stiffness tensor C uses published material properties for the silicon crystal structure of interest, e.g., c-Si(100) or c-Si(111) [4]. Note that the wafer crystal structure is cubic and thus has anisotropic structural behavior, so the stiffness tensor is described with three parameters (rather than simply the elastic modulus and Poisson’s ratio as for isotropic materials).
- The wafer dimensions (thickness and radius) are specified based on the wafer used in the process of interest.
- The corrective film stress value is specified using the corrective film of interest in the process.
- The stress value is calibrated using a simple experiment: a uniform film is deposited at several known thicknesses, and the temperature-difference / coefficient-of-thermal-expansion product that achieves the measured wafer bow is defined.
- The printable area is specified using the limitations of the corrective film deposition tool (typically there is a region near the perimeter of the wafer where it is not feasible to deposit corrective film).
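The Stoney-equation calibration mentioned above can be sketched as follows: given a measured curvature for a uniform film of known thickness, back out the film stress (and vice versa). The material values below are typical for a silicon substrate and purely illustrative:

```python
# Illustrative substrate properties (roughly silicon); not from the patent.
E_s = 130e9     # substrate elastic modulus [Pa]
nu_s = 0.28     # substrate Poisson ratio
h_s = 775e-6    # substrate (wafer) thickness [m]

def stoney_stress(curvature, h_f):
    """Uniform film stress [Pa] from wafer curvature [1/m] via Stoney's equation."""
    return E_s * h_s**2 * curvature / (6 * (1 - nu_s) * h_f)

def stoney_curvature(stress, h_f):
    """Inverse relation: curvature induced by a uniform film of given stress."""
    return 6 * (1 - nu_s) * h_f * stress / (E_s * h_s**2)
```

Measuring bow at several known film thicknesses and inverting this relation fixes the effective stress used in the FEM model.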
- The corrective film pattern optimization framework 102 is built on top of the FEM solver 100, as shown in Fig. 1B.
- The optimization framework takes a dataset of wafer shapes 104 as input.
- The optimization framework finds the best corrective film pattern (parameterized by Zernike coefficients 126), as predicted by the FEM solver 100, that will achieve this shape transformation.
- The optimization framework generates, from the dataset 104 of wafer shape transformations, a dataset 106 of corresponding corrective film patterns.
- The desired wafer shape transformation 128 is typically the transformation that will minimize total wafer bow (the negative of the wafer bow signature prior to corrective film deposition), or a shape transformation known to minimize overlay error.
- The corrective film pattern is parameterized using a defined number of Zernike polynomials 126, where the c_a coefficients are the parameters defining the corrective film pattern.
- The framework checks in block 132 whether the current predicted wafer shape transformation 124 from the FEM solver 100 has converged sufficiently close to the desired shape transformation 128.
- The cost function minimized during the optimization is preferably defined as the absolute difference between the desired shape transformation 128 and the predicted wafer shape transformation 124.
- The Zernike coefficients input to the FEM solver 100 are optimized using the Levenberg-Marquardt algorithm 130. After each iteration of the algorithm, the shape transformation difference cost function is evaluated using the present film pattern.
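The optimization loop can be sketched as below. `fem_predict` is a toy stand-in for the FEM solver (which in the patent maps film-pattern Zernike coefficients to a wafer shape transformation), the Jacobian is built by finite differences, and the Levenberg-Marquardt damping is kept fixed for brevity:

```python
import numpy as np

def fem_predict(c):
    """Toy nonlinear stand-in for the FEM solver (coefficients -> shape)."""
    return np.array([c[0] + c[1]**2, c[0] * c[1], c[1]])

def levenberg_marquardt(target, c0, lam=1e-3, iters=50, eps=1e-6):
    """Find coefficients c so that fem_predict(c) matches the desired transformation."""
    c = c0.astype(float)
    for _ in range(iters):
        r = fem_predict(c) - target            # residual vs. desired transformation
        J = np.empty((r.size, c.size))
        for j in range(c.size):                # finite-difference Jacobian
            dc = c.copy()
            dc[j] += eps
            J[:, j] = (fem_predict(dc) - r - target) / eps
        # Damped Gauss-Newton (LM) step: (J^T J + lam*I) delta = -J^T r
        delta = np.linalg.solve(J.T @ J + lam * np.eye(c.size), -J.T @ r)
        c += delta
    return c
```

In the real framework each residual evaluation is a full FEM solve, which is why the trained surrogate model replaces this loop in production.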
- Fig. 3 shows an initial desired wafer shape transformation 300, input wafer shape transformation with first order bow component subtracted 302, and film pattern solution returned by film pattern optimization 304.
- The optimization framework 102 is one strategy to generate a dataset that is used to train the surrogate model 108. More generally, the training dataset can be (a) generated from real wafer measurements, (b) generated from FEM wafer bow simulations, and/or (c) generated using the corrective film optimization framework 102. For (a) and (b), a list of film patterns is generated, where the film patterns could be generated randomly using a Zernike coefficient basis and could have some bias toward the film patterns most likely to be employed during production. The advantage of (a) is that the data distribution will more closely resemble that of production. However, obtaining a large enough dataset to train the deep surrogate model from scratch using only experimental data may be infeasible, so strategy (b) may be preferred.
- Dataset generation strategies (a) and (b) have the disadvantage of not enumerating the space of possible wafer shape transformations directly; instead the possibilities are sampled in “inverse space”, i.e., in the film pattern space.
- Strategy (c) allows desired wafer shape transformations to be specified directly, which could also be accomplished using a random distribution of Zernike coefficients with bias toward the shape transformations most likely to be required in production. If 20 Zernike polynomials are used and the Levenberg-Marquardt algorithm takes on average 10 iterations to converge (each iteration requiring roughly one FEM evaluation per coefficient to build the Jacobian), strategy (c) will take ~200 times as long to generate a dataset of the same size as strategy (b).
- A train-validation-test split strategy may be used to estimate the performance of the surrogate model on unseen data.
- Graphs 400, 402, 404, 406, 408, 410 show the validation error vs epoch for training sizes of 125, 250, 500, 1000, 2000, 4000, respectively.
- a training dataset of —4000 allows for a mean absolute percentage error of less than 1%, while the surrogate model overfits to the training data when a dataset size of ⁇ 1000 is used.
- the train-validation strategy gives confidence that the model will generalize well to unseen data.
- the training dataset should be chosen with care such that all examples expected to be observed in production come from the same distribution. If the training set contains wafers with first order bow of 100-500 µm and maximum absolute higher order bow of 0-30 µm, then the model would perform well on wafer examples within those ranges (even if the exact shape has not been observed previously), yet it would likely perform poorly on corrections for wafers with bow signatures that fall well outside of these ranges.
- a dataset generation strategy combining all the above approaches may be preferable.
- a dataset containing corrective film patterns and corresponding wafer shape transformations from the simulation can be used to initially train the surrogate model from scratch. Then, a smaller “validation dataset” of real wafers can be used to understand the differences between the real and simulated scenarios. More details on how the validation dataset can be chosen to maximize performance for a limited dataset are provided in section 4.6.1.
4.5 Machine Learning Surrogate Model
- Fig. 1C shows details of the machine learning surrogate model training 108 and deployment 110 of that trained model 112 in production.
- the surrogate model employed here for the wafer bow problem is a deep neural network based upon the convolutional neural network architecture discussed in section 3.3.
- the input is a wafer bow shape 152
- the model 148 is used to infer 150 the best corrective film 154 to print to transform the wafer shape into a new shape that will minimize overlay error.
- one objective is to make the wafer as flat as possible which will reduce overlay error.
- while the FEM simulation can predict the wafer shape transformation based on the corrective film, determining the corrective film from the desired wafer shape transformation can be considered the inverse problem.
- our model has an inverse model-forward model architecture, where the inverse model 140 determines film pattern as output based on wafer shape transformation input and the forward model 142, 144 determines wafer shape transformation output based on the film pattern input.
- Fig. 5 provides additional details of the surrogate model architecture.
- the inverse model 500 uses a convolutional neural network 506 to output corrective film pattern 508 based on wafer shape transformation 504.
- the forward model 502 uses a convolutional neural network 510 to output wafer shape transformation 512 based on corrective film pattern 508.
- a desired wafer shape transformation is input to the surrogate model 150, where the desired wafer shape transformation is simply the negative of the wafer bow signature measured in 152 if the objective is to make the wafer flat. Then the model returns both the recommended corrective film pattern and the resulting shape transformation predicted upon application of that corrective film pattern. The corrective film pattern is then used in the semiconductor process to print a corrective film 154 (see Fig. 8). The predicted residual bow after deposition of corrective film is the difference between input and output shapes.
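The inference flow described above can be sketched end-to-end. Here `forward_model` and `inverse_model` are toy linear stand-ins for the trained networks (an assumption for illustration), chosen so the inverse is exact and the predicted residual bow is zero:

```python
import numpy as np

# Toy stand-in response: a real deployment would load the trained networks.
A = 0.9 * np.eye(16)                 # "film pattern -> shape transformation"

def forward_model(film):             # predicts wafer shape transformation
    return A @ film

def inverse_model(shape_transform):  # recommends a corrective film pattern
    return np.linalg.solve(A, shape_transform)

def recommend_correction(measured_bow):
    """Desired transformation is the negative of the measured bow
    signature when the objective is a flat wafer."""
    desired = -measured_bow
    film = inverse_model(desired)
    predicted = forward_model(film)
    # predicted residual bow after printing the corrective film
    residual_bow = measured_bow + predicted
    return film, residual_bow

rng = np.random.default_rng(1)
bow = rng.normal(size=16)
film, residual = recommend_correction(bow)
```

With an imperfect learned inverse, `residual` would be nonzero and would correspond to the predicted residual bow output in Fig. 8.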
- the forward model 142, 144 uses a convolutional UNet and the inverse model 140 uses a Zernike CNN. Both are described in detail below.
- the architecture used for the forward model 142, 144 is a convolutional UNet (Fig. 5, 502).
- This structure is a specialized case of an encoder-decoder model where the encoder down-samples into the bottleneck and the decoder up-samples to the resulting output array.
- the encoder part functions similarly to a typical convolutional neural network (CNN) with a series of convolution operations to extract features from the inputs.
- Fig. 6 details the UNet architecture of the forward model in the surrogate model. It includes symmetric skip connections at each layer which enable low frequency information to pass through from input to output. In the encoder/down-sampling section, in the first three layers the number of features is doubled each layer.
- each step up-samples the feature map followed by an up-convolution to reduce the number of feature channels and then concatenates with the skip connection from its sibling layer in the encoder section.
- each encoder and decoder unit is denoted with subscript “e” and “d” respectively.
- the encoder unit C_e64 denotes a 2D convolution layer with 64 filters, kernel size of 4x4, stride length of 2 (in each dimension), followed by batch normalization and Leaky ReLU activation.
- batch normalization standardizes the inputs to a layer over each mini-batch during training; it accelerates training, mitigates internal covariate shift, and provides some regularization. Note that the very first encoder layer does not employ batch normalization.
- the decoder unit C_d512 denotes a transposed 2D convolution layer with 512 filters, kernel size of 4x4, stride length of 2, followed by batch normalization and ReLU activation.
- the first several decoder layers also employ dropout for further regularization.
- the B layer denotes the bottleneck (a simple convolution layer) and the A denotes tanh activation to output. All layers have weights initialized with a random normal distribution.
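The shape and channel bookkeeping implied by the stride-2 encoder/decoder units above can be sketched as follows. The input resolution, the depth, and the filter counts beyond the first three doublings are assumptions for illustration:

```python
def unet_shapes(input_size=128, enc_filters=(64, 128, 256, 512, 512, 512, 512)):
    """Track (spatial size, channel count) through a stride-2 UNet with
    symmetric skip connections. The first three encoder layers double
    the number of features each layer."""
    enc = []
    size = input_size
    for f in enc_filters:            # encoder: each stride-2 conv halves H and W
        size //= 2
        enc.append((size, f))
    skips = enc[:-1][::-1]           # skip connections, bottom-up (bottleneck excluded)
    dec = []
    for (skip_size, skip_f) in skips:
        size *= 2                    # decoder: each step doubles H and W
        # concatenating with the sibling encoder layer doubles the channels
        dec.append((size, skip_f * 2))
    size *= 2                        # final up-sample back to the input resolution
    dec.append((size, 1))            # single-channel output map
    return enc, dec

enc, dec = unet_shapes()
```

Running this confirms the decoder mirrors the encoder and returns to the input resolution, which is what lets low-frequency information pass straight through the skip connections.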
- the above general architecture is used, yet specific hyperparameters are tuned by running many experiments with train-validation dataset split and choosing the set of hyperparameters that minimize the validation set error.
- the mean of the mean squared error is the metric used in network training (take the difference between predicted shape and actual shape, square it, take mean across shape, then take mean across samples).
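A minimal NumPy rendering of this metric:

```python
import numpy as np

def mean_of_mse(predicted, actual):
    """Difference between predicted and actual shape, squared, averaged
    across the shape, then averaged across samples.
    Arrays have shape (n_samples, height, width)."""
    per_sample = ((predicted - actual) ** 2).mean(axis=(1, 2))
    return per_sample.mean()

pred = np.zeros((4, 8, 8))
act = np.ones((4, 8, 8))
loss = mean_of_mse(pred, act)
```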
- the Adam optimizer is typically used for optimizing the network.
- Some hyperparameters that can be tuned by validation error examination include the number of encoder and decoder layers, the number of filters in each layer, the dropout fraction in dropout layers, the negative slope of the Leaky ReLU, the batch size, and the learning rate and beta parameters for the Adam optimizer.
- the compiled surrogate model 146 at inference time sends inputs to the inverse model 140 then to the forward model 144.
- the first step is providing the desired dataset to the forward model, where the film pattern is the input to the UNet and the wafer shape is the output. Then, the hyperparameters can be tuned and the forward model can be trained until the forward model performance is satisfactory. The weights of the forward model can then be frozen, and the model can be used to train the inverse and compiled models as described below.
- the forward model 142 is first trained using the film pattern as the input and the wafer shape transformation as the output. Then, the inverse 140 and compiled 146 model are trained by loading the pre-trained forward model, freezing the weights in the forward model layers 144, then training the compiled model using the wafer shape transformation as both input and output.
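A toy linear analogue may help illustrate the freeze-then-train step: with the forward model frozen as a fixed matrix F, fitting the inverse model W so the compiled model F·W·s reproduces each desired transformation s has a closed-form least-squares solution. This is an illustration of the training objective only, not the actual network training:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
F = rng.normal(size=(n, n))      # pre-trained forward model, weights frozen
S = rng.normal(size=(n, 100))    # desired wafer shape transformations (columns)

# Train only the inverse model W to minimize ||F @ (W @ s) - s||^2;
# for a full-rank linear forward model this is the pseudoinverse of F.
W = np.linalg.pinv(F)

recon = F @ (W @ S)              # compiled model: inverse then frozen forward
err = ((recon - S) ** 2).mean()  # mean of the mean squared error
```

The key point mirrored here is that only the inverse weights (W) change while the forward response (F) is held fixed, so the inverse model is forced to learn patterns the frozen forward model maps back to the desired shapes.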
- a Zernike CNN is used as the inverse model. Basically, this is a CNN with multiple convolutional layers with a fully connected output, which is similar to a CNN that could be used for image classification. The difference is that the units in the last dense layer in the fully connected output provide the Zernike coefficients that are used to build the film pattern shape according to eq. 13 above.
- the Zernike CNN strategy allows for regularization of the film pattern output to bias towards smooth film patterns that are practically attainable.
- the details of the Zernike CNN inverse model are shown in Fig. 7.
- the input wafer shape transformation is sent to a series of convolution layers to create feature maps, where C64 denotes a 2D convolution layer with 64 filters, kernel size of 3x3, stride length of 1 (in each dimension), with ReLU activation and 2D max pooling with pool size of 2.
- the output is flattened then fully connected to a dense layer, where D64 denotes a dense layer with 64 units and dropout.
- the D64 layer is fully connected to a second dense layer which contains the N Zernike coefficients (the c coefficients in equation 13).
- the “Z” layer constructs the film pattern using equation 13 with the c coefficients from the previous layer.
- the result is sent to tanh activation
- the Zernike CNN returns a film pattern given a wafer shape transformation.
- the resulting film pattern is then sent into the pre-trained forward model (which returns a wafer shape transformation given a film pattern).
- the compiled model is trained by minimizing the difference between the input shape of the inverse model and the output shape from the forward model (again, the mean of the mean squared error is typically used as the error function).
- the forward model weights are frozen during training of the compiled model so that only the weights in the inverse model are modulated.
- the inverse/compiled model also uses a training- validation split to choose a set of hyperparameters to minimize validation error.
- Fig. 8 illustrates an example of a wafer bow signature 800 input and the corrective film pattern 802 and the predicted residual bow output 804 generated by the surrogate model.
- the surrogate model prediction time for a single instance is ~0.1 seconds (3-4 orders of magnitude faster than FEM), and the prediction time is even faster when many wafer shape transformations are processed at the same time.
- the Zernike CNN provides greater shape regularization (bias towards smooth shapes) while the UNet is more versatile to fit a 2D function with higher variance/noise.
- both the UNet and the Zernike CNN will perform well as the forward model, where the output directly impacts the network cost function, and thus the best choice will depend on training dataset size and compute resources.
- the UNet will likely allow for a closer fit to a greater variety of wafer shape transformations.
- when the inverse model is trained (where the inverse model output does not directly impact the cost function), the regularization and bias toward smooth 2D functions that the Zernike CNN provides seem beneficial, though this could also depend on the precise dataset.
- Another model concept is a conditional generative adversarial network (cGAN) [25], such as the pix2pix model [24].
- a GAN model has a different training strategy where a generator and discriminator model try to fool each other, and both get better over time.
- the task of the generator model here is to generate an image that is a realistic pair to some input image, while the task of the discriminator is to classify input image pairs as real or fake (where the fake pairs are provided by the generator).
- the generator could have a UNet architecture, and the discriminator could have a simple CNN architecture for binary classification (real or fake).
- Another strategy is to use patches so that rather than determining if an entire image pair is real or fake, this is done on small patches across the image [24].
- the cGAN strategy has many benefits for image-to-image translation tasks (where here, image-to-image translation could mean film pattern to wafer shape transformation or wafer shape transformation to film pattern translation).
- the adversarial loss preserves high frequency “sharpness” in the images, in contrast to models that are trained using mean squared error which blur high frequency information.
- Another advantage is that in cases where there are multiple plausible result images that are equally valid (as perhaps in the case of the inverse wafer bow problem), the cGAN will provide one distinct good solution rather than an average of various possible good solutions.
- our experiments show that the UNet and Zernike CNN models trained with mean squared error give better results and have more stable training than cGAN approaches.
- Another model concept is the probabilistic encoder-decoder. Examples of these strategies include the conditional variational autoencoder [26,27] and the probabilistic UNet [28]. In these approaches, the result is a probability distribution at each position on the result 2D array rather than a precise shape.
- the compiled model strategy involves taking the expected value of film pattern and wafer shape transformation calculations, so the benefit of a probabilistic approach is not directly evident.
- using a probabilistic model could enable benefits in the active learning and on-line retraining steps described below.
- the model can be further improved through re-training using data from actual wafers. These improvements can be realized either pre-production using a validation dataset or on-line in production.
- a secondary dataset containing metrology on physical wafers can be used to learn any differences between the simulated wafer bow behavior and the behavior of real wafers.
- the dataset is likely much smaller than the simulated dataset because the cost per sample is much greater.
- a specialized active learning approach is used to choose the best film patterns to get the greatest model improvement for a small number of examples in the validation dataset.
- Active learning is a field of machine learning where unlabeled data is abundant but labeled examples are scarce.
- An uncertainty estimator is used to determine the unlabeled examples with maximum uncertainty, and these are chosen to be labeled by an oracle with the assumption that these examples will provide the maximum benefit to the model.
- our case differs somewhat from the typical active learning problem because a) the distribution of data is different between the original training and the new labels from the oracle, and b) the compiled surrogate model is deterministic (no probability distribution available).
- one strategy is an auxiliary-model approach, where a probabilistic auxiliary model is trained on the error in the validation dataset; samples for the next batch are then chosen using a combination of high error and high uncertainty.
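One hypothetical rendering of such a batch-selection rule (the min-max normalization and the mixing weight `alpha` are assumptions for illustration, not from the disclosure):

```python
import numpy as np

def select_batch(errors, uncertainties, k=8, alpha=0.5):
    """Score candidate film patterns by a weighted mix of auxiliary-model
    error and uncertainty, and return the indices of the top-k candidates
    to print for the next validation batch."""
    def normalize(v):
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    score = alpha * normalize(errors) + (1 - alpha) * normalize(uncertainties)
    return np.argsort(score)[::-1][:k]   # highest combined score first

errors = np.array([0.1, 0.9, 0.3, 0.7])
unc = np.array([0.2, 0.1, 0.9, 0.8])
batch = select_batch(errors, unc, k=2)
```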
- the active learning model suggests batches of film patterns to print for validation, then updates the surrogate model with this new validation data, then suggests a new batch for validation. This process repeats until model performance on the validation data is satisfactory.
4.6.2 Model Improvements on-line during production
- the surrogate model retraining 158 will be provided with consistent feedback in the form of downstream metrology results 156 from a subset of wafers. This data can be used to monitor the surrogate model performance and retrain and update the production model 150 as necessary.
- a retraining policy will be implemented that specifies batch size, sample weight, and model training hyperparameters (e.g. optimizer learning rate, number of training epochs, model freeze layers, etc.).
- training-validation-test splits within the new data will be used to determine benefit over currently deployed models (where the validation set is used to determine the best re-training policy, then the test set is used to estimate performance of the new model on unseen data).
- the process owner will be alerted that the new model is available and can decide when to deploy the update. This process enables a surrogate model that is robust to dataset drift in a dynamic fabrication environment.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22833915.6A EP4364190A1 (en) | 2021-06-27 | 2022-06-20 | Method for determining corrective film pattern to reduce semiconductor wafer bow |
KR1020247003096A KR20240027069A (en) | 2021-06-27 | 2022-06-20 | Method for determining calibration film patterns to reduce semiconductor wafer bow |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/359,626 US20220415683A1 (en) | 2021-06-27 | 2021-06-27 | Method for determining corrective film pattern to reduce semiconductor wafer bow |
US17/359,626 | 2021-06-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023278190A1 true WO2023278190A1 (en) | 2023-01-05 |
Family
ID=84542491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/034138 WO2023278190A1 (en) | 2021-06-27 | 2022-06-20 | Method for determining corrective film pattern to reduce semiconductor wafer bow |
Country Status (4)
Country | Link |
---|---|
US (1) | US20220415683A1 (en) |
EP (1) | EP4364190A1 (en) |
KR (1) | KR20240027069A (en) |
WO (1) | WO2023278190A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080182344A1 (en) * | 2007-01-30 | 2008-07-31 | Steffen Mueller | Method and system for determining deformations on a substrate |
US20170351952A1 (en) * | 2016-06-01 | 2017-12-07 | Kla-Tencor Corporation | Systems and methods incorporating a neural network and a forward physical model for semiconductor applications |
US20180068859A1 (en) * | 2016-09-05 | 2018-03-08 | Tokyo Electron Limited | Location-Specific Tuning of Stress to Control Bow to Control Overlay In Semiconductor Processing |
- 2021-06-27: US 17/359,626 filed (US20220415683A1, pending)
- 2022-06-20: PCT/US2022/034138 filed (WO2023278190A1, application filing)
- 2022-06-20: EP 22833915.6 filed (EP4364190A1, pending)
- 2022-06-20: KR 1020247003096 filed (KR20240027069A)
Non-Patent Citations (1)
Title |
---|
SCHONFELD EDGAR; SCHIELE BERNT; KHOREVA ANNA: "A U-Net Based Discriminator for Generative Adversarial Networks", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 13 June 2020 (2020-06-13), pages 8204 - 8213, XP033803473, DOI: 10.1109/CVPR42600.2020.00823 * |
Also Published As
Publication number | Publication date |
---|---|
US20220415683A1 (en) | 2022-12-29 |
EP4364190A1 (en) | 2024-05-08 |
KR20240027069A (en) | 2024-02-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI632627B (en) | Process-induced distortion prediction and feedforward and feedback correction of overlay errors | |
Takeishi et al. | Learning Koopman invariant subspaces for dynamic mode decomposition | |
CN110678961B (en) | Simulating near field images in optical lithography | |
CN107908071A (en) | A kind of optical adjacent correction method based on neural network model | |
CN111627799A (en) | Method for manufacturing semiconductor element | |
WO2020158646A1 (en) | Depth superresolution device, depth superresolution method, and program | |
Naimipour et al. | UPR: A model-driven architecture for deep phase retrieval | |
CN113238460B (en) | Deep learning-based optical proximity correction method for extreme ultraviolet | |
Lichtenstein et al. | Deep eikonal solvers | |
Molinari et al. | Iterative regularization for convex regularizers | |
Awad et al. | Accurate prediction of EUV lithographic images and 3D mask effects using generative networks | |
Malachivskyy et al. | Uniform approximation of functions of two variables | |
US20220415683A1 (en) | Method for determining corrective film pattern to reduce semiconductor wafer bow | |
JP6942203B2 (en) | Data processing system and data processing method | |
Shi et al. | Physics based feature vector design: a critical step towards machine learning based inverse lithography | |
Shim et al. | Machine learning for mask synthesis | |
Singh et al. | Multioutput FOSLS Deep Neural Network for Solving Allen–Cahn Equation | |
Deka et al. | Path-integral formula for computing Koopman eigenfunctions | |
Valkanas et al. | A neural network based electromagnetic simulator | |
Jia et al. | Stochastic gradient descent for robust inverse photomask synthesis in optical lithography | |
CN112712140A (en) | Method for training domain fitting model based on principal curvature approximation | |
Yang et al. | ILILT: Implicit Learning of Inverse Lithography Technologies | |
Duruisseaux et al. | An Operator Learning Framework for Spatiotemporal Super-resolution of Scientific Simulations | |
Woldeamanual et al. | Accurate prediction of EUV lithographic images and 3D mask effects using generative networks | |
Huberman et al. | Deep Accurate Solver for the Geodesic Problem |
Legal Events
- 121: the EPO has been informed by WIPO that EP was designated in this application (Ref document: 22833915, Country: EP, Kind: A1)
- ENP: Entry into the national phase (Ref document: 2023579408, Country: JP, Kind: A)
- ENP: Entry into the national phase (Ref document: 20247003096, Country: KR, Kind: A)
- WWE: WIPO information, entry into national phase (Ref document: 1020247003096, Country: KR)
- WWE: WIPO information, entry into national phase (Ref document: 2022833915, Country: EP)
- NENP: Non-entry into the national phase (Country: DE)
- ENP: Entry into the national phase (Ref document: 2022833915, Country: EP, Effective date: 2024-01-29)