CN113537517B

CN113537517B - Defect detection model training method, device, equipment and storage medium

Info

Publication number: CN113537517B
Application number: CN202111083479.6A
Authority: CN
Inventors: 钱程浩; 黄雪峰; 熊海飞
Original assignee: Shenzhen Xinrun Fulian Digital Technology Co Ltd
Current assignee: Shenzhen Xinrun Fulian Digital Technology Co Ltd
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2022-01-07
Anticipated expiration: 2041-09-16
Also published as: CN113537517A

Abstract

The invention discloses a defect detection model training method, device, equipment and storage medium. The method comprises the following steps: acquiring local training pictures and reference model parameters; after configuring the model parameters of the local defect detection model to be trained as the reference model parameters, performing local iterative training on the defect detection model to be trained using the local training pictures to obtain local gradient values; sending the local gradient values to a service node, so that the service node averages the gradient values sent by the computing nodes to obtain a total gradient value, updates the reference model parameters using the total gradient value, and sends the updated reference model parameters to each computing node for use as the reference model parameters required by the next local training; and, when convergence of the defect detection model to be trained is detected, taking the converged defect detection model to be trained as the target defect detection model. The invention thereby increases the amount of training data and improves model accuracy even though the computing power and memory of a single machine are limited.

Description

Defect detection model training method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a defect detection model training method, a defect detection model training device, defect detection model training equipment and a storage medium.

Background

At present, defect detection based on deep learning is applied in many fields, such as metal components, cloth and silk fabrics, building cracks and steel bar cracks, and has achieved good results. For the model to converge and generalize well, training a deep-learning-based defect detection model requires a very large amount of data, so that reading data becomes a very time-consuming part of multi-round training. In addition, to make the features expressed by the deep network richer, deeper and more complicated network structures are designed. A larger training data volume and a larger model structure place higher demands on the computing power and memory of the equipment; conversely, limits on the computing power and memory of the equipment lead to low model accuracy in the trained defect detection model.

Disclosure of Invention

The invention mainly aims to provide a defect detection model training method, a defect detection model training device, defect detection model training equipment and a storage medium, so as to solve the technical problem that the accuracy of a trained defect detection model is low because of the limited computing power and memory of the equipment on which it is trained.

In order to achieve the above object, the present invention provides a defect detection model training method, which is applied to each computing node in a distributed cluster, wherein each computing node is deployed with the same defect detection model to be trained, and the method comprises the following steps:

local training pictures required by the local training are obtained from a distributed file system in the distributed cluster, and reference model parameters of the defect detection model to be trained required by the local training are obtained from service nodes in the distributed cluster;

after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by adopting the local machine training picture to obtain local gradient values of all model parameters in the to-be-trained defect detection model of the local machine;

sending the local gradient value to the service node, so that the service node can average the received gradient values sent by the computing nodes to obtain a total gradient value, updating the reference model parameter by using the total gradient value, and distributing the updated reference model parameter to the computing nodes so that the computing nodes can use the updated reference model parameter as a reference model parameter required by the next local training;

in each local training process, when the convergence of the to-be-trained defect detection model of the local machine is detected, the converged to-be-trained defect detection model is used as a target defect detection model, and the image defect detection is carried out based on the target defect detection model.

Optionally, after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local machine training picture to obtain local gradient values of the model parameters in the to-be-trained defect detection model of the local machine, where the step of obtaining the local gradient values of the model parameters includes:

after configuring the model parameters of the defect detection model to be trained of the local machine as the reference model parameters, detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state;

if the defect detection model to be trained of the local machine is determined to enter the pre-convergence state, adding a first preset round number on the basis of the round number of local iterative training in the last local training to obtain a target round number, wherein the round number of the local iterative training in the first local training is set to be 1;

and performing local iterative training of the target round number on the to-be-trained defect detection model of the local machine by using the local machine training picture to obtain a local machine gradient value of each model parameter in the to-be-trained defect detection model of the local machine.

Optionally, the step of detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state includes:

detecting whether the gradient change values of the defect detection model to be trained of the local machine in the most recent second preset number of rounds of historical local iterative training are all smaller than a preset value, wherein a gradient change value refers to the change of the gradient values calculated in one round of local iterative training relative to the gradient values calculated in the previous round of local iterative training;

if the gradient change values are all smaller than the preset value, determining that the defect detection model to be trained of the local machine enters a pre-convergence state;

if not, determining that the defect detection model to be trained of the local machine does not enter the pre-convergence state.
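
The pre-convergence check and the adjustment of the round number described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the use of the largest per-parameter absolute difference as the "gradient change value", and the choice to leave the round number unchanged when the model is not yet in the pre-convergence state are all assumptions made for the example.

```python
from typing import List, Sequence


def gradient_change(curr: Sequence[float], prev: Sequence[float]) -> float:
    """Largest absolute per-parameter change between two rounds (assumed metric)."""
    return max(abs(c - p) for c, p in zip(curr, prev))


def in_pre_convergence(history: List[Sequence[float]],
                       window: int, threshold: float) -> bool:
    """True when the last `window` gradient changes are all below `threshold`.

    `history` holds the gradient values of successive local iterative training
    rounds; `window` plays the role of the second preset round number.
    """
    if len(history) < window + 1:
        return False  # not enough rounds recorded to compare yet
    recent = history[-(window + 1):]
    changes = [gradient_change(recent[i + 1], recent[i]) for i in range(window)]
    return all(change < threshold for change in changes)


def target_rounds(last_rounds: int, pre_converged: bool, increment: int) -> int:
    """Add the first preset round number once pre-convergence is reached.

    Keeping the previous round number otherwise is an assumption of this sketch.
    """
    return last_rounds + increment if pre_converged else last_rounds


if __name__ == "__main__":
    history = [[1.0, 1.0], [0.5, 0.4], [0.49, 0.41], [0.485, 0.405]]
    state = in_pre_convergence(history, window=2, threshold=0.05)
    print(state, target_rounds(1, state, increment=2))
```

Starting from one round in the first local training, the round number grows only after the gradients have stabilised, which reduces how often the node has to communicate with the service node late in training.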

Optionally, after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local machine training picture to obtain local gradient values of the model parameters in the to-be-trained defect detection model of the local machine, the method further includes:

acquiring the current communication bandwidth of the local machine, and calculating, from the communication bandwidth and the data volume of the local gradient values, the predicted communication time for feeding the local gradient values of the local training back to the service node;

acquiring the pre-recorded training time of a single local training of the local machine, and calculating the proportion of the predicted communication time relative to the training time;

if the duration proportion is larger than a first preset proportion, discarding the local gradient values;

and if the duration proportion is not greater than the first preset proportion, executing the step of sending the local gradient value to the service node.
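
The decision of whether to send or discard the local gradient values can be illustrated with a short sketch. The function and parameter names are hypothetical, and the predicted communication time is assumed here to be simply the data volume divided by the bandwidth; the source does not give the exact formula.

```python
def predicted_comm_seconds(gradient_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Predicted time to upload the local gradient values (assumed: size / bandwidth)."""
    return gradient_bytes / bandwidth_bytes_per_s


def should_send(gradient_bytes: int, bandwidth_bytes_per_s: float,
                training_seconds: float, max_ratio: float) -> bool:
    """Send only if communication time / training time does not exceed `max_ratio`.

    `max_ratio` plays the role of the first preset proportion.
    """
    ratio = predicted_comm_seconds(gradient_bytes, bandwidth_bytes_per_s) / training_seconds
    return ratio <= max_ratio


if __name__ == "__main__":
    # 100 MB of gradients, 12.5 MB/s link, 40 s per local training, limit 0.5
    print(should_send(100_000_000, 12_500_000.0, 40.0, 0.5))
```

A node on a slow link therefore skips its upload for that local training instead of delaying the aggregation at the service node.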

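Optionally, after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local machine training picture to obtain local gradient values of the model parameters in the to-be-trained defect detection model of the local machine, where the step of obtaining the local gradient values of the model parameters includes:

configuring model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters;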

inputting the training picture of the local machine into a feature extraction layer of the defect detection model to be trained of the local machine for feature extraction to obtain a feature map;

inputting the feature map into an SSD detector of the local defect detection model to be trained to obtain defect classification scores and defect position coordinates of a defect area in a local training picture;

calculating to obtain a current gradient value of a loss function of the defect detection model to be trained relative to a current model parameter in the defect detection model to be trained of the local machine according to the defect classification score and the defect position coordinate so as to complete a round of local iterative training of the local training;

detecting whether the number of rounds of local iterative training performed in the local training reaches a third preset number of rounds;

if so, taking the current gradient value as a local gradient value;

and if not, updating model parameters in the defect detection model to be trained of the local machine according to the current gradient value, and returning to execute the step of inputting the training picture of the local machine into a feature extraction layer of the defect detection model to be trained of the local machine to extract features to obtain a feature map.
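
The control flow of the steps above (compute a gradient, stop if the preset round number is reached, otherwise update the parameters and repeat) can be sketched as follows. This is only an illustration under stated assumptions: a toy linear model with a mean squared error loss stands in for the feature extraction layer and SSD detector, the parameter update is assumed to be a plain gradient descent step with a hypothetical learning rate `lr`, and all names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

Params = List[float]
Batch = Sequence[Tuple[float, float]]


def mse_grad(params: Params, batch: Batch) -> Params:
    """Gradient of mean squared error for y = w * x + b (toy stand-in model)."""
    w, b = params
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


def local_training(reference_params: Params,
                   grad_fn: Callable[[Params, Batch], Params],
                   batch: Batch, rounds: int, lr: float) -> Params:
    """Run `rounds` local iterations and return the last gradient as the local gradient value."""
    if rounds < 1:
        raise ValueError("at least one round of local iterative training is required")
    params = list(reference_params)  # configure the local model with the reference parameters
    for current_round in range(1, rounds + 1):
        grads = grad_fn(params, batch)
        if current_round == rounds:
            return grads  # preset round number reached: this is the local gradient value
        # otherwise update the local parameters and run the next round
        params = [p - lr * g for p, g in zip(params, grads)]
    raise AssertionError("unreachable")


if __name__ == "__main__":
    batch = [(1.0, 2.0), (2.0, 4.0)]
    print(local_training([0.0, 0.0], mse_grad, batch, rounds=1, lr=0.1))
```

Note that only the gradient of the final round is reported to the service node; the intermediate parameter updates stay on the computing node.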

Optionally, the step of obtaining a local training picture required by the local training from the distributed file system in the distributed cluster includes:

local training pictures needed by the local training are obtained from a distributed file system in the distributed cluster, wherein the distributed file system selects a single-computer training picture set corresponding to each computing node in the local training from a total training picture set, the single-computer training picture sets are correspondingly distributed to the computing nodes, and an intersection exists between the single-computer training picture sets corresponding to every two computing nodes, and the intersection ratio is not smaller than a second preset ratio.

Optionally, in each local training process, after the step of taking the converged defect detection model to be trained as the target defect detection model when it is detected that the defect detection model to be trained of the local machine converges, the method further includes:

acquiring a target picture of a defect to be detected;

inputting the target picture into the target defect detection model for defect detection to obtain the defect type and the defect position in the target picture;

and after marking the defect position in the target picture, outputting and displaying the target picture and the defect type.

In order to achieve the above object, the present invention further provides a defect detection model training apparatus, where the apparatus is deployed in each computing node in a distributed cluster, and each computing node is deployed with the same defect detection model to be trained, and the apparatus includes:

the acquisition module is used for acquiring local training pictures required by the local training from a distributed file system in the distributed cluster, and acquiring reference model parameters of the defect detection model to be trained required by the local training from service nodes in the distributed cluster;

the training module is used for configuring model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, and then performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by adopting the local machine training picture to obtain local gradient values of all model parameters in the to-be-trained defect detection model of the local machine;

a sending module, configured to send the local gradient value to the service node, so that the service node averages the local gradient values sent by the computing nodes to obtain a total gradient value after receiving the local gradient values, updates the reference model parameter by using the total gradient value, and distributes the updated reference model parameter to the computing nodes so that the computing nodes use the updated reference model parameter as a reference model parameter required for local training next time;

and the determining module is used for taking the converged defect detection model to be trained as a target defect detection model when the convergence of the defect detection model to be trained of the local machine is detected in each local training process so as to detect the image defects based on the target defect detection model.

In order to achieve the above object, the present invention also provides a defect detection model training apparatus, including: a memory, a processor, and a defect detection model training program stored on the memory and executable on the processor, the defect detection model training program when executed by the processor implementing the steps of the defect detection model training method as described above.

Furthermore, to achieve the above object, the present invention further provides a computer readable storage medium having a defect detection model training program stored thereon, which when executed by a processor implements the steps of the defect detection model training method as described above.

In the invention, a defect detection model to be trained is deployed on each computing node of a distributed cluster. Each computing node acquires the local training pictures required by the current local training from a distributed file system in the distributed cluster, and acquires the reference model parameters of the defect detection model to be trained required by the current local training from a service node in the distributed cluster; after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performs at least one round of local iterative training on the to-be-trained defect detection model of the local machine using the local training pictures to obtain local gradient values of all model parameters in the to-be-trained defect detection model of the local machine; sends the local gradient values to the service node, so that the service node, after receiving the gradient values sent by the computing nodes, averages them to obtain a total gradient value, updates the reference model parameters using the total gradient value, and distributes the updated reference model parameters to the computing nodes for use as the reference model parameters required by the next local training; and, in each local training process, when convergence of the to-be-trained defect detection model of the local machine is detected, takes the converged to-be-trained defect detection model as the target defect detection model, so that picture defect detection is carried out based on the target defect detection model.
In this way, all computing nodes jointly train the defect detection model by sending their local gradient values to the service node for aggregation after each local training. Where a single machine has limited computing power and memory, the training task of the defect detection model is thus spread across multiple machines, so that more training pictures can be used for training, the amount of training data is increased, and the model accuracy of the trained defect detection model is improved.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart illustrating a training method of a defect detection model according to a first embodiment of the present invention;

FIG. 3 is a functional block diagram of a training apparatus for defect detection models according to a preferred embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

It should be noted that, the defect detection model training device in the embodiment of the present invention is a computing node in a distributed cluster, and may be a device such as a smart phone, a personal computer, and a server, which is not limited herein. And the same defect detection model to be trained is deployed on each computing node in the distributed cluster.

As shown in fig. 1, the defect detection model training apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration of the apparatus shown in FIG. 1 does not constitute a limitation of the defect detection model training apparatus, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a type of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a defect detection model training program. The operating system is a program that manages and controls the hardware and software resources of the device, supporting the operation of the defect detection model training program and other software or programs. In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with a server; and processor 1001 may be configured to invoke a defect detection model training program stored in memory 1005 and perform the following operations:

Further, after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local machine training picture, and obtaining local gradient values of the model parameters in the to-be-trained defect detection model of the local machine includes:

Further, the detecting whether the to-be-trained defect detection model of the local computer enters a pre-convergence state comprises:

Further, after configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local training picture to obtain local gradient values of the model parameters in the to-be-trained defect detection model of the local machine, and then the processor 1001 may be further configured to invoke a defect detection model training program stored in the memory 1005 to perform the following operations:

if so, taking the current gradient value as a local gradient value;

Further, the obtaining of the local training picture required by the local training from the distributed file system in the distributed cluster includes:

Further, in each local training process, after the converged defect detection model to be trained is taken as the target defect detection model when it is detected that the to-be-trained defect detection model of the local machine converges, the processor 1001 may be further configured to call the defect detection model training program stored in the memory 1005 and perform the following operations:

acquiring a target picture of a defect to be detected;

Based on the above structure, various embodiments of the defect detection model training method are provided.

Referring to fig. 2, fig. 2 is a flowchart illustrating a defect detection model training method according to a first embodiment of the present invention.

While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown. In this embodiment, the defect detection model training method is applied to each computing node in a distributed cluster, and each computing node is deployed with the same defect detection model to be trained. In this embodiment, the defect detection model training method includes:

step S10, local training pictures needed by the local training are obtained from the distributed file system in the distributed cluster, and reference model parameters of the defect detection model to be trained needed by the local training are obtained from the service nodes in the distributed cluster;

In this embodiment, the defect detection model to be trained may be any commonly used defect detection model, which is not limited in this embodiment. The model parameters of the defect detection model to be trained are initialized empirically or at random and must be updated iteratively over multiple rounds during training; training stops when the defect detection model to be trained converges, yielding the trained target defect detection model.

The distributed cluster comprises a plurality of computing nodes, at least one service node and at least one distributed file system. To solve the problem that training a defect detection model on a single machine is limited by that machine's computing power and memory, in this embodiment the defect detection model to be trained is deployed to each computing node in advance; that is, the same defect detection model to be trained is deployed on every computing node, and the computing nodes train it jointly.

The training process is divided into a plurality of local trainings. In each local training, every computing node trains its own copy of the defect detection model to be trained locally; after each local training ends, the training results of the computing nodes are aggregated once at the service node and serve as the basis of the next local training. Since all computing nodes perform the same operations during local training, the following description takes a single computing node as an example.

Specifically, after a local training is started, the computing node obtains the local training pictures required by this local training from the distributed file system. It should be noted that a total training picture set is stored in the distributed file system. The total training picture set comprises a plurality of training pictures, including pictures of defective products and pictures of non-defective products, which form the positive and negative training examples. Each local training of a computing node requires a batch of training pictures, and the batch that the computing node acquires from the distributed file system for the current local training is called the local training pictures. In a specific embodiment, the local training pictures acquired by any two computing nodes in one local training may be completely different or partially different (that is, an intersection is allowed). Because each computing node trains the defect detection model to be trained with different training pictures, the amount of training pictures used in one local training equals the sum of the local training pictures of all computing nodes, which greatly increases the amount of training data, while for a single computing node the amount of training pictures remains within what the computing power and memory of a single machine can bear. The local training pictures that one computing node acquires in different local trainings may likewise be completely different or partially different.

Further, in an embodiment, the distributed file system selects, from the total training picture set, a stand-alone training picture set for each computing node in the current local training, and distributes each stand-alone training picture set to the corresponding computing node. That is, when a local training is started, the distributed file system selects for each computing node the training picture set required by that computing node in the current local training, namely its stand-alone training picture set, and sends it to the computing node; from the point of view of the computing node, the local training pictures required by the current local training are obtained from the distributed file system. The stand-alone training picture sets selected by the distributed file system satisfy the condition that an intersection exists between the stand-alone training picture sets of every two computing nodes and that the intersection ratio is not smaller than a second preset proportion. The second preset proportion is set in advance as needed, for example to 20%. The purpose of requiring an intersection between every two stand-alone training picture sets, with an intersection ratio not less than a given proportion, is to ensure that the training picture sets used by the computing nodes in one local training share some of the same training pictures. This avoids the situation in which the training data of the computing nodes differ so much that their gradient descent directions deviate excessively and the model has difficulty converging, and thus ensures the training success rate of the defect detection model to be trained.
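
One simple way to satisfy the selection condition above is to give every computing node the same shared subset of pictures plus its own private pictures. The sketch below illustrates this under stated assumptions: the function name and parameters are hypothetical, the intersection ratio is assumed to mean the size of the intersection divided by the size of one stand-alone training picture set (the source does not define it), and the shared-subset strategy is only one of many selections that meet the condition.

```python
import math
import random
from typing import List, Sequence


def assign_picture_sets(all_pictures: Sequence[str], num_nodes: int,
                        per_node: int, min_overlap: float,
                        seed: int = 0) -> List[List[str]]:
    """Build one picture set per node so that every pair shares >= min_overlap * per_node pictures."""
    shared_count = math.ceil(min_overlap * per_node)  # pictures common to all nodes
    private_count = per_node - shared_count           # pictures unique to each node
    needed = shared_count + private_count * num_nodes
    if needed > len(all_pictures):
        raise ValueError("total training picture set is too small")
    shuffled = list(all_pictures)
    random.Random(seed).shuffle(shuffled)
    shared = shuffled[:shared_count]
    picture_sets = []
    for node in range(num_nodes):
        start = shared_count + node * private_count
        picture_sets.append(shared + shuffled[start:start + private_count])
    return picture_sets


if __name__ == "__main__":
    pictures = [f"img_{i}.png" for i in range(100)]
    sets = assign_picture_sets(pictures, num_nodes=4, per_node=10, min_overlap=0.2)
    overlap = len(set(sets[0]) & set(sets[1])) / 10
    print(len(sets), overlap)
```

With a second preset proportion of 20% and 10 pictures per node, every pair of nodes shares 2 pictures while the remaining 8 pictures of each node are distinct.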

After one local training is started, the computing node obtains the reference model parameters of the defect detection model to be trained, which are required by the local training, from the service node. The reference model parameter means that the local training is performed based on the reference model parameter. It should be noted that, during the first local training, the service node may distribute the initialized model parameters to each computing node as the reference model parameters of the first local training.

Step S20, after configuring the model parameters of the defect detection model to be trained of the local machine as the reference model parameters, performing at least one round of local iterative training on the defect detection model to be trained of the local machine by using the training picture of the local machine to obtain local gradient values of the model parameters in the defect detection model to be trained of the local machine;

After obtaining the local training pictures and the reference model parameters required by the current local training, the computing node configures the model parameters of the defect detection model to be trained of the local machine as the reference model parameters. It should be noted that the defect detection model to be trained mentioned hereafter is the model after the reference model parameters have been configured.

The computing node then performs at least one round of local iterative training on the defect detection model to be trained of the local machine using the local training pictures. In one local training, a computing node performs at least one round of local iterative training locally. The specific number of rounds may be preset; the number of rounds may be the same or different from one local training to the next, and may be the same or different across computing nodes. In a specific embodiment, when the number of rounds of local iterative training is set small, the computing node and the service node interact frequently, which avoids gradient deviation on the computing node during local training and helps the model converge, but the frequent interaction increases the communication cost between the computing node and the service node. When the number of rounds is set large, the communication cost between the computing node and the service node is reduced, but too many rounds may cause gradient deviation and prevent the model from converging. By setting the number of rounds of local iterative training reasonably, gradient deviation can be avoided and convergence of the model ensured while some communication cost is saved; for example, 10 rounds of local iterative training may be set for one local training.

One round of local iterative training may proceed as follows: the computing node inputs the local training pictures into the defect detection model to be trained of the local machine to obtain a defect detection result, and calculates, from the defect detection result, the gradient value of the loss function of the defect detection model to be trained with respect to each current model parameter, which completes one round of local iterative training. If no further round of local iterative training is to be performed, this gradient value is taken as the local gradient value obtained by the current local training. If a further round is required, the current model parameters of the defect detection model to be trained are updated according to the gradient value, and the next round of local iterative training is performed on the model with the updated parameters. The loss function may be the loss function of a conventional defect detection model and is not limited herein.

After at least one round of local iterative training, the computing node obtains the local gradient value of each model parameter in the defect detection model to be trained of the local machine. Since the gradient values obtained by local training on the local machine differ from those obtained by local training on other computing nodes, they are called local gradient values.

Step S30, sending the local gradient value to the service node, so that the service node may average the local gradient values after receiving the gradient values sent by the computing nodes to obtain a total gradient value, update the reference model parameter with the total gradient value, and distribute the updated reference model parameter to the computing nodes so that the computing nodes may use the updated reference model parameter as a reference model parameter required for the next local training;

After the local training is finished, the computing node sends the local gradient value obtained by this local training to the service node. The service node receives the gradient values obtained by local training from each computing node and averages them to obtain a total gradient value. It should be noted that the defect detection model to be trained generally has a plurality of model parameters, and each model parameter corresponds to one gradient value, so that the gradient values sent to the service node by the plurality of computing nodes include a plurality of gradient values for each model parameter; the service node averages the plurality of gradient values corresponding to each model parameter to obtain the total gradient value corresponding to that model parameter.

And after the service node calculates to obtain a total gradient value, updating the reference model parameter by using the total gradient value, distributing the updated reference model parameter to each computing node, and taking the received updated reference model parameter as the reference model parameter required by the next local training by each computing node.
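A minimal sketch of the service-node side is given below for illustration. The per-parameter averaging follows the description above; the gradient-descent form of the update and the `learning_rate` parameter are assumptions, since the text only states that the reference model parameter is updated using the total gradient value. The function names are hypothetical.

```python
from typing import List, Sequence


def average_gradients(node_gradients: Sequence[Sequence[float]]) -> List[float]:
    """Average, parameter by parameter, the gradient values sent by the computing nodes."""
    if not node_gradients:
        raise ValueError("no gradient values were received")
    count = len(node_gradients)
    # zip(*...) groups the gradient values that belong to the same model parameter.
    return [sum(values) / count for values in zip(*node_gradients)]


def update_reference(
    reference_params: Sequence[float],
    total_gradient: Sequence[float],
    learning_rate: float = 0.1,  # assumed hyperparameter, not given in the text
) -> List[float]:
    """Update the reference model parameters with the total gradient value (assumed SGD step)."""
    return [p - learning_rate * g for p, g in zip(reference_params, total_gradient)]


# Example: two computing nodes, two model parameters.
total = average_gradients([[1.0, 2.0], [3.0, 6.0]])
new_reference = update_reference([1.0, 1.0], total, learning_rate=0.5)
```

The updated list `new_reference` is what the service node would distribute to every computing node as the reference model parameters for the next local training.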

Step S40, in each local training process, when the defect detection model to be trained of the local machine is detected to be converged, the converged defect detection model to be trained is used as a target defect detection model, and picture defect detection is carried out based on the target defect detection model.

In each local training process, the computing node detects whether the local defect detection model to be trained has converged. Specifically, after obtaining the gradient value in each round of local iterative training, the computing node may detect whether the change of that gradient value relative to the gradient value of the previous round is smaller than a preset value, and determine convergence if so and non-convergence otherwise. Alternatively, after calculating the total gradient value in each local training, the service node may detect whether the change of the total gradient value relative to the total gradient value of the last local training is smaller than a preset value, determine convergence if so and non-convergence otherwise, and feed back the result to the computing node.
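The convergence test can be sketched as follows. The text does not specify how the change between two gradient values is measured; taking the largest absolute per-parameter difference is an assumption of this sketch, as are the function names.

```python
from typing import Sequence


def gradient_change(current: Sequence[float], previous: Sequence[float]) -> float:
    """Change between two gradient values, assumed here to be the largest
    absolute difference over all model parameters."""
    return max(abs(c - p) for c, p in zip(current, previous))


def has_converged(
    current: Sequence[float], previous: Sequence[float], preset_value: float
) -> bool:
    """Convergence is determined when the change is smaller than the preset value."""
    return gradient_change(current, previous) < preset_value
```

The same function applies to either variant of the check: the computing node passes the gradient values of two consecutive rounds, or the service node passes the total gradient values of two consecutive local trainings.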

And when the computing node detects that the defect detection model to be trained of the local machine is converged, finishing training, and taking the converged defect detection model to be trained as a target defect detection model. After the target defect detection model is obtained, the computing node may perform image defect detection by using the target defect detection model, specifically, may take an image of an object to be detected for a defect, and input the image as the image to be detected into the target defect detection model for detection to obtain a defect detection result, for example, a result indicating whether the defect exists.

Further, in an embodiment, after the step S40, the method further includes:

step a, obtaining a target picture of a defect to be detected;

b, inputting the target picture into the target defect detection model for defect detection to obtain the defect type and the defect position in the target picture;

and c, after marking the defect position in the target picture, outputting and displaying the target picture and the defect type.

And after the target defect detection model is obtained, the computing node can acquire the target picture and detect the defects of the target picture. Specifically, the target picture may be input into the target defect detection model for defect detection, so as to obtain a defect type and a defect position in the target picture. Wherein, the defect category refers to what category of defects the defect belongs to, such as scratches, cracks, and the like; the training pictures used for training the defect detection model to be trained can be pictures which are collected in advance and contain different types of defects, so that the target defect detection model obtained through training can detect the different types of defects. The defect position refers to a position of the defect region in the picture, and may be represented by a coordinate range in the picture. It should be noted that, when no defect is detected, the defect type and the defect position are null; when a plurality of defects are detected, the defect category includes a type of each defect, and the corresponding defect location includes a location of each defect. Marking the defect position in the target picture after the defect type and the defect position of the target picture are obtained; the marking may be by marking the defect locations in the picture with colored boxes or by filling the defect locations with a conspicuous color. Outputting the target picture marked with the defect position and the defect type together for displaying; the defect type and the target picture can be output separately, or the defect type is marked at the corresponding defect position in the target picture, and then the target picture with the defect position mark and the defect type is output.
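For illustration, marking the defect positions with boxes can be sketched as below. The `Defect` record, the grayscale image stored as a list of pixel rows, the inclusive `(x_min, y_min, x_max, y_max)` box format and the `marker` value are assumptions of this sketch; the embodiment only states that a position may be represented by a coordinate range in the picture.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Defect:
    category: str                    # e.g. "scratch" or "crack"
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixels


def mark_defects(
    image: Sequence[Sequence[int]], defects: Sequence[Defect], marker: int = 255
) -> List[List[int]]:
    """Return a copy of the image with the outline of each defect box drawn in `marker`."""
    marked = [list(row) for row in image]
    for defect in defects:
        x0, y0, x1, y1 = defect.box
        for x in range(x0, x1 + 1):      # top and bottom edges
            marked[y0][x] = marker
            marked[y1][x] = marker
        for y in range(y0, y1 + 1):      # left and right edges
            marked[y][x0] = marker
            marked[y][x1] = marker
    return marked


picture = [[0] * 5 for _ in range(5)]
detections = [Defect("scratch", (1, 1, 3, 3))]
marked_picture = mark_defects(picture, detections)
categories = [d.category for d in detections]  # output together with the marked picture
```

When no defect is detected, `detections` is empty and the picture is returned unchanged, which corresponds to the case where the defect type and defect position are null.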

In this embodiment, by deploying the defect detection model to be trained on each computing node of the distributed cluster, each computing node obtains a local training picture required by this local training from the distributed file system in the distributed cluster, and obtains a reference model parameter of the defect detection model to be trained required by this local training from the service node in the distributed cluster; after configuring model parameters of a to-be-trained defect detection model of the local machine as reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by adopting a local machine training picture to obtain local gradient values of all model parameters in the to-be-trained defect detection model of the local machine; sending the local gradient value to a service node, so that the service node can obtain a total gradient value after receiving the gradient values sent by each computing node, updating the reference model parameter by adopting the gradient values, and distributing the updated reference model parameter to each computing node so that each computing node can adopt the updated reference model parameter as a reference model parameter required by the next local training; in each local training process, when the convergence of the to-be-trained defect detection model of the local machine is detected, the converged to-be-trained defect detection model is used as a target defect detection model, and the image defect detection is carried out based on the target defect detection model. 
In the embodiment, the defect detection model training is performed by all the computing nodes together in a mode that all the computing nodes send the local gradient values to the service nodes for gathering after completing the local training, and the training task of the defect detection model is expanded to be performed by multiple computers under the condition of limited computing capacity and limited memory of a single computer, so that more training pictures can be adopted for training, the data volume of training data is increased, and the model precision of the defect detection model obtained by training is improved.

Further, based on the first embodiment, a second embodiment of the defect detection model training method of the present invention is provided, in this embodiment, the step S20 includes:

step S201, after configuring the model parameters of the defect detection model to be trained of the local machine as the reference model parameters, detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state;

After the reference model parameters are configured, the computing node may dynamically determine the number of rounds of local iterative training that need to be performed in the local training.

Specifically, the computing node may first detect whether the local defect detection model to be trained has entered a pre-convergence state. The pre-convergence state refers to a state in which the model is about to enter the converged state; the condition under which the pre-convergence state is determined to have been entered may be preset.

Step S202, if the defect detection model to be trained of the local machine is determined to enter the pre-convergence state, adding a first preset round number on the basis of the round number of local iterative training in the last local training to obtain a target round number, wherein the round number of the local iterative training in the first local training is set to be 1;

If the computing node determines that the defect detection model to be trained has entered the pre-convergence state, it adds a first preset round number to the number of rounds of local iterative training in the last local training to obtain the number of rounds for this local training (hereinafter referred to as the target round number). The first preset round number is preset as needed, for example 2, which means that after the pre-convergence state is entered, the number of rounds of local iterative training increases in an arithmetic progression from one local training to the next. The number of rounds of local iterative training in the first local training is set to 1 so as to avoid gradient deviation in the local training of the computing nodes at the beginning of training.

Further, in an embodiment, an upper limit round number may be set, and the number of rounds of local iterative training is not increased after the number of rounds of local iterative training is greater than the upper limit round number in the last local training.

Step S203, local iterative training of the target number of rounds is carried out on the to-be-trained defect detection model of the local machine by adopting the local machine training picture, and local machine gradient values of model parameters in the to-be-trained defect detection model of the local machine are obtained.

After the target round number is determined, the computing node performs the target round number of rounds of local iterative training on the local defect detection model to be trained, using the local training picture, to obtain the local gradient value of each model parameter in the local defect detection model to be trained. That is, after each round of local iterative training, the computing node detects whether the number of rounds completed in this local training has reached the target round number; if so, the gradient value obtained in that round is taken as the local gradient value of this local training, and if not, the next round of local iterative training is performed.

If the calculation node determines that the defect detection model to be trained does not enter the pre-convergence state, the number of rounds of local iterative training in the last local training can be used as the number of rounds of local iterative training in the current local training.
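The round-number rule of steps S202 and S203, including the optional upper limit, can be sketched as follows. The function name, the use of `None` to denote the first local training and the default increment of 2 (taken from the example value above) are assumptions of this sketch.

```python
from typing import Optional


def target_round_number(
    previous_rounds: Optional[int],
    pre_converged: bool,
    first_preset_rounds: int = 2,
    upper_limit: Optional[int] = None,
) -> int:
    """Number of rounds of local iterative training for this local training."""
    if previous_rounds is None:
        # First local training: one round, to avoid gradient deviation early on.
        return 1
    if not pre_converged:
        # Not in the pre-convergence state: keep the last round number.
        return previous_rounds
    if upper_limit is not None and previous_rounds > upper_limit:
        # The last round number already exceeds the upper limit: stop increasing.
        return previous_rounds
    return previous_rounds + first_preset_rounds


# After pre-convergence is entered the round number grows as 1, 3, 5, 7, ...
schedule = [1]
for _ in range(3):
    schedule.append(target_round_number(schedule[-1], pre_converged=True))
```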

Further, in an embodiment, the step of detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state in step S201 includes:

step S2011, detecting whether the gradient change values of the defect detection model to be trained of the local machine in the most recent second preset number of rounds of historical local iterative training are all smaller than a preset value, wherein a gradient change value refers to the change of the gradient value calculated in one round of local iterative training relative to the gradient value calculated in the previous round of local iterative training;

step S2012, if the gradient change values are all smaller than the preset value, determining that the defect detection model to be trained of the local machine enters the pre-convergence state;

and step S2013, if the gradient change values are not all smaller than the preset value, determining that the defect detection model to be trained of the local machine does not enter the pre-convergence state.

The computing node records the gradient value calculated in each round of local iterative training, and calculates the change of the gradient value calculated in the current round relative to the gradient value calculated in the previous round, as the gradient change value of the current round. The computing node can detect whether the gradient change values in the most recent second preset number of rounds of historical local iterative training performed by the local machine are all smaller than a preset value. The preset value can be set empirically, for example to 1.25e-4. It should be noted that, if convergence of the model is detected by comparing the gradient change value with a value (referred to as a target value for distinction), the target value should be smaller than the preset value; that is, the pre-convergence state is determined when the gradient change value is smaller than the preset value and larger than the target value, and the converged state is determined once the gradient change value is smaller than the target value. The second preset number of rounds may be preset as needed, for example 10 rounds, in which case it is detected whether the gradient change values of the most recent 10 rounds of local iterative training are all smaller than the preset value.

If the gradient change values in the historical local iterative training of the latest second preset round number are all smaller than the preset value, the calculation node determines that the defect detection model to be trained of the local machine enters the pre-convergence state, and otherwise, determines that the defect detection model to be trained of the local machine does not enter the pre-convergence state.
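The pre-convergence check of steps S2011 to S2013 can be sketched as below, using the example values from the text (10 rounds, 1.25e-4) as defaults. Treating a history shorter than the second preset number of rounds as "not pre-converged" is an assumption of this sketch, as is the function name.

```python
from typing import Sequence


def entered_pre_convergence(
    gradient_changes: Sequence[float],
    second_preset_rounds: int = 10,
    preset_value: float = 1.25e-4,
) -> bool:
    """Return True when the gradient change values of the most recent
    `second_preset_rounds` rounds are all smaller than `preset_value`."""
    if len(gradient_changes) < second_preset_rounds:
        # Assumption: too little history means the state has not been entered.
        return False
    recent = gradient_changes[-second_preset_rounds:]
    return all(change < preset_value for change in recent)
```

Here `gradient_changes` is the recorded list of gradient change values, one per historical round of local iterative training, ordered from oldest to newest.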

In this embodiment, the computing node dynamically adjusts the number of rounds of local iterative training in each local training according to the convergence condition of the defect detection model to be trained, gradually increasing the number of rounds once the model is in the pre-convergence state. In the early stage, this avoids gradient deviation during local training and the resulting failure of the model to converge; in the later stage, once the convergence direction of the model has stabilized, increasing the number of rounds reduces the communication cost between the computing node and the service node. The whole training process can thus both ensure effective convergence of the model and reduce the communication cost.

Further, based on the first and/or second embodiments, a third embodiment of the defect detection model training method of the present invention is provided, in this embodiment, after step S20, the method further includes:

step S50, obtaining the current communication bandwidth of the local machine, and calculating, according to the communication bandwidth and the data size of the local gradient value, the predicted communication duration for returning the gradient of this local training to the service node;

In this embodiment, after computing the local gradient value of this local training, the computing node may first determine whether the local gradient value needs to be sent to the service node.

Specifically, after computing the local gradient value of this local training, the computing node may obtain the current communication bandwidth of the local machine and calculate, according to the communication bandwidth and the data size of the local gradient value, the predicted communication duration for returning the gradient of this local training to the service node, that is, the communication time expected to be spent.

Step S60, obtaining the pre-recorded training duration of a single local training of the local machine, and calculating the duration proportion of the predicted communication duration relative to the training duration;

if the duration proportion is greater than a first preset proportion, executing step S70, and discarding the local gradient value;

if the duration proportion is not greater than the first preset proportion, step S30 is executed.

The computing node may record the training duration of a single local training of the local machine. Specifically, the training durations of the local trainings performed so far may be averaged to serve as the training duration of a single local training of the local machine, or the duration spent on this local training may be used directly. The computing node calculates the duration proportion of the predicted communication duration relative to the training duration of a single local training of the local machine, that is, duration proportion = predicted communication duration / training duration of a single local training of the local machine.

The computing node may detect whether the duration proportion is greater than a first preset proportion. The first preset proportion may be set empirically, for example to 1/10. When the duration proportion is greater than the first preset proportion, the predicted communication duration is relatively close to the duration of a single local training of the local machine; if the service node waited for this computing node to send its gradient value, the communication would take a long time and significantly slow the overall training of the defect detection model to be trained. In this case, the computing node may discard the local gradient value, that is, not send it to the service node, and proceed directly to the next local training. The service node aggregates only the gradient values it receives. Further, if the computing node discards the local gradient value of this local training, it may send a short signal to the service node to notify the service node that this computing node does not participate in the aggregation of this local training result. When the duration proportion is not greater than the first preset proportion, the communication duration of the computing node does not greatly affect the overall training speed of the defect detection model to be trained, and the computing node may execute step S30, that is, send the local gradient value to the service node.
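Steps S50 to S70 amount to the following arithmetic, sketched here for illustration. Estimating the predicted communication duration as data size divided by bandwidth is an assumption (the text only says it is calculated from the two quantities), as are the units and the function names; the default proportion of 1/10 is the example value given above.

```python
def predicted_communication_seconds(
    gradient_bytes: float, bandwidth_bytes_per_second: float
) -> float:
    """Assumed estimate: time to send the local gradient value at the current bandwidth."""
    return gradient_bytes / bandwidth_bytes_per_second


def should_discard_gradient(
    gradient_bytes: float,
    bandwidth_bytes_per_second: float,
    training_seconds: float,
    first_preset_proportion: float = 0.1,
) -> bool:
    """Return True when the local gradient value should be discarded (step S70),
    False when it should be sent to the service node (step S30)."""
    communication = predicted_communication_seconds(
        gradient_bytes, bandwidth_bytes_per_second
    )
    duration_proportion = communication / training_seconds
    return duration_proportion > first_preset_proportion


# A 100 MB gradient over a 10 MB/s link takes about 10 s to send.
slow_link = should_discard_gradient(100_000_000, 10_000_000, training_seconds=60.0)
fast_enough = should_discard_gradient(100_000_000, 10_000_000, training_seconds=200.0)
```

With a 60 s local training the proportion is about 0.17, so the gradient is discarded; with a 200 s local training it is 0.05, so the gradient is sent.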

Further, in an embodiment, the step S20 includes:

step S204, configuring model parameters of the defect detection model to be trained of the local machine as the reference model parameters;

In this embodiment, the defect detection model to be trained may include a feature extraction layer and an SSD (Single Shot MultiBox Detector) detector, where the feature extraction layer performs feature extraction, and the SSD detector performs defect classification and defect position detection based on the result of the feature extraction. The feature extraction layer may adopt a CNN, for example VGG16; VGG16 is a deep feature extraction network with 16 weight layers, whose 13 convolution layers are grouped into 5 large convolution blocks, and is used for extracting feature information from a picture. The model parameters of the defect detection model to be trained are the model parameters in the feature extraction layer and in the SSD detector.

Step S205, inputting the training picture of the local machine into a feature extraction layer of the defect detection model to be trained of the local machine for feature extraction to obtain a feature map;

step S206, inputting the feature map into an SSD detector of the defect detection model to be trained of the local machine to obtain defect classification scores and defect position coordinates of a defect area in a training picture of the local machine;

step S207, calculating to obtain a current gradient value of a loss function of the defect detection model to be trained relative to a current model parameter in the defect detection model to be trained of the local machine according to the defect classification score and the defect position coordinate, so as to complete a round of local iterative training of the local training;

and the computing node performs at least one round of local iterative training on the defect detection model to be trained of the local machine by adopting the local machine training picture.

One round of local iterative training may proceed as follows: the computing node inputs the local training picture into the feature extraction layer of the local defect detection model to be trained for feature extraction to obtain a feature map, and then inputs the feature map into the SSD detector of the local defect detection model to be trained for detection to obtain the defect classification scores and defect position coordinates of the defect regions in the local training picture. The defect classification score refers to the probability score of a defect region belonging to each defect category, and the defect position coordinate refers to the position coordinate of the defect region in the local training picture. The computing node then calculates, according to the defect classification scores and the defect position coordinates, the gradient value of the loss function relative to each current model parameter of the local defect detection model to be trained (hereinafter referred to as the current gradient value for distinction), so as to complete one round of local iterative training of this local training.

Step S208, detecting whether the number of rounds of local iterative training performed in the local training reaches a third preset number of rounds;

after completing one local iterative training to obtain the current gradient value, the computing node firstly detects whether the number of rounds of local iterative training performed by the local training reaches a third preset number of rounds. The third preset round number may be preset, or may be a local iteration training round number of the local training this time, which is determined according to the increment method in the second embodiment.

If yes, executing step S209 to use the current gradient value as a local gradient value;

if not, updating the model parameters in the defect detection model to be trained of the local machine according to the current gradient value, and then returning to execute the step S205.

And if the third preset number of rounds is reached, the calculation node determines that the local training is finished, and the current gradient value is used as the local gradient value of the local training. And if the number of the third preset rounds is not reached, the calculation node determines that the local training is not finished, and enters the next round of local iterative training after updating the model parameters in the defect detection model to be trained of the local machine according to the current gradient value, namely, the local machine training picture is input into the feature extraction layer of the defect detection model to be trained after the model parameters are updated by the local machine for feature extraction.

In addition, an embodiment of the present invention further provides a defect detection model training apparatus, where the apparatus is deployed in each computing node in a distributed cluster, and each computing node is deployed with a same defect detection model to be trained, and with reference to fig. 3, the apparatus includes:

an obtaining module 10, configured to obtain a local training picture required by the current local training from a distributed file system in the distributed cluster, and obtain a reference model parameter of the defect detection model to be trained required by the current local training from a service node in the distributed cluster;

the training module 20 is configured to configure the model parameters of the defect detection model to be trained of the local machine as the reference model parameters, and then perform at least one round of local iterative training on the defect detection model to be trained of the local machine by using the local machine training picture to obtain local gradient values of the model parameters in the defect detection model to be trained of the local machine;

a sending module 30, configured to send the local gradient value to the service node, so that the service node averages the gradient values sent by the computing nodes to obtain a total gradient value, updates the reference model parameter by using the total gradient value, and distributes the updated reference model parameter to the computing nodes so that the computing nodes use the updated reference model parameter as a reference model parameter required for next local training;

and the determining module 40 is configured to, in each local training process, when it is detected that the to-be-trained defect detection model of the local machine is converged, use the converged to-be-trained defect detection model as a target defect detection model, and perform image defect detection based on the target defect detection model.

Further, the training module 20 includes:

the first detection unit is used for detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state after the model parameters of the defect detection model to be trained of the local machine are configured as the reference model parameters;

the first calculation unit is used for adding a first preset round number on the basis of the round number of local iterative training in the last local training to obtain a target round number if the defect detection model to be trained of the local machine is determined to enter the pre-convergence state, wherein the round number of the local iterative training in the first local training is set to be 1;

and the training unit is used for carrying out local iterative training of the target round number on the to-be-trained defect detection model of the local machine by adopting the local machine training picture to obtain the local machine gradient value of each model parameter in the to-be-trained defect detection model of the local machine.

Further, the first detection unit is further configured to:

Further, the apparatus further comprises:

the first calculation module is used for obtaining the current communication bandwidth of the local machine, and calculating, according to the communication bandwidth and the data size of the local gradient value, the predicted communication duration for returning the gradient of this local training to the service node;

the second calculation module is used for acquiring the pre-recorded training time of single local training of the local computer and calculating the time length proportion of the estimated communication time length relative to the training time length;

the discarding module is used for discarding the local gradient value if the duration proportion is greater than a first preset proportion;

the sending module 30 is further configured to send the local gradient value to the service node if the duration ratio is not greater than the first preset ratio.

Further, the training module 20 includes:

the configuration unit is used for configuring model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters;

the extraction unit is used for inputting the training picture of the local machine into a feature extraction layer of the defect detection model to be trained of the local machine to carry out feature extraction so as to obtain a feature map;

the input unit is used for inputting the feature map into an SSD detector of the to-be-trained defect detection model of the local machine to obtain defect classification scores and defect position coordinates of a defect area in the training picture of the local machine;

the second calculation unit is used for calculating to obtain a current gradient value of a loss function of the defect detection model to be trained relative to a current model parameter in the defect detection model to be trained of the local machine according to the defect classification score and the defect position coordinate so as to complete a round of local iterative training of the local training;

the second detection unit is used for detecting whether the number of rounds of local iterative training performed by the local training reaches a third preset number of rounds;

the determining unit is used for taking the current gradient value as the local gradient value if the third preset number of rounds is reached;

and the extraction unit is further used for, if the third preset number of rounds is not reached, updating the model parameters in the defect detection model to be trained of the local machine according to the current gradient value, and then returning to the step of inputting the training picture of the local machine into the feature extraction layer of the defect detection model to be trained of the local machine for feature extraction to obtain a feature map.

Further, the obtaining module 10 includes:

the acquisition unit is used for acquiring local training pictures required by the local training from a distributed file system in the distributed cluster, wherein the distributed file system selects a single-machine training picture set corresponding to each computing node in the local training from a total training picture set, correspondingly distributes each single-machine training picture set to each computing node, and an intersection exists between the single-machine training picture sets corresponding to every two computing nodes, and the intersection ratio is not smaller than a second preset proportion.

Further, the obtaining module 10 is further configured to obtain a target picture of the defect to be detected;

the device further comprises:

the defect detection module is used for inputting the target picture into the target defect detection model for defect detection to obtain the defect type and the defect position in the target picture;

and the output module is used for marking the defect position in the target picture and then outputting and displaying the target picture and the defect type.

The specific implementation of the training apparatus for defect detection models of the present invention is basically the same as the above-mentioned embodiments of the training method for defect detection models, and is not described herein again.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a defect detection model training program is stored on the storage medium, and when being executed by a processor, the defect detection model training program implements the steps of the defect detection model training method as described above.

The embodiments of the defect detection model training apparatus and the computer-readable storage medium of the present invention can refer to the embodiments of the defect detection model training method of the present invention, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent processes derived from the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

1. A defect detection model training method, characterized in that the method is applied to each computing node in a distributed cluster, the same defect detection model to be trained being deployed in each computing node, and the method comprises the following steps:

in each local training process, when the convergence of the to-be-trained defect detection model of the local machine is detected, taking the converged to-be-trained defect detection model as a target defect detection model to perform picture defect detection based on the target defect detection model;

after the step of configuring the model parameters of the to-be-trained defect detection model of the local machine as the reference model parameters, performing at least one round of local iterative training on the to-be-trained defect detection model of the local machine by using the local machine training picture to obtain local gradient values of the model parameters in the to-be-trained defect detection model of the local machine, the method further includes:

2. The method for training the defect detection model according to claim 1, wherein the step of performing at least one local iteration training on the defect detection model to be trained by using the local training picture after configuring the model parameters of the defect detection model to be trained of the local machine as the reference model parameters to obtain the local gradient values of the model parameters in the defect detection model to be trained of the local machine comprises:

3. The defect detection model training method of claim 2, wherein the step of detecting whether the defect detection model to be trained of the local machine enters a pre-convergence state comprises:

4. The method for training the defect detection model according to claim 1, wherein the step of performing at least one local iteration training on the defect detection model to be trained by using the local training picture after configuring the model parameters of the defect detection model to be trained of the local machine as the reference model parameters to obtain the local gradient values of the model parameters in the defect detection model to be trained of the local machine comprises:

if so, taking the current gradient value as a local gradient value;

5. The defect detection model training method of claim 1, wherein the step of obtaining local training pictures required for the local training from the distributed file system in the distributed cluster comprises:

6. The defect detection model training method according to any one of claims 1 to 5, wherein after the step of taking the converged defect detection model to be trained as the target defect detection model when convergence of the defect detection model to be trained of the local machine is detected during each local training, the method further comprises:

acquiring a target picture of a defect to be detected;

7. A defect detection model training device is characterized in that the device is deployed in each computing node in a distributed cluster, and each computing node is deployed with the same defect detection model to be trained, and the device comprises:

the determining module is used for taking the converged defect detection model to be trained as a target defect detection model when the convergence of the defect detection model to be trained of the local machine is detected in each local training process, so as to perform picture defect detection based on the target defect detection model;

the device further comprises:

the sending module is further configured to send the local gradient value to the service node if the duration ratio is not greater than the first preset ratio.

8. A defect detection model training apparatus, characterized by comprising: a memory, a processor, and a defect detection model training program stored on the memory and executable on the processor, the defect detection model training program, when executed by the processor, implementing the steps of the defect detection model training method of any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a defect detection model training program, which when executed by a processor implements the steps of the defect detection model training method according to any one of claims 1 to 6.