CN110879994A

CN110879994A - Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Info

Publication number: CN110879994A
Application number: CN201911213392.9A
Authority: CN
Inventors: 张兆翔; 张驰; 叶阳阳
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2020-03-13

Abstract

The invention belongs to the field of computer deep reinforcement learning and pattern recognition, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism, aiming at solving the problems that the precision of a single-stage detector is lower than that of a two-stage detector, and the two-stage detector consumes much time and is not suitable for a real-time system. The invention comprises the following steps: representing the point cloud data by three-dimensional grid voxels; extracting features and coding a space sparse feature map; extracting different scale features after projecting to a top view; adopting a deconvolution layer merging characteristic; extracting a shape attention feature map through attention weight and a convolution coding layer; and acquiring the target category, the target position, the target size and the target direction through a target classification network and a regression positioning network. The invention uses a sampling strategy based on distance constraint and an attention mechanism based on shape prior, relieves the instability caused by uneven data distribution, improves the problem that a single-stage detector lacks shape prior, and has high precision, short time consumption, strong real-time performance and good robustness.

Description

Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Technical Field

The invention belongs to the field of deep reinforcement learning, computer vision, pattern recognition and machine learning, and particularly relates to a three-dimensional visual inspection detection method, system and device based on a shape attention mechanism.

Background

Three-dimensional object detectors need to output reliable spatial and semantic information, i.e. three-dimensional position, orientation, occupied volume and category. Compared with two-dimensional object detection, the three-dimensional target provides more detail information, but the modeling difficulty is higher. Three-dimensional object detection typically employs distance sensors, such as laser radars, TOF cameras, stereo cameras, etc., to predict more meaningful and accurate results. Three-dimensional object detection becomes a key technology in the fields of automatic driving of automobiles, UVA, robots and the like. Most accurate three-dimensional object detection algorithms in traffic scenes are based on radar sensors, which have become the basic sensors for outdoor scene perception. And target perception in traffic scenes is a key technology of unmanned vehicles with respect to locating surrounding targets.

Lidar-based three-dimensional target detection involves two important issues. The first problem is how to generate descriptive underlying features for sparse non-uniform point clouds sampled from lidar sensors. The sampling points of the laser radar are more at the position close to the sensor and less at the position far away. The diversity distribution of the point cloud may reduce the detection performance of the detector and cause instability of the detection result. Many methods rely on manual feature extraction methods. However, the detection algorithm is not stable because the manual features do not take into account and handle the unbalanced laser point cloud distribution well. Object detection and segmentation play an extremely important role in both visual data understanding and perception. Another problem is how to efficiently encode the three-dimensional shape information to achieve better discriminant embedding. The three-dimensional object detection framework mainly comprises a single-stage detector and a two-stage detector. The single-stage detector has higher efficiency, and the two-stage detector has higher detection precision. The two-stage detector is not efficient because the region candidate network outputs the region of interest ROI that needs to be cropped. However, these cropped ROIs provide a shape prior for each detected object, and higher detection accuracy can be achieved through subsequent optimization networks. The performance of a single-stage detector is lower than that of a two-stage detector due to the lack of shape priors and subsequent optimization networks. However, for real-time systems, two-stage detectors are very time consuming. In addition, the three-dimensional shape prior is more suitable for the detection of three-dimensional targets.

Disclosure of Invention

In order to solve the above problems in the prior art, namely, the problems that the precision of a single-stage three-dimensional target detector is lower than that of a two-stage detector, and the two-stage detector consumes much time and is not suitable for a real-time system, the invention provides a three-dimensional visual inspection detection method based on a shape attention mechanism, which comprises the following steps:

step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional network;

step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a space sparse feature map corresponding to the data to be processed;

step S30, projecting the space sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution lamination to obtain a top view feature map;

step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;

step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;

step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.

In some preferred embodiments, in step S10, "the data to be detected is characterized by voxels based on a three-dimensional network", which is performed by:

wherein D represents the voxel representation of the laser point cloud data, x_i、y_i、z_iRepresenting the three-dimensional position information of the ith point in the laser point cloud data relative to the laser radar, R_iRepresenting the reflectivity of the ith point in the laser point cloud data.

In some preferred embodiments, in step S20, "obtaining the feature expression of the voxel through the feature extractor and performing sparse convolutional coding to obtain a spatial sparse feature map corresponding to the data to be processed", the method includes:

wherein, F () represents the feature representation of the voxel obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.

In some preferred embodiments, in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:

F_att(u,v)＝Conv_att(F_FPN(u,v))

wherein, F_att(u, v) represents the attention weight feature map corresponding to the top view feature map, F_FPN(u, v) represents a top view feature map, Conv_att() Representing the attention weight layer convolution operation.

In some preferred embodiments, in step S40, "obtaining the encoding feature map of the top view feature map through a convolutional encoding layer", the method includes:

F_en(u,v)＝Conv_en(F_FPN(u,v))

wherein, F_en(u, v) represents the coding feature map corresponding to the top view feature map, F_FPN(u, v) represents a top view feature map, Conv_en() Representing a convolutional encoding layer convolution operation.

In some preferred embodiments, in step S50, "multiplying the attention weight feature map to the corresponding region of the coding feature map, and performing feature concatenation to obtain the attention feature map", the method includes:

F_op(u,v)＝F_en(u,v)Repeat(Reshape(F_att(u,v)))

wherein, resume () represents the deformation operation, and Repeat () represents the copy operation;

wherein [ ] represents a characteristic splicing operation.

In some preferred embodiments, the target classification network is trained by a cross entropy loss function; the cross entropy loss function is:

wherein N represents the number of samples for which loss is calculated; y is_iRepresents positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; x is the number of_iA network output value representing a sample.

In some preferred embodiments, the target regression positioning network is trained by a Smooth L1 loss function; the Smooth L1 loss function is:

where x represents the residual requiring regression.

On the other hand, the invention provides a three-dimensional visual inspection detection system based on a shape attention mechanism, which comprises an input module, a sparse convolution coding module, a characteristic pyramid module, an attention weight convolution module, a coding convolution module, a characteristic fusion module, a target classification module, a target positioning module and an output module;

the input module is configured to acquire laser point cloud data containing a target object as to-be-detected data and represent the to-be-detected data through a voxel based on a three-dimensional network;

the sparse convolution coding module is configured to obtain the characteristic expression of the voxel through a characteristic extractor and carry out sparse convolution coding to obtain a spatial sparse characteristic diagram corresponding to the data to be processed;

the characteristic pyramid module is configured to project the space sparse characteristic diagram to a two-dimensional top view plane, obtain characteristics of different scales through a characteristic pyramid convolution network, and then combine the characteristics of different scales through deconvolution lamination to obtain a top view characteristic diagram;

the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;

the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;

the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;

the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;

the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;

the output module is configured to output the acquired object type, and the object position, size and direction.

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.

In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.

The invention has the beneficial effects that:

the three-dimensional visual inspection detection method based on the shape attention mechanism uses a sampling strategy based on distance constraint, can effectively relieve unstable results caused by uneven distribution of radar sampling point cloud data, solves the problem that a single-stage detector lacks shape prior through the attention mechanism based on the shape prior, can improve the detection performance of the conventional single-stage three-dimensional target detector, particularly aims at targets with obvious shape characteristics, is high in detection precision, short in detection time consumption, suitable for a real-time system and good in model robustness.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a three-dimensional visual inspection method based on a shape attention mechanism according to the present invention;

FIG. 2 is a schematic diagram of an algorithm structure of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;

FIG. 3 is a data set and an exemplary graph of the inspection results of one embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention;

FIG. 4 is a graph showing the comparison of the results of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention with other methods.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

The invention discloses a three-dimensional visual inspection method based on a shape attention mechanism, which comprises the following steps:

In order to more clearly illustrate the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, the following describes the steps in the embodiment of the method of the present invention in detail with reference to fig. 1.

The three-dimensional visual inspection method based on the shape attention mechanism comprises the following steps of S10-S60, wherein the steps are described in detail as follows:

step S10, laser point cloud data containing a target object is obtained as data to be detected, and the data to be detected is represented by a voxel based on a three-dimensional network, as shown in formula (1):

wherein D represents the voxel representation of the laser point cloud data, x_i、y_i、z_iRepresenting the three-dimensional position information of the ith point in the laser radar point cloud in the laser point cloud data, R_iRepresenting the reflectivity of the ith point in the laser point cloud data.

Assuming that the lidar point cloud includes a three-dimensional space of H, W, D, representing the height in the vertical direction, the position in the horizontal direction, and the distance, respectively, the size of each voxel is Δ H × Δ W × Δ D, Δ H ═ 0.4m, Δ W ═ 0.2m, and Δ D ═ 0.2 m. The size of the voxel grid in the whole three-dimensional space can be calculated by H/delta H, W/delta W, D/delta D. Each voxel is then characterized by a feature encoding layer (VFE). In one embodiment of the invention, the feature extractor describes the sample points in each voxel using 7-dimensional vectors (three-dimensional coordinates, reflectivity, and relative three-dimensional coordinates of the voxel, respectively), and adds to each sample the coordinate (P) of the current pillar center_x,P_y). At this time, the description vector of the sample point in each voxel becomes 9 dimensions. In one embodiment of the invention, the feature encoding layer (VFE) includes a linear layer, a batch normalization layer (BN), and a corrected linear unit layer (ReLU) to extract vector features of points.

Step S20, obtaining the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be processed, as shown in formula (2):

And step S30, projecting the space sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then combining the features of different scales through deconvolution lamination to obtain a top view feature map.

Space sparse feature map f_s(x, y, z) is projected to a top view (namely a bird's eye view), namely a space sparse feature map f_s(x, y, z) vertical dimension compression to obtain a characteristic diagram f of a top view_2D(u, v). Specifically, assuming that the original feature is (C, D, H, W), the height feature is incorporated into the feature channel to become (C × D, H, W), and a feature map in which the 2D convolution feature is a top view is obtained. Obtaining f by a characteristic pyramid convolution network_2D(u, v) features of different scales, and combining the features of different scales through the deconvolution layer to obtain a feature map f_FPN(u, v). In one embodiment of the present invention, the feature pyramid convolutional layer comprises three convolutional groups, each having (3, 5) convolutional layers, each of which is followed by a batch normalization layer (BN), a corrected linear unit layer (ReLU).

Step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; and acquiring the coding characteristic diagram of the top view characteristic diagram through a convolution coding layer.

Acquiring an attention weight characteristic diagram of the top view characteristic diagram through an attention weight layer, wherein the formula (3) is as follows:

F_att(u,v)＝Conv_att(F_FPN(u, v)) formula (3)

Acquiring a coding feature map of the top view feature map through a convolution coding layer, wherein the formula (4) is as follows:

F_en(u,v)＝Conv_en(F_FPN(u, v)) formula (4)

Step S50, multiplying the attention weight feature map to the corresponding region of the coding feature map, and performing feature concatenation to obtain an attention feature map, as shown in equations (5) and (6):

F_op(u,v)＝F_en(u,v)Repeat(Reshape(F_att(u, v))) formula (5)

wherein [ ] represents a characteristic splicing operation.

As shown in fig. 2, the schematic diagram of the algorithm structure of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention is divided into three parts: the first part is a Distance-based Voxel Generator (Distance-based Voxel Generator) that transforms the input lidar point cloud into voxels; the second part is a Feature extraction layer (features extraction layers) for coding voxel features and coding three-dimensional space features; the third part is Attention area recommendation network (Attention RPN), and the Attention mechanism is injected to output the detection result.

The target classification network is trained through a cross entropy loss function, wherein the cross entropy loss function is shown as a formula (7):

The target regression positioning network is trained by a Smooth L1 loss function, and the Smooth L1 loss function is shown as a formula (8):

where x represents the residual of the regression.

Attention profile F_hybridAnd (u, v) respectively connecting a target classification network and a target regression positioning network, wherein the target classification network is used for judging whether the detection object is a target, and the target regression positioning network is used for acquiring the position, the size and the direction of the detection object.

In one embodiment of the invention, for the car in the target classification task, setting the intersection ratio (IOU) of the anchor point and the target to be greater than 0.6 as a positive sample, and setting the intersection ratio to be less than 0.45 as a negative sample; for classes pedestrian and cyclist, a positive sample is taken when the intersection ratio (IOU) of anchor point and target is greater than 0.5, and a negative sample is taken when the intersection ratio is less than 0.35. For the regression positioning task, setting the width multiplied by the length multiplied by the height of a predefined anchor point corresponding to a target vehicle to be (1.6 multiplied by 3.9 multiplied by 1.5) meters; the width x length x height of the predefined anchor point for the target pedestrian is (0.6 x 0.8 x 1.73) meters; the width x length x height of the predefined anchor point for the target rider is (0.6 x 1.76 x 1.73) meters. Defining a three-dimensional real bounding box as x_g,y_g,z_g,l_g,w_g,h_g,θ_gWherein x, y and Z are the central positions of the bounding box, l, w and h represent the length, width and height of the three-dimensional target, and theta is the rotation angle of the target in the Z-axis direction_gRepresenting true values by_aAnd expressing the anchor point of the positive sample, expressing the corresponding residual error by delta, and predicting the position, the size and the direction of the real three-dimensional target through network learning. Residual error of central position of bounding box(Δ x, Δ y, Δ Z), a residual (Δ l, Δ w, Δ h) of the length and width of the three-dimensional target, and a residual (Δ θ) of the rotation angle of the target in the Z-axis direction are respectively expressed by the following equations (9), (10), and (11):

Δθ＝sin(θ_g-θ_a) Formula (11)

To illustrate the effectiveness of the invention in detail, the method proposed by the invention is applied to the public driverless data set KITTI, which contains 3 validation classes. As shown in fig. 3, which is an exemplary diagram of a data set and a detection result of an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, a first column Car represents a detection result of a vehicle, a second column Pedestrian represents a detection result of a Pedestrian, and a third column Cyclist represents a detection result of a rider. Each column has three groups of experimental results, each group comprises an RGB image and a top view of the radar, and the detection results are projected on the images.

In one embodiment of the invention, for the KITTI data set, the train data set is used for training, and the test data set is used for testing. As shown in fig. 4, which is a comparison graph of the detection results of the method of the present invention and other methods according to an embodiment of the three-dimensional visual inspection method based on the shape attention mechanism of the present invention, the data set is divided into three grades for each type of test object: easy, medium and difficult. The difficulty is divided according to the height of each target in the camera image, the occlusion level and the truncation degree. The height of the sample bounding box with easy difficulty is more than 40 and equal to each pixel, the maximum truncation is 15 percent, and the shielding level is completely visible; the height of a sample boundary frame with the difficulty is more than or equal to 25 pixels, the maximum truncation is 30%, and the shielding level is partial shielding; the height of the sample boundary frame with difficulty is more than or equal to 25 pixels, the maximum truncation is 50%, and the shielding level is difficult to see. BEV represents top view detection results and 3D represents detection results of a three-dimensional bounding box. The 3D target detection performance was evaluated using the PASCAL standard (AP, average accuracy). In the comparison method, ARPNET is used for representing the method, MV3D represents a multi-view 3D target detection method, ContFuse represents a depth continuous fusion multi-sensor 3D target detection method, AOVD represents multi-view aggregation data to realize a 3D object real-time detection method in an unmanned scene, F-PointNet represents a viewing cone point cloud network RGB-D data 3D object detection method, SECOND represents a sparse embedded convolution target detection method, and Voxelnet represents a point cloud data 3D target detection method based on end-to-end learning.

The three-dimensional visual inspection detection system based on the shape attention mechanism comprises an input module, a sparse convolution coding module, a characteristic pyramid module, an attention weight convolution module, a coding convolution module, a characteristic fusion module, a target classification module, a target positioning module and an output module;

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the three-dimensional visual inspection system based on the shape attention mechanism provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned three-dimensional visual inspection method based on the shape attention mechanism.

A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional visual inspection method based on the shape attention mechanism.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A three-dimensional visual inspection method based on a shape attention mechanism is characterized by comprising the following steps:

2. The three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S10, "the data to be inspected is characterized by voxels based on three-dimensional network", which is performed by:

3. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S20, "obtaining the feature expression of the voxel by a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be processed" includes:

4. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:

F_att(u,v)＝Conv_att(F_FPN(u,v))

5. A three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S40, "obtaining the encoding feature map of the top view feature map by convolution encoding layer" comprises:

F_en(u,v)＝Conv_en(F_FPN(u,v))

6. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S50, the method comprises the steps of multiplying the attention weight feature map to the corresponding region of the coding feature map and performing feature matching to obtain the attention feature map, and comprises the steps of:

F_op(u,v)＝F_en(u,v)Repeat(Reshape(F_att(u,v)))

wherein [ ] represents a characteristic splicing operation.

7. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the object classification network is trained by cross entropy loss function; the cross entropy loss function is:

8. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the target regression positioning network is trained by Smooth L1 loss function; the Smooth L1 loss function is:

where x represents the residual of the regression.

9. A three-dimensional visual inspection detection system based on a shape attention mechanism is characterized by comprising an input module, a sparse convolution coding module, a characteristic pyramid module, an attention weight convolution module, a coding convolution module, a characteristic fusion module, a target classification module, a target positioning module and an output module;

the input module is configured to acquire laser point cloud containing target object data as to-be-detected data, and the to-be-detected data is represented by voxels based on a three-dimensional network;

10. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for three-dimensional visual inspection based on the shape attention mechanism of any one of claims 1 to 8.

11. A treatment apparatus comprises

A processor adapted to execute various programs; and

a storage device adapted to store a plurality of programs;

wherein the program is adapted to be loaded and executed by a processor to perform:

the three-dimensional visual inspection method based on the shape attention mechanism as set forth in any one of claims 1 to 8.