CN114004963A - Target class identification method and device and readable storage medium

Target class identification method and device and readable storage medium

Info

Publication number
CN114004963A
CN114004963A (application CN202111652406.4A; granted as CN114004963B)
Authority
CN
China
Prior art keywords
feature vector
target
sub
feature
region
Prior art date
Legal status
Granted
Application number
CN202111652406.4A
Other languages
Chinese (zh)
Other versions
CN114004963B (en)
Inventor
艾国
杨作兴
房汝明
向志宏
Current Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd filed Critical Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202111652406.4A priority Critical patent/CN114004963B/en
Publication of CN114004963A publication Critical patent/CN114004963A/en
Application granted granted Critical
Publication of CN114004963B publication Critical patent/CN114004963B/en
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 - Classification techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a target class identification method, a target class identification device and a readable storage medium. The method comprises the following steps: performing feature extraction on an image to be recognized to obtain a first feature vector; finding, in the first feature vector, each first sub-feature vector corresponding to each foreground region and each background region; performing first interpolation processing on each first sub-feature vector to obtain second sub-feature vectors; performing target detection according to the second sub-feature vectors to obtain each target region; finding, in the first feature vector, each first sub-feature vector corresponding to each target region, and splicing these first sub-feature vectors into a third feature vector; performing self-adaptive global average pooling on the first feature vector to obtain a fourth feature vector; superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector; and identifying the target category according to the fifth feature vector to obtain the target categories contained in the image. The embodiment of the invention thereby refines the granularity of target class identification.

Description

Target class identification method and device and readable storage medium
Technical Field
The embodiments of the present invention relate to the field of image processing technologies, and in particular, to a method and an apparatus for identifying a target class, a readable storage medium, and a computer program product.
Background
In many scenarios, targets need to be classified for various purposes, and general image classification methods struggle to distinguish target classes with similar morphology and texture. Underground pipeline construction is one example. China's urbanization has advanced rapidly in recent years, and with hundreds of millions of people moving into cities, the load borne by underground pipelines has grown further. As a fundamental task of urban construction, underground pipeline work affects the stability of normal city operation, so pipe network systems must be overhauled in time to keep urban infrastructure running reliably.
At present, most underground pipeline defect detection proceeds as follows: a robot captures video data inside the well, the defective pipe sections are determined by manually screening the resulting mass of footage, and after a defect-type report is generated, workers repair the pipeline according to the reported defect types. Manual determination of pipe defect types has two disadvantages. First, screeners must have specific expertise, which limits the workforce that can be deployed. Second, people tire during long shifts, which reduces working efficiency. Together, these two drawbacks make it difficult to carry out urban pipeline maintenance in time.
At present, neural networks are used to classify only a few simple drainage-pipe damage categories (deposition, cracks, tree roots and the like); because such classifiers cannot selectively capture the information useful for distinguishing defect categories, they do not generalize to recognizing more pipeline defect categories.
To identify more pipeline defect classes, another existing solution uses a deep convolutional neural network: on the basis of VGG (Visual Geometry Group) 16, a CBAM (Convolutional Block Attention Module) is inserted into each convolution module to compute channel-level and spatial-level attention matrices, which are then multiplied with the original features so that the network extracts discriminative features. This extends pipeline defect classification from the earlier three defect types to seven, including deformation, corrosion, scaling, dislocation, deposition, leakage and fracture. Although more general than previous methods, this approach still has several drawbacks:
1) According to the published industry standard "Technical Regulation for Detection and Evaluation of Urban Drainage Pipelines" of China's Ministry of Housing and Urban-Rural Development, underground drainage pipelines exhibit 17 defect types (hidden connection, deformation, misconnection, residual wall, penetration, corrosion, scum, scaling, undulation, tree root, disjointing, shedding, obstacle, stagger, deposition, leakage and fracture). The above method can identify only the seven most common defect types, so it cannot be widely deployed and does not achieve the goal of reducing labor input: even if it reaches extremely high recognition accuracy on those seven defects, there is no guarantee that the remaining ten defects are not simply classified as normal, and those ten defects cannot be neglected in drainage pipeline repair. The data output by the model must therefore still be screened again by hand.
2) Although channel-level and spatial-level attention mechanisms let the neural network extract the features that matter for a sample, and thereby help distinguish defect types with high inter-class confusion, the powerful fitting ability of neural networks comes from their thousands of neurons, and inserting channel and spatial attention computations into every convolutional layer greatly increases the computational cost of the model.
3) CBAM is essentially a self-attention mechanism that autonomously suppresses or enhances certain feature information on the basis of the current prior. However, the prior information that image-classification labels provide for feature selection is limited, and as more and increasingly complex pipeline defect categories need to be identified, the CBAM mechanism struggles to focus on the truly discriminative features.
In addition, academia has contributed many meaningful explorations of fine-grained image classification, all of which distinguish classes with high inter-class confusion by detecting first and classifying afterwards: the positions of the sub-regions that carry the fine-grained distinctions are detected, and fine classification is then performed on the features of those regions. However, this detect-then-classify paradigm is unsuited to modeling a general pipeline defect classifier, for two reasons. First, a key property of pipeline defects is that they are widely distributed yet occupy a small fraction of the image area (for example, cracks in the pipe wall or tree roots). Second, classifying on the features of locally distinguishable regions alone does not help separate classes that are already far apart between classes.
Disclosure of Invention
The embodiments of the present invention provide a target class identification method, a target class identification device, a readable storage medium and a computer program product, which refine the class granularity of target class identification and improve the identification precision of target class identification.
The technical scheme of the embodiment of the invention is realized as follows:
a method of object class identification, the method comprising:
performing feature extraction on an image to be recognized to obtain a first feature vector;
detecting each foreground region and each background region in the image according to the first feature vector;
searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the image;
respectively carrying out first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region to obtain each second sub-feature vector with a first fixed size;
performing target detection according to the second sub-feature vectors to obtain target areas in the image;
searching each first sub-feature vector corresponding to each target area in the first feature vectors according to the corresponding area of each target area in the image, and splicing each first sub-feature vector corresponding to all target areas into a third feature vector;
performing self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; the dimensionality of the fourth feature vector is the same as that of the third feature vector;
superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector;
and identifying the target category according to the fifth feature vector to obtain the target category contained in the image.
The feature extraction of the image to be recognized to obtain a first feature vector comprises the following steps:
inputting an image to be identified into a backbone network of a neural network for feature extraction;
detecting each foreground region and each background region in the image according to the first feature vector, including:
inputting first feature vectors into a region suggestion network of the neural network to detect respective foreground regions and respective background regions in the image;
the performing target detection according to the second sub-feature vectors includes:
inputting the second sub-feature vectors into a target detection network of the neural network for target detection;
the classifying and identifying the target according to the fifth feature vector comprises the following steps:
and inputting the fifth feature vector into a target classification network of the neural network for target class identification.
The backbone network, the area suggestion network, the target detection network and the target classification network of the neural network are obtained through the following training processes:
acquiring a training image set, and labeling each labeling target area and a corresponding labeling target category in each frame of training image;
sequentially taking out a frame of training image from the training image set and inputting the frame of training image into a backbone network of the neural network for feature extraction to obtain a first feature vector of the input training image;
inputting a first feature vector into a region suggestion network of the neural network to detect each foreground region and each background region in an input training image;
searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the input training image;
respectively carrying out first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region to obtain each second sub-feature vector with a first fixed size;
inputting the second sub-feature vectors into a target detection network of the neural network to obtain detection target areas and detection target types of the detection target areas in the input training image;
calculating by adopting a preset first loss function according to each detection target area and the detection target category of each detection target area obtained by the target detection network, each labeled target area labeled in the input training image and the corresponding labeled target category to obtain a first prediction deviation;
searching each first sub-feature vector corresponding to each detection target area in the first feature vector according to the corresponding area of each detection target area in the input training image, and splicing each first sub-feature vector corresponding to all detection target areas into a third feature vector;
performing self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; the dimensionality of the fourth feature vector is the same as that of the third feature vector;
superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector;
inputting the fifth feature vector into a target classification network of the neural network to obtain each detection target category contained in the input training image;
calculating by adopting a preset second loss function according to each detection target category contained in the input training image obtained by the target classification network and the labeled target category of each labeled target area labeled in the input training image to obtain a second prediction deviation;
carrying out weighted summation on the first prediction deviation and the second prediction deviation, and adjusting parameters of the neural network according to the weighted summation;
when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
After the first sub-feature vectors corresponding to all the target regions are spliced into the third feature vector, and before the fourth feature vector and the third feature vector are superposed, the method further includes:
performing self-attention mechanism enhancement processing on the third feature vector to obtain a self-attention coefficient of each feature value in the third feature vector, and multiplying each feature value in the third feature vector by the self-attention coefficient to obtain a self-attention mechanism enhancement feature vector of the third feature vector;
the superimposing the fourth feature vector and the third feature vector includes:
and superposing the fourth feature vector and the self-attention mechanism enhanced feature vector of the third feature vector.
The labeling of each labeling target area and the corresponding labeling target category in each frame of training image further comprises:
labeling the outline of each labeling target in each frame of training image;
after finding each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector, and before performing weighted summation on the first prediction deviation and the second prediction deviation, the method further includes:
respectively carrying out second interpolation processing on the first sub-feature vectors corresponding to the foreground regions and the background regions to obtain sixth sub-feature vectors with a second fixed size;
inputting the sixth sub-feature vectors into a semantic segmentation network of the neural network to obtain the outline and the class of each detection target in the input training image;
calculating by adopting a preset third loss function according to the contour and the detection target category of each detection target in the input training image and the contour and the labeling target category of each labeling target labeled in the input training image obtained by the semantic segmentation network to obtain a third prediction deviation;
the weighted summation of the first prediction bias and the second prediction bias comprises:
and carrying out weighted summation on the first prediction deviation, the second prediction deviation and the third prediction deviation.
The image to be identified is a pipeline image, and the target category is a pipeline defect category.
The pipeline defect category comprises one or any combination of the following: blind joint, deformation, misconnection, wall residue, penetration, corrosion, scum, scaling, undulation, tree root, disjointing, shedding, obstacle, stagger, deposition, leakage, rupture.
An object class identification apparatus, the apparatus comprising:
the characteristic extraction module is used for extracting characteristics of the image to be identified to obtain a first characteristic vector;
the region suggestion module is used for detecting each foreground region and each background region in the image to be identified according to the first feature vector;
the region-of-interest alignment module is used for searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the image to be identified, and performing first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region respectively to obtain each second sub-feature vector with a first fixed size; searching each first sub-feature vector corresponding to each target area in the first feature vectors according to the corresponding area of each target area in the image detected by the target detection module, and splicing each first sub-feature vector corresponding to all the target areas into a third feature vector;
the target detection module is used for carrying out target detection according to the second sub-feature vectors to obtain each target area in the image to be identified;
the self-adaptive global average pooling processing module is used for carrying out self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; the dimensionality of the fourth feature vector is the same as that of the third feature vector;
the feature fusion module is used for superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector;
and the category identification module is used for identifying the target category according to the fifth feature vector to obtain the target category contained in the image to be identified.
An object class recognition neural network training apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a training image set and marking each marking target area and the corresponding marking target category in each frame of training image;
the characteristic extraction module is used for sequentially extracting a frame of training image from the training image set and inputting the frame of training image into a backbone network of the neural network for characteristic extraction to obtain a first characteristic vector of the input training image;
the region suggestion module is used for inputting the first feature vector into a region suggestion network of the neural network so as to detect each foreground region and each background region in the input training image;
the region-of-interest alignment module is used for searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the input training image, and performing first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region respectively to obtain each second sub-feature vector with a first fixed size; searching each first sub-feature vector corresponding to each target area in the first feature vectors according to the corresponding area of each target area in the input training image obtained by the target detection module, and splicing each first sub-feature vector corresponding to all target areas into a third feature vector;
the target detection module is used for inputting the second sub-feature vectors into a target detection network of the neural network to obtain detection target areas and detection target types of the detection target areas in the input training image;
the self-adaptive global average pooling processing module is used for carrying out self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; the dimensionality of the fourth feature vector is the same as that of the third feature vector;
the feature fusion module is used for superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector;
the class identification module is used for inputting a fifth feature vector into a target classification network of the neural network to obtain each detection target class contained in the input training image;
the adjusting module is used for calculating by adopting a preset first loss function according to the detection target areas and the detection target types of the detection target areas obtained by the target detection module, the marking target areas marked in the input training image and the corresponding marking target types to obtain a first prediction deviation; calculating by adopting a preset second loss function according to each detection target category contained in the input training image and each labeled target category of each labeled target area labeled in the input training image obtained by the category identification module to obtain a second prediction deviation; carrying out weighted summation on the first prediction deviation and the second prediction deviation, and adjusting parameters of the neural network according to the weighted summation; when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the target class identification method of any one of the above.
In the embodiment of the present invention, the local features (third feature vector) corresponding to each target region in the image and the global features (fourth feature vector) extracted from the whole image are superimposed and fused before target class identification is performed. As a result, both classes with a high degree of inter-class confusion and classes that are far apart between classes can be distinguished, which refines the class granularity of target class identification and improves the identification precision of target class identification.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of a target class identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a target class identification method according to another embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a neural network for target class recognition according to an embodiment of the present invention;
FIG. 4 is an original training image acquired: a schematic of a pipeline image;
FIG. 5 is a schematic illustration of a defect of FIG. 4 being labeled;
fig. 6 is a schematic structural diagram of an object class identification apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a target class recognition neural network training device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a flowchart of a target class identification method according to an embodiment of the present invention, which includes the following specific steps:
step 101: and performing feature extraction on the image to be recognized to obtain a first feature vector.
Step 102: and detecting each foreground region and each background region in the image according to the first feature vector.
One foreground region corresponds to one connected region where a target is located, and each of the remaining connected regions corresponds to a background region. For example, if the image to be identified is a pipeline image and the target categories to be identified are pipeline defect categories, each connected region where a defect is located in the pipeline image is a foreground region.
Step 103: and searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the image.
Each feature value in the first feature vector corresponds to a region of the image to be recognized; that is, each feature value describes one region of that image. Therefore, for each foreground or background region, the corresponding part of the feature values can be found in the first feature vector according to the region's position in the image to be recognized. Since this part of the feature values actually forms a sub-vector of the first feature vector, it is referred to as the corresponding first sub-feature vector.
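As an illustration of this lookup, the following is a minimal PyTorch-style sketch; the stride value, shapes and function name are assumptions for illustration, not taken from the patent. A region's pixel box is scaled by the backbone stride to index the matching slice of the feature map:

```python
import torch

def crop_sub_feature(feature_map: torch.Tensor, box_xyxy, stride: int = 16):
    """feature_map: (C, H, W); box_xyxy: (x1, y1, x2, y2) in image pixels."""
    # Each feature value describes a stride x stride patch of the input image,
    # so the pixel box maps onto this rectangle of feature values.
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box_xyxy]
    return feature_map[:, y1:y2 + 1, x1:x2 + 1]

feat = torch.randn(256, 50, 50)                    # first feature vector of an 800x800 image
sub = crop_sub_feature(feat, (160, 80, 320, 240))  # one first sub-feature vector
print(sub.shape)                                   # torch.Size([256, 11, 11])
```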
Step 104: and respectively carrying out first interpolation processing on the first sub-feature vectors corresponding to each foreground area and each background area to obtain second sub-feature vectors with a first fixed size.
Step 105: and carrying out target detection according to the second sub-feature vectors to obtain each target area in the image.
Step 106: and finding each first sub-feature vector corresponding to each target area in the first feature vectors according to the corresponding area of each target area in the image, and splicing each first sub-feature vector corresponding to all the target areas into a third feature vector.
For each target area, the part of the feature values corresponding to it can be found in the first feature vector according to the target area's position in the image to be recognized; this part of the feature values is called the first sub-feature vector corresponding to the target area.
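A minimal sketch of the splicing in step 106 follows; the shapes and the pooling-then-concatenation reading of "splicing" are assumptions, since the patent does not fix the exact operation:

```python
import torch
import torch.nn.functional as F

# First sub-feature vectors cropped for two detected target areas.
crops = [torch.randn(1, 256, 11, 7), torch.randn(1, 256, 5, 9)]
# Reduce each crop to a fixed-length vector, then concatenate ("splice")
# the per-target vectors into a single third feature vector.
pooled = [F.adaptive_avg_pool2d(c, 1).flatten(1) for c in crops]  # (1, 256) each
third_feature = torch.cat(pooled, dim=1)                          # (1, 512)
```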
Step 107: performing self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; and the dimension of the fourth feature vector is the same as that of the third feature vector.
Step 108: and superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector.
Step 109: and carrying out target classification and identification according to the fifth feature vector to obtain a target class contained in the image.
In the above embodiment, the local features (third feature vector) corresponding to each target region in the image and the global features (fourth feature vector) extracted from the whole image are superimposed and fused before target class identification is performed. As a result, both categories with a high degree of inter-class confusion and categories that are far apart between classes can be distinguished, which refines the category granularity of target category identification and improves its identification precision.
In an alternative embodiment, steps 101, 102, 105 and 109 may be implemented by a neural network, which is mainly composed of a backbone network, a region suggestion network (RPN), a target detection network and a target classification network.
Fig. 2 is a flowchart of a target class identification method according to another embodiment of the present invention, which includes the following specific steps:
step 201: and inputting the image to be identified into a backbone network of a neural network for feature extraction to obtain a first feature vector.
The backbone network may employ a ResNet50 structure.
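A hedged sketch of one way to realize such a backbone with torchvision's ResNet-50 follows; the truncation point and input size are assumptions, not specified by the patent:

```python
import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)
# Keep everything up to the last convolutional stage; drop avgpool and fc.
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 800, 800)   # image to be identified
first_feature = backbone(image)       # first feature vector
print(first_feature.shape)            # torch.Size([1, 2048, 25, 25])
```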
Step 202: the first feature vectors are input into a region suggestion network (RPN) of the neural network to detect respective foreground regions and respective background regions in the image.
Step 203: and searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the image.
Step 204: and respectively carrying out first interpolation processing on the first sub-feature vectors corresponding to each foreground area and each background area to obtain second sub-feature vectors with a first fixed size.
Step 205: and inputting each second sub-feature vector into a target detection network of the neural network for target detection to obtain each target area in the image.
Step 206: and finding each first sub-feature vector corresponding to each target area in the first feature vectors according to the corresponding area of each target area in the image, and splicing each first sub-feature vector corresponding to all the target areas into a third feature vector.
Step 207: performing self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; and the dimension of the fourth feature vector is the same as that of the third feature vector.
Step 208: and superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector.
Step 209: and inputting the fifth feature vector into a target classification network of the neural network for target classification and identification to obtain a target class contained in the image.
In an alternative embodiment, after step 106, before step 108, or after step 206, before step 208, further comprising: performing self-attention mechanism enhancement processing on the third feature vector to obtain a self-attention coefficient of each feature value in the third feature vector, and multiplying each feature value in the third feature vector by the self-attention coefficient to obtain a self-attention mechanism enhancement feature vector of the third feature vector;
in step 108 or step 208, the superimposing the fourth feature vector and the third feature vector includes: and superposing the fourth feature vector and the self-attention mechanism enhanced feature vector of the third feature vector.
The value range of the self-attention coefficient of each feature value is [0, 1]; the self-attention mechanism enhancement processing is an existing algorithm and is not described herein.
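A minimal sketch of the enhancement follows; the sigmoid-gated linear layer is one plausible realization under stated assumptions, not the patent's exact layer:

```python
import torch
import torch.nn as nn

class SelfAttentionEnhance(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps each feature value to a self-attention coefficient in [0, 1].
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, third: torch.Tensor) -> torch.Tensor:
        coeff = self.gate(third)   # self-attention coefficients
        return third * coeff       # self-attention enhanced feature vector

third = torch.randn(1, 2048)
enhanced = SelfAttentionEnhance(2048)(third)
```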
In the above embodiment, the feature value corresponding to the target region in the image may be enhanced by calculating the self-attention coefficient of each feature value in the third feature vector, so as to improve the accuracy of the final target class identification.
Fig. 3 is a flowchart of a method for training a neural network for target class recognition according to an embodiment of the present invention, which includes the following specific steps:
step 301: and acquiring a training image set, and labeling each target area and the corresponding target class in each frame of training image.
In order to distinguish the target areas from the detection target areas in the subsequent step 306, each target area labeled in the step 301 is referred to as a labeled target area; in order to distinguish from the detection target class in the subsequent steps 306 and 311, the target class labeled in this step 301 is referred to as a labeled target class.
Here, a labeled target area is usually represented by a rectangular box; what is essentially labeled is the position of the target area, which is usually described by the top-left vertex or the center point of the rectangular box.
For example: when the pipeline defect category is to be identified, pipeline images are collected to form a training image set.
FIG. 4 is a schematic of an acquired original training image, a pipeline image, in which the gray circle is a defect (here pipeline defect categories are being identified).
FIG. 5 is a schematic diagram of labeling the defect of FIG. 4, wherein the dashed rectangle is the defect frame (i.e., the smallest rectangle containing a defect); what is essentially labeled is the location of the defect.
The black circle in FIG. 5 is the mask of the region corresponding to the defect (i.e., the smallest connected region formed by the pixel points on the defect). Since the mask conveys the contour of the defect, labeling the mask amounts to labeling the contour of the defect.
Step 302: and sequentially taking out a frame of training image from the training image set and inputting the frame of training image into a backbone network of the neural network for feature extraction to obtain a first feature vector of the input training image.
Step 303: the first feature vectors are input to a region suggestion network (RPN) of the neural network to detect respective foreground regions and respective background regions in the input training image.
Each foreground and background area is represented by a rectangular box.
Step 304: and searching each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector according to the corresponding region of each foreground region and each background region in the input training image.
Each feature value in the first feature vector corresponds to a region in the input training image (a region composed of a plurality of pixel points); according to which region of the training image each foreground or background box occupies (that is, the region of the training image onto which the rectangle of the foreground or background box maps), the corresponding sub-feature vector can be found in the first feature vector.
Step 305: and respectively carrying out first interpolation processing on the first sub-feature vectors corresponding to each foreground area and each background area to obtain second sub-feature vectors with a first fixed size.
The first interpolation processing may be bilinear interpolation; the specific interpolation algorithm is not limited in this embodiment.
The first fixed size can be set as required and is not limited in this embodiment; for example, it may be set to 7 × 7.
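Under the bilinear assumption, this step can be sketched as follows; the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

sub = torch.randn(1, 256, 11, 5)   # one variable-size first sub-feature vector
second = F.interpolate(sub, size=(7, 7), mode="bilinear", align_corners=False)
print(second.shape)                # torch.Size([1, 256, 7, 7]) - first fixed size
```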
Step 306: and inputting each second sub-feature vector into a target detection network of the neural network to obtain each detection target area and the detection target category of each detection target area in the input training image.
A detection target area is a target area detected in the input training image by the target detection network of the neural network.
The detection target area is represented by a detection target frame, i.e., a minimum rectangular frame containing the detected target, and the position of the detection target area is usually described by the top left vertex or the center point of the rectangular frame.
Step 307: and calculating by adopting a preset first loss function according to each detection target area and the detection target type of each detection target area and each labeled target area and labeled target type labeled in the input training image to obtain a first prediction deviation.
Here, the target detection network outputs two quantities for each target: the detection target area and the detection target category. A loss function is computed for each, and the two loss functions may be the same or different. For example, a smooth_L1_loss (smooth L1 loss) function may be used for the detection target area and a cross-entropy function for the detection target category; the prediction deviation corresponding to the detection target area and the prediction deviation corresponding to the detection target category are then added to obtain the first prediction deviation.
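A hedged sketch of this first prediction deviation follows; the shapes, the number of classes and the matching of detections to labels are simplified assumptions:

```python
import torch
import torch.nn.functional as F

pred_boxes = torch.randn(8, 4)          # detected target areas
gt_boxes   = torch.randn(8, 4)          # matched labeled target areas
cls_logits = torch.randn(8, 18)         # e.g. 17 defect classes + background
gt_classes = torch.randint(0, 18, (8,))

box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)   # area deviation
cls_loss = F.cross_entropy(cls_logits, gt_classes)  # category deviation
first_prediction_deviation = box_loss + cls_loss
```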
Step 308: and searching each first sub-feature vector corresponding to each detection target area in the first feature vector according to the corresponding area of each detection target area in the input training image, and splicing each first sub-feature vector corresponding to all detection target areas into a third feature vector.
Step 309: performing self-adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; and the dimension of the fourth feature vector is the same as that of the third feature vector.
The adaptive global average pooling process is an existing mature algorithm and is not described herein.
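A minimal sketch of steps 309 and 310, with PyTorch assumed: reading "superposing" as element-wise addition is an assumption, and the third feature vector's size is chosen here to match the pooled output:

```python
import torch
import torch.nn as nn

first = torch.randn(1, 2048, 25, 25)   # first feature vector
third = torch.randn(1, 2048)           # spliced local features (assumed size)

# Adaptive global average pooling to the third feature vector's dimensionality.
fourth = nn.AdaptiveAvgPool2d(1)(first).flatten(1)   # (1, 2048)
fifth = fourth + third                               # superposed fifth feature vector
```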
Step 310: and superposing the fourth feature vector and the third feature vector to obtain a fifth feature vector.
Step 311: and inputting the fifth feature vector into a target classification network of the neural network to obtain a detection target class contained in the input training image.
Step 312: and calculating by adopting a preset second loss function according to the detection target class obtained by the target classification network and each labeled target class labeled in the input training image to obtain a second prediction deviation.
The second loss function may be a cross-entropy function.
Step 313: and carrying out weighted summation on the first prediction deviation and the second prediction deviation, and adjusting the parameters of the neural network according to the weighted summation.
For example, an SGD (Stochastic Gradient Descent) algorithm may be used to adjust the parameters of the neural network according to the weighted sum.
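A minimal sketch of step 313 follows; the stand-in model, stand-in losses and the weight values are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)   # stand-in for the full neural network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

out = model(torch.randn(4, 8))
first_deviation  = out.pow(2).mean()   # stand-in for the first prediction deviation
second_deviation = out.abs().mean()    # stand-in for the second prediction deviation

loss = 1.0 * first_deviation + 0.5 * second_deviation  # weights are assumptions
optimizer.zero_grad()
loss.backward()
optimizer.step()   # adjust the parameters according to the weighted sum
```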
Step 314: when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
In an optional embodiment, after step 308 and before step 310, further comprising: performing self-attention mechanism enhancement processing on the third feature vector to obtain a self-attention coefficient of each feature value in the third feature vector, and multiplying each feature value in the third feature vector by the self-attention coefficient to obtain a self-attention mechanism enhancement feature vector of the third feature vector;
in step 310, the superimposing the fourth feature vector and the third feature vector includes:
and superposing the fourth feature vector and the self-attention mechanism enhanced feature vector of the third feature vector.
The value range of the self-attention coefficient of each feature value is [0, 1]; the self-attention mechanism enhancement processing is an existing algorithm and is not described herein.
In an optional embodiment, in step 301, the contour and category of each labeled target are further labeled in each frame of training image, and the method further includes, after "finding each first sub-feature vector corresponding to each detected target region in the first feature vector" in step 308 and before "performing weighted summation on the first prediction deviation and the second prediction deviation" in step 313:
respectively performing second interpolation processing on each first sub-feature vector to obtain sixth sub-feature vectors of a second fixed size; inputting each sixth sub-feature vector into a semantic segmentation network of the neural network to obtain the contour and category of each detection target in the input training image; and calculating with a preset third loss function, from the contours and categories of the detection targets obtained by the semantic segmentation network and the contours and categories of the targets labeled in the input training image, to obtain a third prediction deviation. Here, the third loss function may be a cross-entropy function. The second interpolation processing may be bilinear interpolation; the specific interpolation algorithm is not limited in this embodiment. The second fixed size can be set as required and is not limited in this embodiment; for example, it may be set to 13 × 13. The labeled contour of a target serves as the target's real contour.
In step 313, the weighted summation of the first prediction bias and the second prediction bias includes: and carrying out weighted summation on the first prediction deviation, the second prediction deviation and the third prediction deviation.
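A sketch of the three-way weighted summation under assumed equal weights; the per-pixel cross-entropy over 13 × 13 maps follows the example sizes in the text, and all values are illustrative:

```python
import torch
import torch.nn.functional as F

seg_logits = torch.randn(8, 18, 13, 13)          # per-pixel class scores
seg_labels = torch.randint(0, 18, (8, 13, 13))   # labeled contours as masks
third_deviation = F.cross_entropy(seg_logits, seg_labels)

first_deviation, second_deviation = torch.tensor(1.2), torch.tensor(0.7)
total = 1.0 * first_deviation + 1.0 * second_deviation + 1.0 * third_deviation
```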
In this embodiment, adding the semantic segmentation network makes the features predicted by the neural network carry better target position information, so background information can be suppressed and target information enhanced precisely, providing accurate prior information for the subsequent process.
In practical application, some of the collected images can be put into a verification image set. When the neural network converges, the converged network is verified with the verification image set; if the verification result does not meet requirements, the structure of each sub-network in the neural network is changed and the network is retrained until the verification result meets requirements. Typically, the verification image set is 1/4 the size of the training image set.
The image to be identified in the embodiment of the invention can be a pipeline image, and the corresponding target category is a pipeline defect category.
The pipeline defect category in the embodiment of the invention can comprise one or any combination of the following: blind joint, deformation, misconnection, wall residue, penetration, corrosion, scum, scaling, undulation, tree root, disjointing, shedding, obstacle, stagger, deposition, leakage, rupture. Of course, the classification of pipe defects in the embodiments of the present invention is not limited thereto, and other pipe defects or classification of defects similar to pipes are covered by the scope of the present claims.
Fig. 6 is a schematic structural diagram of an object class identification apparatus according to an embodiment of the present invention, where the apparatus mainly includes: a feature extraction module 61, a region suggestion module 62, a region of interest alignment module 63, a target detection module 64, an adaptive global average pooling processing module 65, a feature fusion module 66, and a category identification module 67, wherein:
the feature extraction module 61 is configured to perform feature extraction on the image to be identified to obtain a first feature vector.
And the region suggesting module 62 is configured to detect each foreground region and each background region in the image to be recognized according to the first feature vector obtained by the feature extracting module 61.
A region-of-interest alignment module 63, configured to find, according to the corresponding regions in the image to be identified of each foreground region and each background region detected by the region suggestion module 62, each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector obtained by the feature extraction module 61, and respectively perform first interpolation processing on each first sub-feature vector to obtain second sub-feature vectors of a first fixed size; and to find, according to the corresponding region of each target region in the image detected by the target detection module 64, each first sub-feature vector corresponding to each target region in the first feature vector obtained by the feature extraction module 61, and splice the first sub-feature vectors corresponding to all target regions into a third feature vector.
And the target detection module 64 is configured to perform target detection according to each second sub-feature vector obtained by the region-of-interest alignment module 63, so as to obtain each target region in the image to be identified.
The adaptive global average pooling processing module 65 is configured to perform adaptive global average pooling processing on the first feature vector obtained by the feature extracting module 61 to obtain a fourth feature vector; and the dimension of the fourth feature vector is the same as that of the third feature vector.
And the feature fusion module 66 is configured to superimpose the fourth feature vector obtained by the adaptive global average pooling processing module 65 and the third feature vector obtained by the region of interest aligning module 63 to obtain a fifth feature vector.
And the category identification module 67 is configured to perform target category identification according to the fifth feature vector obtained by the feature fusion module 66, so as to obtain a target category included in the image to be identified.
Fig. 7 is a schematic structural diagram of a target class recognition neural network training device according to an embodiment of the present invention, where the device mainly includes: an image acquisition module 71, a feature extraction module 72, a region suggestion module 73, a region of interest alignment module 74, a target detection module 75, an adaptive global average pooling processing module 76, a feature fusion module 77, a category identification module 78, and an adjustment module 79, wherein:
the image acquisition module 71 is configured to acquire a training image set, and label each labeled target area and a corresponding labeled target category in each frame of training image.
And the feature extraction module 72 is configured to sequentially extract a frame of training image from the training image set and input the frame of training image to the backbone network of the neural network for feature extraction, so as to obtain a first feature vector of the input training image.
A region suggestion module 73, configured to input the first feature vector to a region suggestion network of the neural network, so as to detect each foreground region and each background region in the input training image.
A region-of-interest alignment module 74, configured to search, according to corresponding regions of each foreground region and each background region in the input training image, each first sub-feature vector corresponding to each foreground region and each background region in the first feature vector, and perform first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region, respectively, to obtain each second sub-feature vector of a first fixed size; according to the corresponding region of each target region in the input training image obtained by the target detection module 75, each first sub-feature vector corresponding to each target region is found in the first feature vector, and the first sub-feature vectors corresponding to all target regions are spliced into a third feature vector.
The target detection module 75 is configured to input each second sub-feature vector to a target detection network of the neural network, so as to obtain each detection target area and a detection target category of each detection target area in the input training image.
An adaptive global average pooling processing module 76, configured to perform adaptive global average pooling processing on the first feature vector to obtain a fourth feature vector; and the dimension of the fourth feature vector is the same as that of the third feature vector.
And a feature fusion module 77, configured to superimpose the fourth feature vector and the third feature vector to obtain a fifth feature vector.
And a class identification module 78, configured to input the fifth feature vector to a target classification network of the neural network, so as to obtain each detection target class included in the input training image.
An adjusting module 79, configured to calculate, according to the detection target regions and the detection target categories of the detection target regions obtained by the target detecting module 75, and the labeling target regions and the corresponding labeling target categories labeled in the input training image, by using a preset first loss function, so as to obtain a first prediction deviation; calculating by using a preset second loss function according to each detection target category contained in the input training image and each labeled target category of each labeled target region labeled in the input training image obtained by the category identification module 78 to obtain a second prediction deviation; carrying out weighted summation on the first prediction deviation and the second prediction deviation, and adjusting parameters of the neural network according to the weighted summation; when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
Embodiments of the present invention further provide a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the steps of the object class identification method described in any of the above embodiments are implemented.
Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed by a processor, perform the steps of the target class identification method described above. In practical applications, the computer-readable medium may be included in each device/apparatus/system of the above embodiments, or may exist separately without being assembled into the device/apparatus/system.
According to embodiments disclosed herein, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example and without limitation: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing, without limiting the scope of the present disclosure. In the embodiments disclosed herein, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
An embodiment of the present invention further provides an electronic device. Fig. 8 shows a schematic structural diagram of the electronic device according to an embodiment of the present invention. Specifically:
the electronic device may include a processor 81 with one or more processing cores, a memory 82 comprising one or more computer-readable storage media, and a computer program stored in the memory and executable on the processor. The above-described target class identification method is implemented when the program stored in the memory 82 is executed by the processor 81.
Specifically, in practical applications, the electronic device may further include a power supply 83, an input/output unit 84, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 8 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 81 is the control center of the electronic device; it connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 82 and by calling the data stored in the memory 82, thereby monitoring the electronic device as a whole.
The memory 82 may be used to store software programs and modules, i.e., the computer-readable storage media described above. The processor 81 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 82. The memory 82 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 82 may include high-speed random-access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 82 may also include a memory controller to provide the processor 81 with access to the memory 82.
The electronic device further comprises a power supply 83 for supplying power to each component. The power supply 83 may be logically connected to the processor 81 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 83 may further include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
The electronic device may also include an input/output unit 84, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. The input/output unit 84 may also be used to display information entered by or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments disclosed herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the present application. All such combinations fall within the scope of the present disclosure, provided they do not depart from the spirit and teachings of the present application.
The principles and embodiments of the present invention are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present invention, and are not intended to limit the present application. It will be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles, spirit, and scope of the invention, and that all such modifications, equivalents, and improvements are intended to be protected by the claims.

Claims (10)

1. A target class identification method, characterized in that the method comprises:
performing feature extraction on an image to be recognized to obtain a first feature vector;
detecting each foreground region and each background region in the image according to the first feature vector;
searching the first feature vector for each first sub-feature vector corresponding to each foreground region and each background region, according to the corresponding regions of the foreground regions and the background regions in the image;
performing first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region, respectively, to obtain second sub-feature vectors of a first fixed size;
performing target detection according to the second sub-feature vectors to obtain each target region in the image;
searching the first feature vector for each first sub-feature vector corresponding to each target region, according to the corresponding region of each target region in the image, and splicing the first sub-feature vectors corresponding to all target regions into a third feature vector;
performing adaptive global average pooling on the first feature vector to obtain a fourth feature vector, wherein the dimension of the fourth feature vector is the same as that of the third feature vector;
superimposing the fourth feature vector and the third feature vector to obtain a fifth feature vector; and
identifying the target category according to the fifth feature vector to obtain each target category contained in the image.
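By way of illustration only (not part of the claim), the first interpolation processing of claim 1 is commonly realized with RoI-Align-style bilinear resampling; the sketch below assumes a 1024x1024 image, a stride-16 feature map, and a 7x7 fixed size, none of which the claim prescribes:

    import torch
    from torchvision.ops import roi_align

    first_feature = torch.randn(1, 256, 64, 64)            # first feature vector
    regions = torch.tensor([[0., 100., 120., 300., 360.],  # [batch_idx, x1, y1, x2, y2]
                            [0., 400.,  80., 560., 240.]]) # in image coordinates

    # First interpolation processing: each first sub-feature vector is
    # resampled to a first fixed size (7x7, chosen arbitrarily here).
    second_sub_features = roi_align(first_feature, regions,
                                    output_size=(7, 7), spatial_scale=64 / 1024)
    print(second_sub_features.shape)                       # torch.Size([2, 256, 7, 7])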
2. The method according to claim 1, wherein performing feature extraction on the image to be recognized to obtain the first feature vector comprises:
inputting the image to be recognized into a backbone network of a neural network for feature extraction;
detecting each foreground region and each background region in the image according to the first feature vector comprises:
inputting the first feature vector into a region suggestion network of the neural network to detect each foreground region and each background region in the image;
performing target detection according to the second sub-feature vectors comprises:
inputting the second sub-feature vectors into a target detection network of the neural network for target detection; and
identifying the target category according to the fifth feature vector comprises:
inputting the fifth feature vector into a target classification network of the neural network for target category identification.
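By way of illustration only (not part of the claim), the decomposition of claim 2 into four sub-networks may be sketched structurally as follows; every layer choice and size is a placeholder, since the claim names the sub-networks but not their architectures:

    import torch.nn as nn

    class TargetClassIdentificationNet(nn.Module):
        """Structural sketch only; all layers are illustrative placeholders."""
        def __init__(self, num_classes: int = 17):
            super().__init__()
            # Backbone network: extracts the first feature vector.
            self.backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU())
            # Region suggestion network: scores foreground vs. background per location.
            self.region_proposal = nn.Conv2d(256, 2, 1)
            # Target detection network: predicts a box (4 values) plus a class per region.
            self.detection_head = nn.Linear(256 * 7 * 7, 4 + num_classes)
            # Target classification network: maps the fifth feature vector to categories.
            self.classification_head = nn.Linear(1024, num_classes)

    net = TargetClassIdentificationNet()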
3. The method of claim 2, wherein the backbone network, the region suggestion network, the target detection network, and the target classification network of the neural network are obtained by a training process comprising:
acquiring a training image set, and labeling each labeled target region and its corresponding labeled target category in each frame of training image;
sequentially taking a frame of training image from the training image set and inputting it into the backbone network of the neural network for feature extraction, to obtain a first feature vector of the input training image;
inputting the first feature vector into the region suggestion network of the neural network to detect each foreground region and each background region in the input training image;
searching the first feature vector for each first sub-feature vector corresponding to each foreground region and each background region, according to the corresponding regions of the foreground regions and the background regions in the input training image;
performing first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region, respectively, to obtain second sub-feature vectors of a first fixed size;
inputting the second sub-feature vectors into the target detection network of the neural network to obtain each detection target region in the input training image and the detection target category of each detection target region;
calculating a first prediction deviation using a preset first loss function, according to the detection target regions and their detection target categories obtained by the target detection network and the labeled target regions and corresponding labeled target categories labeled in the input training image;
searching the first feature vector for each first sub-feature vector corresponding to each detection target region, according to the corresponding region of each detection target region in the input training image, and splicing the first sub-feature vectors corresponding to all detection target regions into a third feature vector;
performing adaptive global average pooling on the first feature vector to obtain a fourth feature vector, wherein the dimension of the fourth feature vector is the same as that of the third feature vector;
superimposing the fourth feature vector and the third feature vector to obtain a fifth feature vector;
inputting the fifth feature vector into the target classification network of the neural network to obtain each detection target category contained in the input training image;
calculating a second prediction deviation using a preset second loss function, according to the detection target categories contained in the input training image obtained by the target classification network and the labeled target categories of the labeled target regions labeled in the input training image;
performing weighted summation of the first prediction deviation and the second prediction deviation, and adjusting parameters of the neural network according to the weighted sum; and
when the neural network converges, taking the neural network at that time as the neural network to be finally used.
4. The method according to claim 1, wherein after the first sub-feature vectors corresponding to all target regions are spliced into the third feature vector and before the fourth feature vector is superimposed on the third feature vector, the method further comprises:
performing self-attention enhancement processing on the third feature vector to obtain a self-attention coefficient for each feature value in the third feature vector, and multiplying each feature value in the third feature vector by its self-attention coefficient to obtain a self-attention-enhanced feature vector of the third feature vector;
wherein superimposing the fourth feature vector and the third feature vector comprises:
superimposing the fourth feature vector and the self-attention-enhanced feature vector of the third feature vector.
5. The method of claim 3, wherein labeling each labeled target region and its corresponding labeled target category in each frame of training image further comprises:
labeling the contour of each labeled target in each frame of training image;
and wherein, after searching the first feature vector for each first sub-feature vector corresponding to each foreground region and each background region, and before performing the weighted summation of the first prediction deviation and the second prediction deviation, the method further comprises:
performing second interpolation processing on the first sub-feature vectors corresponding to the foreground regions and the background regions, respectively, to obtain sixth sub-feature vectors of a second fixed size;
inputting the sixth sub-feature vectors into a semantic segmentation network of the neural network to obtain the contour and category of each detection target in the input training image;
calculating a third prediction deviation using a preset third loss function, according to the contour and detection target category of each detection target in the input training image obtained by the semantic segmentation network and the contour and labeled target category of each labeled target labeled in the input training image;
wherein the weighted summation of the first prediction deviation and the second prediction deviation comprises:
performing weighted summation of the first prediction deviation, the second prediction deviation, and the third prediction deviation.
6. The method of claim 1, wherein the image to be recognized is a pipe image and the target category is a pipe defect category.
7. The method of claim 6, wherein the pipe defect category comprises one or any combination of the following: blind joint, deformation, misconnection, wall residue, penetration, corrosion, scum, scaling, undulation, tree root, disjointing, shedding, obstacle, stagger, deposition, leakage, rupture.
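For reference, the seventeen categories of claim 7 could be held in a simple label list such as the hypothetical one below (the index order is arbitrary and not fixed by the claim):

    # Hypothetical label map for the pipe defect categories of claim 7.
    PIPE_DEFECT_CLASSES = [
        "blind joint", "deformation", "misconnection", "wall residue",
        "penetration", "corrosion", "scum", "scaling", "undulation",
        "tree root", "disjointing", "shedding", "obstacle", "stagger",
        "deposition", "leakage", "rupture",
    ]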
8. A target class identification device, characterized in that the device comprises:
a feature extraction module, configured to perform feature extraction on an image to be recognized to obtain a first feature vector;
a region suggestion module, configured to detect each foreground region and each background region in the image to be recognized according to the first feature vector;
a region-of-interest alignment module, configured to search the first feature vector for each first sub-feature vector corresponding to each foreground region and each background region according to the corresponding regions of the foreground regions and the background regions in the image to be recognized, and perform first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region, respectively, to obtain second sub-feature vectors of a first fixed size; and to search the first feature vector for each first sub-feature vector corresponding to each target region according to the corresponding region, in the image to be recognized, of each target region detected by the target detection module, and splice the first sub-feature vectors corresponding to all target regions into a third feature vector;
a target detection module, configured to perform target detection according to the second sub-feature vectors to obtain each target region in the image to be recognized;
an adaptive global average pooling processing module, configured to perform adaptive global average pooling on the first feature vector to obtain a fourth feature vector, wherein the dimension of the fourth feature vector is the same as that of the third feature vector;
a feature fusion module, configured to superimpose the fourth feature vector and the third feature vector to obtain a fifth feature vector; and
a class identification module, configured to identify the target category according to the fifth feature vector to obtain each target category contained in the image to be recognized.
9. A neural network training apparatus for target class identification, characterized in that the apparatus comprises:
an image acquisition module, configured to acquire a training image set and label each labeled target region and its corresponding labeled target category in each frame of training image;
a feature extraction module, configured to sequentially take a frame of training image from the training image set and input it into a backbone network of the neural network for feature extraction, to obtain a first feature vector of the input training image;
a region suggestion module, configured to input the first feature vector into a region suggestion network of the neural network to detect each foreground region and each background region in the input training image;
a region-of-interest alignment module, configured to search the first feature vector for each first sub-feature vector corresponding to each foreground region and each background region according to the corresponding regions of the foreground regions and the background regions in the input training image, and perform first interpolation processing on each first sub-feature vector corresponding to each foreground region and each background region, respectively, to obtain second sub-feature vectors of a first fixed size; and to search the first feature vector for each first sub-feature vector corresponding to each detection target region according to the corresponding region, in the input training image, of each detection target region obtained by the target detection module, and splice the first sub-feature vectors corresponding to all detection target regions into a third feature vector;
a target detection module, configured to input the second sub-feature vectors into a target detection network of the neural network to obtain each detection target region in the input training image and the detection target category of each detection target region;
an adaptive global average pooling processing module, configured to perform adaptive global average pooling on the first feature vector to obtain a fourth feature vector, wherein the dimension of the fourth feature vector is the same as that of the third feature vector;
a feature fusion module, configured to superimpose the fourth feature vector and the third feature vector to obtain a fifth feature vector;
a class identification module, configured to input the fifth feature vector into a target classification network of the neural network to obtain each detection target category contained in the input training image; and
an adjusting module, configured to: calculate a first prediction deviation using a preset first loss function, according to the detection target regions and their detection target categories obtained by the target detection module and the labeled target regions and corresponding labeled target categories labeled in the input training image; calculate a second prediction deviation using a preset second loss function, according to the detection target categories contained in the input training image obtained by the class identification module and the labeled target categories of the labeled target regions labeled in the input training image; perform weighted summation of the first prediction deviation and the second prediction deviation, and adjust parameters of the neural network according to the weighted sum; and, when the neural network converges, take the neural network at that time as the neural network to be finally used.
10. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the target class identification method of any one of claims 1 to 7.
CN202111652406.4A 2021-12-31 2021-12-31 Target class identification method and device and readable storage medium Active CN114004963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111652406.4A CN114004963B (en) 2021-12-31 2021-12-31 Target class identification method and device and readable storage medium

Publications (2)

Publication Number Publication Date
CN114004963A true CN114004963A (en) 2022-02-01
CN114004963B CN114004963B (en) 2022-03-29

Family

ID=79932322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111652406.4A Active CN114004963B (en) 2021-12-31 2021-12-31 Target class identification method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN114004963B (en)


Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10636148B1 (en) * 2016-05-20 2020-04-28 Ccc Information Services Inc. Image processing system to detect contours of an object in a target object image
CN108268814A * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 Face recognition method and device based on fuzzy fusion of global and local features
CN106897730A * 2016-12-30 2017-06-27 陕西师范大学 SAR target model recognition method based on fused classification information and locality preserving projections
CN108509891A (en) * 2018-03-27 2018-09-07 斑马网络技术有限公司 Image labeling method, device, storage medium and electronic equipment
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
US20200401812A1 (en) * 2018-07-13 2020-12-24 Tencent Technology (Shenzhen) Company Limited Method and system for detecting and recognizing target in real-time video, storage medium, and device
CN109165644A (en) * 2018-07-13 2019-01-08 北京市商汤科技开发有限公司 Object detection method and device, electronic equipment, storage medium, program product
CN109784386A * 2018-12-29 2019-05-21 天津大学 Method for assisting object detection with semantic segmentation
CN109886933A * 2019-01-25 2019-06-14 腾讯科技(深圳)有限公司 Medical image recognition method, apparatus and storage medium
CN110516670A * 2019-08-26 2019-11-29 广西师范大学 Object detection method based on scene-level suggestion and regional self-attention module
WO2021056705A1 (en) * 2019-09-23 2021-04-01 平安科技(深圳)有限公司 Method for detecting damage to outside of human body on basis of semantic segmentation network, and related device
CN111091140A (en) * 2019-11-20 2020-05-01 南京旷云科技有限公司 Object classification method and device and readable storage medium
CN111640125A (en) * 2020-05-29 2020-09-08 广西大学 Mask R-CNN-based aerial photograph building detection and segmentation method and device
CN111833306A (en) * 2020-06-12 2020-10-27 北京百度网讯科技有限公司 Defect detection method and model training method for defect detection
CN111881849A (en) * 2020-07-30 2020-11-03 Oppo广东移动通信有限公司 Image scene detection method and device, electronic equipment and storage medium
CN112257758A (en) * 2020-09-27 2021-01-22 浙江大华技术股份有限公司 Fine-grained image recognition method, convolutional neural network and training method thereof
CN112149693A (en) * 2020-10-16 2020-12-29 上海智臻智能网络科技股份有限公司 Training method of contour recognition model and detection method of target object
CN113705293A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Image scene recognition method, device, equipment and readable storage medium
CN112990432A (en) * 2021-03-04 2021-06-18 北京金山云网络技术有限公司 Target recognition model training method and device and electronic equipment
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN113780270A (en) * 2021-03-23 2021-12-10 京东鲲鹏(江苏)科技有限公司 Target detection method and device
CN113762049A (en) * 2021-05-11 2021-12-07 腾讯科技(深圳)有限公司 Content identification method and device, storage medium and terminal equipment
CN113269257A (en) * 2021-05-27 2021-08-17 中山大学孙逸仙纪念医院 Image classification method and device, terminal equipment and storage medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BENJAMIN BISCHKE et al.: "Global-Local Feature Fusion for Image Classification of Flood Affected Roads from Social Multimedia", MediaEval *
WANG J et al.: "Collaborative learning for weakly supervised object detection", arXiv *
XUELING WEI et al.: "Medical hyperspectral image classification based on end-to-end fusion deep neural network", IEEE Transactions on Instrumentation and Measurement *
YAO H et al.: "Coarse-to-Fine Description for Fine-Grained Visual Categorization", IEEE Transactions on Image Processing *
YIN Hong et al.: "Flower image classification with selective convolutional feature fusion", Journal of Image and Graphics *
LI Xiangxia et al.: "Deep learning methods for fine-grained image classification", Journal of Frontiers of Computer Science and Technology *
YANG Dan et al.: "Fine-grained image classification algorithm based on attention mechanism", Journal of Southwest University of Science and Technology *
ZHAO Haoru et al.: "Research on fine-grained image classification algorithm based on RPN and B-CNN", Computer Applications and Software *
GUO Fan et al.: "YOLOv3: Traffic sign detection network based on attention mechanism", Journal on Communications *

Also Published As

Publication number Publication date
CN114004963B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
KR102008973B1 (en) Apparatus and Method for Detection defect of sewer pipe based on Deep Learning
CN112581463B (en) Image defect detection method and device, electronic equipment, storage medium and product
CN109858367B (en) Visual automatic detection method and system for worker through supporting unsafe behaviors
CN107808133B (en) Unmanned aerial vehicle line patrol-based oil and gas pipeline safety monitoring method and system and software memory
CN110264444B (en) Damage detection method and device based on weak segmentation
CN112446870B (en) Pipeline damage detection method, device, equipment and storage medium
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN110992349A (en) Underground pipeline abnormity automatic positioning and identification method based on deep learning
CN102682428B (en) Fingerprint image computer automatic mending method based on direction fields
CN113822880A (en) Crack identification method based on deep learning
CN109085174A (en) Display screen peripheral circuit detection method, device, electronic equipment and storage medium
Moradi et al. Real-time defect detection in sewer closed circuit television inspection videos
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN113962951B (en) Training method and device for detecting segmentation model, and target detection method and device
CN112198170A (en) Detection method for identifying water drops in three-dimensional detection of outer surface of seamless steel pipe
Fan et al. Application of YOLOv5 neural network based on improved attention mechanism in recognition of Thangka image defects
Peng et al. Research on oil leakage detection in power plant oil depot pipeline based on improved YOLO v5
Rayhana et al. Automated defect-detection system for water pipelines based on CCTV inspection videos of autonomous robotic platforms
CN114120086A (en) Pavement disease recognition method, image processing model training method, device and electronic equipment
CN109102486B (en) Surface defect detection method and device based on machine learning
CN114004963B (en) Target class identification method and device and readable storage medium
CN113469938A (en) Pipe gallery video analysis method and system based on embedded front-end processing server
CN114004838B (en) Target class identification method, training method and readable storage medium
Chen et al. Deep learning based underground sewer defect classification using a modified RegNet
CN116432078A (en) Building electromechanical equipment monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant