WO2023202400A1 - Training method and device for segmentation model, and image recognition method and device - Google Patents

Training method and device for segmentation model, and image recognition method and device

Info

Publication number
WO2023202400A1
WO2023202400A1 (PCT/CN2023/087270)
Authority
WO
WIPO (PCT)
Prior art keywords
network model
image
depth
target object
feature extraction
Prior art date
Application number
PCT/CN2023/087270
Other languages
English (en)
French (fr)
Inventor
胡永恒
马晨光
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2023202400A1

Links

Classifications

    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T 7/55 — Depth or shape recovery from multiple images
    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T 2207/10028 — Range image; depth image; 3D point clouds
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]

Definitions

  • One or more embodiments of this specification relate to artificial intelligence technology, and in particular to training methods and devices for segmentation models, and image recognition methods and devices.
  • Image recognition refers to the technology of using computers to process, analyze and understand images in order to identify target objects in various modes.
  • Image recognition technology is generally divided into face recognition, product recognition and the like. Face recognition is mainly used in security inspection, identity verification and mobile payment; product recognition is mainly used in the commodity circulation process, especially in unmanned retail scenarios such as unmanned shelves and smart retail cabinets.
  • In image recognition technology, the target object needs to be identified from the various objects contained in an image.
  • For example, in face recognition solutions, the interactive screen displays the raw data collected by the camera in real time.
  • In a queuing scenario, this process causes the faces of people who do not want to scan their faces to also appear on the screen, which inevitably creates a privacy-unfriendly feeling for queuing users, and some users may even feel that their privacy has been violated. Therefore, it is necessary to segment the target face through image recognition.
  • As another example, in product recognition, the captured image may contain products that have already been paid for as well as the currently unpaid product held by the user. Therefore, it is necessary to segment the target product currently awaiting payment through image recognition.
  • One or more embodiments of this specification describe segmentation model training methods and devices, image recognition methods and devices, which can more accurately obtain segmentation information of target objects in images.
  • A training method for a segmentation model is provided; the segmentation model includes a first network model, a second network model and a third network model. The method includes: obtaining a sample image pair, where the sample image pair includes an RGB image and a depth image obtained after shooting the same visual range; inputting the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; inputting a combined image of the depth image and the RGB image into the second network model to obtain the edge features of the target object output by the second network model; inputting the edge features of the target object and the first depth feature extraction result into the third network model to obtain the segmentation result of the target object output by the third network model; and adjusting the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
  • The labels of the sample image pairs include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image formed in advance, and the second label is obtained by performing Gaussian blur processing on the first label. Adjusting the parameters of the first network model, the second network model and the third network model includes: adjusting the parameters of the second network model and the third network model according to the difference between the first label and the segmentation result of the target object; and adjusting the parameters of the first network model according to the difference between the second label and the first depth feature extraction result.
  • After the depth image is input into the first network model, the method further includes: obtaining the contour information of the target object extracted by the intermediate-layer neural network included in the first network model, and outputting the contour information of the target object extracted by the intermediate-layer neural network to the second network model as a second depth feature extraction result.
  • After the combined image of the depth image and the RGB image is input into the second network model, and before the edge features of the target object output by the second network model are obtained, the method further includes: performing feature extraction on the combined image by the front-end neural network layers included in the second network model to obtain primary edge features;
  • the back-end neural network layers process the primary edge features and the second depth feature extraction result to obtain and output the edge features of the target object.
  • The convolution kernels and convolution strides of the first network model and the second network model are adjusted so that the primary edge features and the second depth feature extraction result correspond to the same image size.
  • Before the depth image and the RGB image are combined, the method further includes: normalizing the pixel values of the RGB image and the pixel values of the depth image, and normalizing the pixel values of pixels with null values in the depth image to 0.
  • An image recognition method includes: acquiring an RGB image to be processed and a depth image to be processed obtained after shooting the same visual range; inputting the depth image to be processed into a first network model to obtain the depth feature extraction result output by the first network model; inputting a combined image of the depth image to be processed and the RGB image to be processed into a second network model to obtain the edge features of the target object output by the second network model; and inputting the edge features and the depth feature extraction result into a third network model to obtain the segmentation result of the target object output by the third network model.
  • A training device for a segmentation model includes: a sample image acquisition module configured to acquire a sample image pair, where the sample image pair includes the RGB image and the depth image obtained after shooting the same visual range; a first network model training module configured to input the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; a second network model training module configured to input a combined image of the depth image and the RGB image into the second network model to obtain the edge features of the target object output by the second network model; a third network model training module configured to input the edge features of the target object and the first depth feature extraction result into the third network model to obtain the segmentation result of the target object output by the third network model;
  • and an adjustment module configured to adjust the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
  • The first network model training module is further configured to obtain the contour information of the target object extracted by the intermediate-layer neural network included in the first network model, and to output the contour information of the target object extracted by the intermediate-layer neural network to the second network model as the second depth feature extraction result;
  • the second network model training module is further configured to control the front-end neural network layers included in the second network model to perform feature extraction on the combined image to obtain primary edge features, and to control the back-end neural network layers included in the second network model to process the primary edge features and the second depth feature extraction result, so that the second network model outputs the edge features of the target object.
  • An image recognition device includes: an image input module configured to obtain an RGB image to be processed and a depth image to be processed obtained after shooting the same visual range; a first network model configured to receive the depth image to be processed and obtain a depth feature extraction result; a second network model configured to receive a combined image of the depth image to be processed and the RGB image to be processed and obtain the edge features of the target object; and a third network model configured to receive the edge features of the target object and the depth feature extraction result and obtain the segmentation result of the target object.
  • A computing device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method described in any embodiment of this specification is implemented.
  • The training method and device for a segmentation model and the image recognition method and device provided by the embodiments of this specification not only use the depth image in the initial stage of the training process (that is, the depth image is combined with the RGB image, and the combined image is used to obtain the edge features of the target object), but also use the depth feature extraction result of the depth image in a subsequent stage of the training process. That is to say, the combined image and the depth feature extraction result are used simultaneously to train the segmentation model. The depth information provided by the depth image is thus utilized at different stages of the training process, so that the trained segmentation model can more accurately obtain the segmentation information of the target object in the image.
  • Figure 1 is a schematic diagram of a system architecture applied to an embodiment of this specification.
  • Figure 2 is a flow chart of a training method for a segmentation model in one embodiment of this specification.
  • Figure 3A is a schematic diagram of a training method for training the first segmentation model in an embodiment of this specification.
  • Figure 3B is a schematic diagram of the application structure of the first segmentation model in an embodiment of this specification.
  • Figure 4A is a schematic diagram of a training method for training the second segmentation model in one embodiment of this specification.
  • Figure 4B is a schematic diagram of the application structure of the second segmentation model in an embodiment of this specification.
  • Figure 5 is a flow chart of a method for segmentation model training using Mode 2 in an embodiment of this specification.
  • Figure 6 is a flow chart of an image recognition method in one embodiment of this specification.
  • Figure 7 is a schematic structural diagram of a training device for a segmentation model in an embodiment of this specification.
  • Figure 8 is a schematic structural diagram of an image recognition device in an embodiment of this specification.
  • the target object needs to be accurately segmented from the image.
  • the collected images include information about two portraits, and it is necessary to segment the information of the front and center target portrait to carry out business processes such as face payment.
  • the collected images include information about three products, and it is necessary to segment the information of the target product at the front and center to carry out business processes such as payment processing for the target product.
  • the system architecture mainly includes: RGB image capturing device, depth image capturing device, and image recognition device.
  • the RGB image capturing device can capture an RGB image
  • the depth image capturing device can capture a depth image
  • the image recognition device can perform foreground segmentation based on the RGB image and the depth image to segment the information of the target object.
  • The RGB image capturing device and the depth image capturing device are usually installed at the same location to capture the same visual range. Any one or more of the RGB image capturing device, the depth image capturing device and the image recognition device can be set up as an independent device, or can be integrated inside a POS (point-of-sale) machine located in the business scenario.
  • It should be understood that the numbers of RGB image capturing devices, depth image capturing devices and image recognition devices in Figure 1 are only illustrative; any number can be selected and deployed based on implementation needs.
  • the devices in Figure 1 interact through the network.
  • the network can include various connection types, such as wired, wireless communication links, or fiber optic cables.
  • The method of the embodiments of this specification includes: first training a segmentation model based on RGB images and depth images, and then setting the segmentation model in the image recognition device, so that in actual applications the RGB image and depth image to be segmented can be input into the segmentation model in the image recognition device to obtain the segmentation information of the target object.
  • Figure 2 is a flow chart of a training method for a segmentation model in one embodiment of this specification. It can be understood that this method can be executed by any device, device, platform, or device cluster with computing and processing capabilities.
  • the segmentation model may be a joint model composed of multiple network models, specifically including: a first network model, a second network model, and a third network model.
  • The training method includes: Step 201: Obtain a sample image pair; wherein the sample image pair includes an RGB image and a depth image obtained after shooting the same visual range.
  • Step 203 Input the depth image into the first network model, and obtain the first depth feature extraction result output by the first network model.
  • Step 205 Input the combined image of the depth image and the RGB image into the second network model to obtain the edge features of the target object output by the second network model.
  • Step 207 Input the edge features of the target object and the first depth feature extraction result into the third network model, and obtain the segmentation result of the target object output by the third network model.
  • Step 209 Adjust parameters of the first network model, the second network model and the third network model according to the labels of the sample image pairs and the segmentation results of the target object.
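  • As an illustration of how steps 201 to 209 fit together, the following is a minimal PyTorch-style sketch of the forward pass of such a three-model pipeline. The module names (depth_net, edge_net, fusion_net), channel counts and the toy single-convolution stand-ins are illustrative assumptions only, not the patent's exact architecture, and the sketch assumes the three feature maps are produced at matching spatial sizes.

```python
# Minimal sketch of the three-model forward pass (steps 201-209), under the
# assumptions stated above.
import torch
import torch.nn as nn

class SegmentationPipeline(nn.Module):
    def __init__(self, depth_net: nn.Module, edge_net: nn.Module, fusion_net: nn.Module):
        super().__init__()
        self.depth_net = depth_net    # network model 1: depth image -> rough contour features
        self.edge_net = edge_net      # network model 2: RGB+D combined image -> edge features
        self.fusion_net = fusion_net  # network model 3: edges + contour -> segmentation

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Step 203: first depth feature extraction result from the depth image alone.
        depth_feat = self.depth_net(depth)
        # Step 205: 4-channel combined image (RGB + depth) -> edge features.
        combined = torch.cat([rgb, depth], dim=1)
        edge_feat = self.edge_net(combined)
        # Step 207: fuse edge features with the depth feature extraction result.
        seg_logits = self.fusion_net(torch.cat([edge_feat, depth_feat], dim=1))
        return seg_logits, depth_feat

# Toy usage with single-conv stand-ins, just to show the wiring.
depth_net = nn.Conv2d(1, 8, 3, padding=1)
edge_net = nn.Conv2d(4, 8, 3, padding=1)
fusion_net = nn.Conv2d(16, 1, 3, padding=1)
model = SegmentationPipeline(depth_net, edge_net, fusion_net)
seg, _ = model(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))  # seg: (2, 1, 64, 64)
```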
  • a segmentation model needs to be trained.
  • the depth feature extraction result that is, the rough outline information of the target object, can be obtained through the depth image.
  • The edge detail information can be obtained through the RGB image. In this way, by combining the depth image and the RGB image, the edge detail information can be supplemented on the basis of the rough outline, thereby obtaining more accurate segmentation information of the target object.
  • The depth image is not only used in the initial stage of the training process (that is, it is combined with the RGB image to obtain the edge features of the target object), but its depth feature extraction result is also used in a subsequent stage of the training process.
  • By using the first network model, the second network model and the third network model, the combined image and the depth feature extraction results can be used simultaneously to train the segmentation model. It can be seen that different stages of the training process make use of the depth information provided by the depth image, so that the trained segmentation model can more accurately obtain the segmentation information of the target object in the image.
  • the depth information provided by the depth image will be used at different stages in the training process.
  • The different stages may include at least the following two modes. Mode 1: initial stage + final stage.
  • In Mode 1, the depth information provided by the depth image is used twice in one training pass: in the initial stage and in the final stage.
  • Specifically, referring to Figure 3A: first, in the initial stage, the captured depth image is combined with the RGB image and input into network model 2, which thereby makes a first use of the depth information provided by the depth image; secondly, because the captured depth image is also input into network model 1 at the same time, network model 1 outputs the first depth feature extraction result.
  • The first depth feature extraction result is input into network model 3 together with the edge features of the target object in the image finally output by network model 2, so that the depth information provided by the depth image is used again through network model 3.
  • Mode 2: initial stage + intermediate stage + final stage.
  • In Mode 2, the depth information provided by the depth image is used three times in one training pass: in the initial stage, the intermediate stage and the final stage.
  • Specifically, referring to Figure 4A: first, in the initial stage, the captured depth image is combined with the RGB image and input into network model 2, which thereby makes a first use of the depth information provided by the depth image; secondly, the captured depth image is also input into network model 1 at the same time.
  • The intermediate-layer neural network of network model 1 also produces preliminary contour information of the target object. Although this preliminary contour information is not the final output of network model 1, it still reflects the contour of the target object at a certain segmentation accuracy.
  • Therefore, the contour information of the target object extracted by the intermediate-layer neural network is provided to network model 2 as the second depth feature extraction result (i.e. the intermediate result of depth feature extraction).
  • Network model 2 processes the edge features obtained from the combined image together with the second depth feature extraction result, thereby producing the final output of network model 2.
  • Through the second depth feature extraction result, the depth information provided by the depth image is used a second time. Finally, because the captured depth image is also input into network model 1 at the same time, network model 1 outputs the first depth feature extraction result.
  • The first depth feature extraction result is input into network model 3 together with the edge features of the target object in the image finally output by network model 2.
  • Through network model 3, the depth information provided by the depth image is used a third time.
  • Regardless of whether Mode 1 or Mode 2 is adopted, network model 3 finally outputs the segmentation information of the target object, that is, the final output result of the segmentation model is obtained.
  • Based on this segmentation information and the labels of the sample image pair, the parameters of the segmentation model can be adjusted so as to train the segmentation model.
  • The labels of the sample image pairs may include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image, that is, the ground truth of the segmentation result, and the second label is obtained by performing Gaussian blur processing on the first label.
  • When adjusting the parameters of the segmentation model, referring to Figures 3A and 4A, the parameters of network model 2 and network model 3 are adjusted according to the difference between the first label and the segmentation result of the target object, and the parameters of network model 1 are adjusted according to the difference between the second label and the first depth feature extraction result.
  • It should be noted that, in the above, the parameters of network model 2 are adjusted according to the difference between the first label and the segmentation result of the target object. In other embodiments of this specification, the edge portion of the first label, i.e. the ground truth of the segmentation result, can also be eroded to obtain a third label, and the difference between the third label and the edge features of the target object output by network model 2 can then be used to adjust the parameters of network model 2.
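  • As a concrete illustration of how the second and third labels could be derived from the manually annotated first label, the following OpenCV sketch applies the downscaling, Gaussian blur and erosion described here. The kernel sizes, sigma and the 1/4 downscaling factor are assumptions chosen for illustration (the downscaling mirrors the 1/4-size output of network model 1 described later).

```python
import cv2
import numpy as np

def make_second_label(first_label: np.ndarray, scale: float = 0.25,
                      ksize: int = 11, sigma: float = 3.0) -> np.ndarray:
    """Second label: shrink the ground-truth mask (here to 1/4 size, matching the
    assumed output size of network model 1) and apply Gaussian blur."""
    h, w = first_label.shape[:2]
    small = cv2.resize(first_label.astype(np.float32),
                       (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_NEAREST)
    return cv2.GaussianBlur(small, (ksize, ksize), sigma)

def make_third_label(first_label: np.ndarray, erode_px: int = 3) -> np.ndarray:
    """Optional third label: erode the edge portion of the ground-truth mask."""
    kernel = np.ones((erode_px, erode_px), np.uint8)
    return cv2.erode(first_label.astype(np.uint8), kernel, iterations=1)

# Example with a 256x256 binary ground-truth mask.
gt = np.zeros((256, 256), np.uint8)
gt[64:192, 64:192] = 1
second = make_second_label(gt)   # 64x64 soft mask supervising network model 1
third = make_third_label(gt)     # 256x256 eroded mask for the optional edge supervision
```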
  • the training process is usually completed through multiple rounds of training until the segmentation model converges.
  • Based on Mode 1, the application structure of the trained segmentation model, that is, the segmentation model subsequently applied in the image recognition business process, is shown in Figure 3B.
  • Based on Mode 2, the application structure of the trained segmentation model is shown in Figure 4B.
  • Step 501 Obtain a sample image pair; wherein the sample image pair includes an RGB image and a depth image obtained after shooting the same visual range.
  • the RGB shooting device and the depth image shooting device are installed at the same place so that images within the same visual range can be captured.
  • they are installed on the POS machine at the cash register.
  • the RGB image capturing device and the depth image capturing device are used to capture the person currently waiting to pay from roughly the same position, and an RGB image and a depth image are obtained.
  • Both the RGB image and the depth image contain portrait information, and are likely to contain information about multiple portraits.
  • Step 503 Input the depth image into network model 1.
  • the function of network model 1 is to extract depth structure information from the depth image, that is, the rough outline information of the target object.
  • the structure of network model 1 can be a multi-layer convolutional neural network.
  • Network model 1 may include MobileNetV2.
  • MobileNetV2 extracts the depth data features of each object in the image (such as depth face data features), and a deconvolution operation is then used to upsample the convolution result to 1/4 of the size of the input depth image.
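  • A minimal sketch of what such a depth branch could look like is given below. It is an assumption-laden illustration rather than the patent's exact network: it adapts the torchvision MobileNetV2 feature extractor (assuming a recent torchvision) to a 1-channel depth input, and the transposed-convolution channel counts are arbitrary; only the overall pattern — backbone features followed by deconvolution up to 1/4 of the input size — comes from the text.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class DepthNet(nn.Module):
    """Illustrative network model 1: MobileNetV2 features on the depth map, then
    transposed convolutions upsampling the result to 1/4 of the input resolution."""
    def __init__(self, out_channels: int = 8):
        super().__init__()
        backbone = mobilenet_v2(weights=None)
        # Replace the first conv so the backbone accepts a 1-channel depth image.
        backbone.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2,
                                            padding=1, bias=False)
        self.features = backbone.features            # downsamples by 32 overall
        self.upsample = nn.Sequential(                # 3 stride-2 deconvs: x8 -> 1/4 size
            nn.ConvTranspose2d(1280, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.features(depth))   # (N, out_channels, H/4, W/4)

feat = DepthNet()(torch.rand(1, 1, 256, 256))         # torch.Size([1, 8, 64, 64])
```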
  • Step 505 Combine the depth image and the RGB image, and input the resulting combined image into the network model 2.
  • The function of network model 2 is to use the information of the RGB image to fill in the edge detail information missing from the target object contour information obtained through the depth image, so that the contour of the segmented target object is clearer and more accurate.
  • the structure of network model 2 can be a multi-layer convolutional neural network.
  • The combined image is actually generated by concatenating the depth image and the RGB image together.
  • For example, the original RGB image has 3 channels, and the depth image is concatenated as the 4th channel to obtain the combined image.
  • To facilitate processing, the pixel values of the RGB image and of the depth image can first be normalized, for example to a value between 0 and 1, and for the depth image the pixel values of pixels with null values are normalized to 0.
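  • The construction of the combined image in step 505 can be sketched as follows. The value ranges (8-bit RGB, depth in millimetres with non-positive or non-finite values treated as "null", and a 5000 mm normalization ceiling) are illustrative assumptions; the text only specifies normalizing to a 0-1 range and mapping null depth pixels to 0.

```python
import numpy as np

def make_combined_image(rgb: np.ndarray, depth: np.ndarray,
                        max_depth: float = 5000.0) -> np.ndarray:
    """Normalize RGB and depth to [0, 1], set null depth pixels to 0, and
    concatenate the depth map as a 4th channel -> array of shape (H, W, 4)."""
    rgb_n = rgb.astype(np.float32) / 255.0                  # 3 channels -> [0, 1]
    d = depth.astype(np.float32)
    invalid = ~np.isfinite(d) | (d <= 0)                    # assumed "null" encoding
    d = np.clip(d / max_depth, 0.0, 1.0)
    d[invalid] = 0.0                                        # null depth pixels -> 0
    return np.concatenate([rgb_n, d[..., None]], axis=-1)

combined = make_combined_image(np.zeros((480, 640, 3), np.uint8),
                               np.zeros((480, 640), np.uint16))
print(combined.shape)  # (480, 640, 4)
```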
  • Step 507 In network model 1, the intermediate-layer neural network extracts the contour information of the target object; the contour information of the target object extracted by the intermediate-layer neural network is output to network model 2 as the second depth feature extraction result.
  • Step 509 Network model 1 finally outputs the contour information of the target object in the image, and inputs the contour information of the target object to network model 3 as the first depth feature extraction result.
  • Step 511 In network model 2, each front-end neural network layer extracts features from the combined image to obtain primary edge features.
  • Step 513 In network model 2, each layer of back-end neural network processes the primary edge feature and the received second depth feature extraction result, and obtains and outputs the edge feature of the target object in the image to network model 3.
  • Because the intermediate result of network model 1 (i.e. the second depth feature extraction result) and the intermediate result of network model 2 (i.e. the primary edge features) serve together as the input of the back-end neural network layers of network model 2, the two intermediate results must correspond to the same image size. The convolution kernels and convolution strides of network model 1 and network model 2 can be adjusted so that the image sizes corresponding to the two intermediate results (i.e. the primary edge features and the second depth feature extraction result) are the same.
  • It should be noted that if Mode 1 is used to train the segmentation model, then in step 513 the back-end neural network layers of network model 2 process only the primary edge features (the second depth feature extraction result is no longer used), and obtain and output the edge features of the target object in the image to network model 3.
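  • The split of network model 2 into front-end layers (which see only the combined image) and back-end layers (which additionally receive the second depth feature extraction result in Mode 2) can be sketched as below; the channel counts and layer depths are illustrative assumptions. Here the two stride-2 front-end convolutions are what bring the primary edge features to the same spatial size as the assumed 1/4-scale intermediate depth features, which is the role the text assigns to choosing convolution kernels and strides appropriately.

```python
import torch
import torch.nn as nn

class EdgeNet(nn.Module):
    """Illustrative network model 2: front-end convs produce primary edge features
    from the 4-channel combined image; back-end convs consume those features
    together with the second depth feature extraction result (Mode 2)."""
    def __init__(self, mid_channels: int = 32, depth_feat_channels: int = 8,
                 out_channels: int = 16):
        super().__init__()
        self.front = nn.Sequential(                  # combined image -> primary edge features
            nn.Conv2d(4, mid_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.back = nn.Sequential(                   # primary edges + intermediate depth contour
            nn.Conv2d(mid_channels + depth_feat_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, combined: torch.Tensor, depth_mid: torch.Tensor) -> torch.Tensor:
        primary = self.front(combined)               # strides chosen so size matches depth_mid
        return self.back(torch.cat([primary, depth_mid], dim=1))

edges = EdgeNet()(torch.rand(1, 4, 256, 256), torch.rand(1, 8, 64, 64))  # (1, 16, 64, 64)
```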
  • Step 515 In network model 3, process the edge features of the target object in the input image and the first depth feature extraction result to obtain and output the segmentation information of the target object.
  • Step 517 Adjust parameters of network model 2 and network model 3 according to the difference between the first label and the segmentation result of the target object.
  • The first label is the manually produced segmentation information of the target object in the above-mentioned RGB image or depth image. Therefore, according to the difference between the first label and the segmentation information of the target object output by network model 3, the parameters of network model 2 and network model 3 can be adjusted simultaneously to optimize the segmentation model.
  • Step 519 Perform Gaussian blur processing on the first label to obtain the second label; adjust parameters of the network model 1 based on the difference between the second label and the first depth feature extraction result.
  • Referring to step 503 above, because the final output of network model 1 is usually 1/4 the size of the input image (in step 503 the convolution result is upsampled to 1/4 the size of the input depth image), in step 519 the first label can be reduced to 1/4 of its original size before Gaussian blur processing is applied to obtain the second label.
  • The second label is generated from the manual label, that is, the first label. Therefore, according to the difference between the second label and the first depth feature extraction result output by network model 1, the parameters of network model 1 can be adjusted to optimize network model 1, so that the optimized network model 1 can continue to be used to train the segmentation model in the subsequent training process.
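  • The two parameter-update paths of steps 517 and 519 can be written as the following training-step sketch. The choice of binary cross-entropy losses and of two separate optimizers is an assumption for illustration (the text only speaks of adjusting parameters according to the two differences), and the sketch assumes the segmentation output and the depth feature extraction result are 1-channel maps matching their respective labels in size.

```python
import torch
import torch.nn.functional as F

def training_step(model1, model2, model3, optim_m1, optim_m23,
                  rgb, depth, first_label, second_label):
    """One update: the first label supervises models 2 and 3 (step 517), the
    Gaussian-blurred second label supervises model 1 (step 519)."""
    depth_feat = model1(depth)                        # first depth feature extraction result
    combined = torch.cat([rgb, depth], dim=1)
    edge_feat = model2(combined)                      # Mode-1-style call, for brevity
    seg_logits = model3(torch.cat([edge_feat, depth_feat], dim=1))

    # Step 517: difference between first label and segmentation result -> models 2 and 3.
    loss_seg = F.binary_cross_entropy_with_logits(seg_logits, first_label)
    optim_m23.zero_grad()
    loss_seg.backward(retain_graph=True)              # keep the graph for the second backward
    optim_m23.step()

    # Step 519: difference between second label and depth extraction result -> model 1.
    loss_depth = F.binary_cross_entropy_with_logits(depth_feat, second_label)
    optim_m1.zero_grad()                               # discard gradients left by loss_seg
    loss_depth.backward()
    optim_m1.step()
    return loss_seg.item(), loss_depth.item()
```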
  • multiple sets of sample images may be used to perform the training process from step 501 to step 519 multiple times until the segmentation model converges.
  • The embodiments of this specification not only use the original depth image in the initial stage of the training process (that is, the depth image is combined with the RGB image, and the combined image is used to train the segmentation model), but also use the processing results of the depth image twice in subsequent stages of the training process (that is, the depth feature extraction results are used to train the segmentation model). In other words, the combined image and the depth feature extraction results are used simultaneously to train the segmentation model. It can be seen that the depth information provided by the depth image is used at different stages of the training process, so that the trained segmentation model can more accurately obtain the segmentation information of the target object in the image.
  • the image recognition method includes: Step 601: Obtain the RGB image to be processed and the depth image to be processed obtained after shooting the same visual range.
  • Step 603 Input the depth image to be processed into the network model 1 in the segmentation model, and obtain the depth feature extraction result output by the network model 1.
  • Step 605 Input the combined image of the depth image to be processed and the RGB image to be processed into the network model 2 in the segmentation model, and obtain the edge features of the target object output by the network model 2.
  • Step 607 Input the edge features of the target object and the depth feature extraction result into the network model 3 in the segmentation model, and obtain the segmentation result of the target object output by the network model 3.
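  • At inference time, steps 601 to 607 amount to a single forward pass through the trained models. A minimal sketch is shown below, assuming Mode-1-style models (network model 2 takes only the combined image) and the same preprocessing as in the earlier illustrative snippets; the 0.5 threshold for the final mask is likewise an assumption.

```python
import torch

@torch.no_grad()
def recognize(model1, model2, model3, rgb, depth):
    """Steps 601-607: run the trained segmentation model on an RGB/depth pair."""
    depth_feat = model1(depth)                        # step 603
    combined = torch.cat([rgb, depth], dim=1)         # step 605 (4-channel combined image)
    edge_feat = model2(combined)
    seg_logits = model3(torch.cat([edge_feat, depth_feat], dim=1))  # step 607
    return torch.sigmoid(seg_logits) > 0.5            # binary mask of the target object
```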
  • The segmentation information of the target object can then be used for subsequent business processing.
  • For example, because the information of the target portrait has been segmented out from multiple portraits, liveness detection and face payment can be carried out specifically on the face of the target portrait. In this way, the problem of the faces of people who do not want to scan their faces appearing on the screen during face recognition is avoided, and the feeling of privacy insecurity caused by people in the queue appearing on the screen is also resolved; the overall process is friendlier to user privacy.
  • A training device for a segmentation model includes: a sample image acquisition module 701 configured to acquire a sample image pair, where the sample image pair includes an RGB image and a depth image obtained after shooting the same visual range; a first network model training module 702 configured to input the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; a second network model training module 703 configured to combine the depth image with the RGB image and input the combined image into the second network model to obtain the edge features of the target object output by the second network model; a third network model training module 704 configured to input the edge features of the target object and the first depth feature extraction result into the third network model to obtain the segmentation result of the target object output by the third network model;
  • and an adjustment module 705 configured to adjust the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
  • The labels of the sample image pairs include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image formed in advance, and the second label is obtained by performing Gaussian blur processing on the first label; the adjustment module 705 is configured to: adjust the parameters of the second network model and the third network model according to the difference between the first label and the segmentation result of the target object; and adjust the parameters of the first network model according to the difference between the second label and the first depth feature extraction result.
  • The first network model training module 702 is further configured to: obtain the contour information of the target object extracted by the intermediate-layer neural network included in the first network model, and output the contour information of the target object extracted by the intermediate-layer neural network to the second network model as the second depth feature extraction result; the second network model training module 703 is then further configured to: control the front-end neural network layers included in the second network model to perform feature extraction on the combined image to obtain primary edge features, and control the back-end neural network layers included in the second network model to process the primary edge features and the second depth feature extraction result, so that the second network model outputs the edge features of the target object.
  • The adjustment module 705 is configured to: adjust the convolution kernels and convolution strides of the first network model and the second network model so that the primary edge features and the second depth feature extraction result correspond to the same image size.
  • The second network model training module 703 is further configured to: before combining the depth image with the RGB image, normalize the pixel values of the RGB image and of the depth image, and normalize the pixel values of pixels with null values in the depth image to 0.
  • an image recognition device includes: an image input module 801, configured to obtain an RGB image to be processed and a depth image to be processed obtained after shooting the same visual range; a first network model 802, configured to receive the to-be-processed depth image to obtain the depth feature extraction result; the second network model 803 is configured to receive the combined image of the depth image to be processed and the RGB image to be processed to obtain the edge features of the target object; the third network model 804, It is configured to receive the edge feature of the target object and the depth feature extraction result to obtain the segmentation result of the target object.
  • One embodiment of this specification provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed in the computer, the computer is caused to execute the method in any embodiment in the specification.
  • One embodiment of this specification provides a computing device, including a memory and a processor.
  • the memory stores executable code.
  • the processor executes the executable code, it implements the method in any embodiment of the specification.
  • the structures illustrated in the embodiments of this specification do not constitute specific limitations on the devices of the embodiments of this specification.
  • the above-mentioned device may include more or less components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this specification provide a training method and device for a segmentation model, and an image recognition method and device. The segmentation model includes a first network model, a second network model and a third network model. The training method includes: obtaining a sample image pair, where the sample image pair includes an RGB image and a depth image obtained by photographing the same visual range; inputting the depth image into the first network model to obtain a first depth feature extraction result; inputting a combined image of the depth image and the RGB image into the second network model to obtain edge features of the target object; inputting the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object; and adjusting the parameters of the first network model, the second network model and the third network model according to the label and the segmentation result of the target object. The embodiments of this specification can obtain the segmentation information of the target object more accurately.

Description

Training method and device for a segmentation model, and image recognition method and device — Technical Field
One or more embodiments of this specification relate to artificial intelligence technology, and in particular to training methods and devices for segmentation models, and to image recognition methods and devices.
Background
Image recognition refers to the technology of using computers to process, analyze and understand images in order to identify target objects in various modes. Image recognition technology is generally divided into face recognition, product recognition and the like. Face recognition is mainly used in security inspection, identity verification and mobile payment; product recognition is mainly used in the commodity circulation process, especially in unmanned retail scenarios such as unmanned shelves and smart retail cabinets.
In image recognition technology, the target object needs to be identified from the various objects contained in an image. For example, in face recognition solutions, the interactive screen displays the raw data collected by the camera in real time; in a queuing scenario, this causes the faces of people who do not want to scan their faces to also appear on the screen, which inevitably creates a privacy-unfriendly feeling for queuing users, and some users may even feel that their privacy has been violated. It is therefore necessary to segment the target face through image recognition. As another example, in product recognition, the captured image may contain products that the user has already paid for as well as the product currently awaiting payment held by the user; it is therefore necessary to segment the target product to be paid for through image recognition.
However, the image recognition methods of the related art cannot segment the target object accurately.
Summary
One or more embodiments of this specification describe a training method and device for a segmentation model, and an image recognition method and device, which can obtain the segmentation information of a target object in an image more accurately.
According to a first aspect, a training method for a segmentation model is provided. The segmentation model includes a first network model, a second network model and a third network model. The method includes: obtaining a sample image pair, where the sample image pair includes an RGB image and a depth image obtained by photographing the same visual range; inputting the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; inputting a combined image of the depth image and the RGB image into the second network model to obtain edge features of the target object output by the second network model; inputting the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object output by the third network model; and adjusting the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
The labels of the sample image pair include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image formed in advance, and the second label is obtained by performing Gaussian blur processing on the first label. Adjusting the parameters of the first network model, the second network model and the third network model includes: adjusting the parameters of the second network model and the third network model according to the difference between the first label and the segmentation result of the target object; and adjusting the parameters of the first network model according to the difference between the second label and the first depth feature extraction result.
After the depth image is input into the first network model, the method further includes: obtaining contour information of the target object extracted by an intermediate-layer neural network included in the first network model, and outputting this contour information to the second network model as a second depth feature extraction result. After the combined image of the depth image and the RGB image is input into the second network model, and before the edge features of the target object output by the second network model are obtained, the method further includes: performing feature extraction on the combined image by the front-end neural network layers included in the second network model to obtain primary edge features; and processing the primary edge features and the second depth feature extraction result by the back-end neural network layers included in the second network model, so as to obtain and output the edge features of the target object.
The convolution kernels and convolution strides of the first network model and the second network model are adjusted so that the primary edge features and the second depth feature extraction result correspond to the same image size.
Before the depth image and the RGB image are combined, the method further includes: normalizing the pixel values of the RGB image and of the depth image, and normalizing the pixel values of pixels with null values in the depth image to 0.
According to a second aspect, an image recognition method is provided, including: obtaining an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range; inputting the depth image to be processed into a first network model to obtain a depth feature extraction result output by the first network model; inputting a combined image of the depth image to be processed and the RGB image to be processed into a second network model to obtain edge features of the target object output by the second network model; and inputting the edge features of the target object and the depth feature extraction result into a third network model to obtain a segmentation result of the target object output by the third network model.
According to a third aspect, a training device for a segmentation model is provided, including: a sample image acquisition module configured to obtain a sample image pair, where the sample image pair includes the RGB image and the depth image obtained by photographing the same visual range; a first network model training module configured to input the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; a second network model training module configured to input a combined image of the depth image and the RGB image into the second network model to obtain edge features of the target object output by the second network model; a third network model training module configured to input the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object output by the third network model; and an adjustment module configured to adjust the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
The first network model training module is further configured to obtain contour information of the target object extracted by the intermediate-layer neural network included in the first network model, and to output this contour information to the second network model as a second depth feature extraction result. The second network model training module is further configured to control the front-end neural network layers included in the second network model to perform feature extraction on the combined image to obtain primary edge features, and to control the back-end neural network layers included in the second network model to process the primary edge features and the second depth feature extraction result, so that the second network model outputs the edge features of the target object.
According to a fourth aspect, an image recognition device is provided, including: an image input module configured to obtain an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range; a first network model configured to receive the depth image to be processed and obtain a depth feature extraction result; a second network model configured to receive a combined image of the depth image to be processed and the RGB image to be processed and obtain edge features of the target object; and a third network model configured to receive the edge features of the target object and the depth feature extraction result and obtain a segmentation result of the target object.
According to a fifth aspect, a computing device is provided, including a memory and a processor. The memory stores executable code, and when the processor executes the executable code, the method described in any embodiment of this specification is implemented.
The training method and device for a segmentation model and the image recognition method and device provided by the embodiments of this specification not only use the depth image in the initial stage of the training process (that is, the depth image is combined with the RGB image and the combined image is used to obtain the edge features of the target object), but also use the depth feature extraction result of the depth image in a subsequent stage of the training process. In other words, the combined image and the depth feature extraction result are used simultaneously to train the segmentation model. The depth information provided by the depth image is thus utilized at different stages of the training process, so that the trained segmentation model can obtain the segmentation information of the target object in the image more accurately.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this specification more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this specification, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Figure 1 is a schematic diagram of a system architecture applied in an embodiment of this specification.
Figure 2 is a flow chart of a training method for a segmentation model in an embodiment of this specification.
Figure 3A is a schematic diagram of a training method for training the first segmentation model in an embodiment of this specification.
Figure 3B is a schematic diagram of the application structure of the first segmentation model in an embodiment of this specification.
Figure 4A is a schematic diagram of a training method for training the second segmentation model in an embodiment of this specification.
Figure 4B is a schematic diagram of the application structure of the second segmentation model in an embodiment of this specification.
Figure 5 is a flow chart of a method for training the segmentation model using Mode 2 in an embodiment of this specification.
Figure 6 is a flow chart of an image recognition method in an embodiment of this specification.
Figure 7 is a schematic structural diagram of a training device for a segmentation model in an embodiment of this specification.
Figure 8 is a schematic structural diagram of an image recognition device in an embodiment of this specification.
Detailed Description
As mentioned above, the target object needs to be accurately segmented from the image. For example, during face recognition, the collected image contains information about two portraits, and the information of the frontmost, centered target portrait needs to be segmented out in order to carry out business processes such as face payment. As another example, during product recognition, the collected image contains information about three products, and the information of the frontmost, centered target product needs to be segmented out in order to carry out business processes such as payment processing for the target product.
To facilitate understanding of the method provided in this specification, the system architecture involved and applicable to this specification is first described. As shown in Figure 1, the system architecture mainly includes an RGB image capturing device, a depth image capturing device and an image recognition device.
The RGB image capturing device can capture RGB images, the depth image capturing device can capture depth images, and the image recognition device can perform foreground segmentation based on the RGB image and the depth image to segment out the information of the target object. In practical applications, the RGB image capturing device and the depth image capturing device are usually installed at the same location so as to photograph the same visual range. Any one or more of the RGB image capturing device, the depth image capturing device and the image recognition device can be provided in an independent device, or can be integrated inside a POS (point-of-sale) machine located in the business scenario.
It should be understood that the numbers of RGB image capturing devices, depth image capturing devices and image recognition devices in Figure 1 are only illustrative; any number can be selected and deployed according to implementation needs.
The devices in Figure 1 interact through a network. The network can include various connection types, such as wired or wireless communication links or fiber optic cables.
The method of the embodiments of this specification includes: first training a segmentation model based on RGB images and depth images, and then setting the segmentation model in the image recognition device, so that in actual applications the RGB image and depth image to be segmented can be input into the segmentation model in the image recognition device to obtain the segmentation information of the target object.
The training method of the segmentation model in the embodiments of this specification is described below.
Figure 2 is a flow chart of a training method for a segmentation model in an embodiment of this specification. It can be understood that this method can be executed by any apparatus, device, platform or device cluster with computing and processing capabilities. Referring to Figure 2, in the embodiments of this specification the segmentation model may be a joint model composed of multiple network models, specifically including a first network model, a second network model and a third network model. The training method includes: Step 201: Obtain a sample image pair, where the sample image pair includes an RGB image and a depth image obtained by photographing the same visual range.
Step 203: Input the depth image into the first network model to obtain a first depth feature extraction result output by the first network model.
Step 205: Input a combined image of the depth image and the RGB image into the second network model to obtain edge features of the target object output by the second network model.
Step 207: Input the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object output by the third network model.
Step 209: Adjust the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
As can be seen from the flow shown in Figure 2, in order to obtain the segmentation information of the target object in the image more accurately, a segmentation model needs to be trained. When training the segmentation model, not only the RGB image but also the depth image is used. The depth feature extraction result, i.e. the rough contour information of the target object, can be obtained from the depth image, and the edge detail information can be obtained from the RGB image. In this way, by combining the depth image and the RGB image, the edge detail information can be supplemented on the basis of the rough contour, thereby obtaining more accurate segmentation information of the target object.
Moreover, in the process shown in Figure 2, the depth image is not only used in the initial stage of the training process (that is, the depth image is combined with the RGB image to obtain the edge features of the target object), but the depth feature extraction result of the depth image is also used in a subsequent stage of the training process. By using the first network model, the second network model and the third network model, the combined image and the depth feature extraction result are used simultaneously to train the segmentation model. The depth information provided by the depth image is thus utilized at different stages of the training process, so that the trained segmentation model can obtain the segmentation information of the target object in the image more accurately.
In the embodiments of this specification, as mentioned above, the depth information provided by the depth image is utilized at different stages of the training process. These different stages may include at least the following two modes. Mode 1: initial stage + final stage.
In Mode 1, the depth information provided by the depth image is used twice in one training pass, in the initial stage and in the final stage. Specifically, referring to Figure 3A: first, in the initial stage, the captured depth image is combined with the RGB image and input into network model 2, which thereby makes a first use of the depth information provided by the depth image; second, because the captured depth image is also input into network model 1 at the same time, network model 1 outputs the first depth feature extraction result, which is input into network model 3 together with the edge features of the target object in the image finally output by network model 2, so that the depth information provided by the depth image is used again through network model 3.
Mode 2: initial stage + intermediate stage + final stage.
In Mode 2, the depth information provided by the depth image is used three times in one training pass: in the initial stage, the intermediate stage and the final stage. Specifically, referring to Figure 4A: first, in the initial stage, the captured depth image is combined with the RGB image and input into network model 2, which thereby makes a first use of the depth information provided by the depth image. Second, the captured depth image is also input into network model 1 at the same time, and the intermediate-layer neural network of network model 1 also produces preliminary contour information of the target object. Although this preliminary contour information is not the final output of network model 1, it still reflects the contour of the target object at a certain segmentation accuracy. Therefore, the contour information extracted by the intermediate-layer neural network is provided to network model 2 as a second depth feature extraction result (i.e. an intermediate result of depth feature extraction), and network model 2 processes the edge features obtained from the combined image together with this second depth feature extraction result to produce its final output. It can be seen that, in the intermediate processing of network model 2, the depth information provided by the depth image is used a second time through the second depth feature extraction result. Finally, because the captured depth image is also input into network model 1, network model 1 outputs the first depth feature extraction result, which is input into network model 3 together with the edge features of the target object in the image finally output by network model 2; through network model 3, the depth information provided by the depth image is used a third time.
Regardless of whether Mode 1 or Mode 2 is adopted, in the embodiments of this specification network model 3 finally outputs the segmentation information of the target object, i.e. the final output of the segmentation model. Based on this segmentation information and the label of the sample image pair, the parameters of the segmentation model can be adjusted so as to train the segmentation model.
The labels of the sample image pair may include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image, i.e. the ground truth of the segmentation result, and the second label is obtained by performing Gaussian blur processing on the first label.
When adjusting the parameters of the segmentation model, referring to Figures 3A and 4A, the parameters of network model 2 and network model 3 are adjusted according to the difference between the first label and the segmentation result of the target object, and the parameters of network model 1 are adjusted according to the difference between the second label and the first depth feature extraction result.
It should be noted that, in the above, the parameters of network model 2 are adjusted according to the difference between the first label and the segmentation result of the target object. In other embodiments of this specification, the edge portion of the first label, i.e. the ground truth of the segmentation result, can also be eroded to obtain a third label, and the parameters of network model 2 can then be adjusted using the difference between the third label and the edge features of the target object output by network model 2.
The training process is usually completed through multiple rounds of training until the segmentation model converges. Based on Mode 1, the application structure of the trained segmentation model, i.e. the segmentation model subsequently applied in the image recognition business process, is shown in Figure 3B. Based on Mode 2, the application structure of the trained segmentation model is shown in Figure 4B.
The process shown in Figure 2 is described in detail below with reference to a specific embodiment, taking Mode 2 as an example. Referring to Figures 4A, 4B and 5, the process specifically includes: Step 501: Obtain a sample image pair, where the sample image pair includes an RGB image and a depth image obtained by photographing the same visual range.
Usually, the RGB capturing device and the depth image capturing device are installed at the same location so that images within the same visual range can be captured; for example, both are installed on the POS machine at the cash register. Taking a face payment scenario as an example, the RGB image capturing device and the depth image capturing device photograph the person currently waiting to pay from roughly the same position, and an RGB image and a depth image are obtained. Both the RGB image and the depth image contain portrait information, and are likely to contain information about multiple portraits.
Step 503: Input the depth image into network model 1.
The function of network model 1 is to extract depth structure information, i.e. the rough contour information of the target object, from the depth image. The structure of network model 1 can be a multi-layer convolutional neural network.
Network model 1 may include MobileNetV2. In network model 1, MobileNetV2 extracts the depth data features of each object in the image (for example, depth face data features), and a deconvolution operation is then used to upsample the convolution result to 1/4 of the size of the input depth image.
Step 505: Combine the depth image with the RGB image, and input the resulting combined image into network model 2.
The function of network model 2 is to use the information of the RGB image to fill in the edge detail information missing from the target object contour information obtained from the depth image, so that the contour of the segmented target object is clearer and more accurate. The structure of network model 2 can be a multi-layer convolutional neural network.
In step 505, the combined image is actually generated by concatenating the depth image and the RGB image; for example, if the original RGB image has 3 channels, the depth image is concatenated as the 4th channel to obtain the combined image.
Furthermore, in step 505, to facilitate processing, the pixel values of the RGB image and of the depth image can first be normalized, for example to a value between 0 and 1, and for the depth image the pixel values of pixels whose values are null are normalized to 0.
Step 507: In network model 1, the intermediate-layer neural network extracts the contour information of the target object, and this contour information is output to network model 2 as the second depth feature extraction result.
Step 509: Network model 1 finally outputs the contour information of the target object in the image, and this contour information is input into network model 3 as the first depth feature extraction result.
Step 511: In network model 2, the front-end neural network layers perform feature extraction on the combined image to obtain primary edge features.
Step 513: In network model 2, the back-end neural network layers process the primary edge features and the received second depth feature extraction result to obtain the edge features of the target object in the image and output them to network model 3.
In step 513, because the intermediate processing result of network model 1 (i.e. the second depth feature extraction result) and the intermediate processing result of network model 2 (i.e. the primary edge features) need to serve together as the input of the back-end neural network layers of network model 2, the two intermediate results must have the same size, i.e. the same image size. In the embodiments of this specification, the convolution kernels and convolution strides of network model 1 and network model 2 can be adjusted so that the image sizes corresponding to the two intermediate results (i.e. the primary edge features and the second depth feature extraction result) are the same.
It should be noted that if Mode 1 is used to train the segmentation model, then in step 513 the back-end neural network layers of network model 2 process only the primary edge features (the second depth feature extraction result is no longer used), and the edge features of the target object in the image are obtained and output to network model 3.
Step 515: In network model 3, the edge features of the target object in the input image and the first depth feature extraction result are processed to obtain and output the segmentation information of the target object.
Step 517: Adjust the parameters of network model 2 and network model 3 according to the difference between the first label and the segmentation result of the target object.
In step 517, the first label is the manually produced segmentation information of the target object in the above RGB image or depth image. Therefore, according to the difference between the first label and the segmentation information of the target object output by network model 3, the parameters of network model 2 and network model 3 can be adjusted simultaneously in order to optimize the segmentation model.
Step 519: Perform Gaussian blur processing on the first label to obtain the second label, and adjust the parameters of network model 1 according to the difference between the second label and the first depth feature extraction result.
Referring to Figures 3A and 4A, a deconvolution operation can first be applied to the first depth feature extraction result output by network model 1, and the difference between the result of the deconvolution operation and the second label can then be used to adjust the parameters of network model 1.
Referring to step 503 above, because the final output of network model 1 is usually 1/4 the size of the input image (in step 503 the convolution result is upsampled to 1/4 the size of the input depth image), in step 519 the first label can be reduced to 1/4 of its original size before Gaussian blur processing is applied to obtain the second label.
In step 519, the second label is generated from the manual label, i.e. the first label. Therefore, according to the difference between the second label and the first depth feature extraction result output by network model 1, the parameters of network model 1 can be adjusted so as to optimize network model 1, so that the optimized network model 1 can continue to be used to train the segmentation model in the subsequent training process.
In the embodiments of this specification, multiple groups of sample images may be used to perform the training process from step 501 to step 519 multiple times until the segmentation model converges.
As can be seen from the process shown in Figure 5, the embodiments of this specification not only use the original depth image in the initial stage of the training process (that is, the depth image is combined with the RGB image and the combined image is used to train the segmentation model), but also use the processing results of the depth image twice in subsequent stages of the training process (that is, the depth feature extraction results are used to train the segmentation model). In other words, the combined image and the depth feature extraction results are used simultaneously to train the segmentation model. The depth information provided by the depth image is thus utilized at different stages of the training process, so that the trained segmentation model can obtain the segmentation information of the target object in the image more accurately.
Image recognition can be performed using a segmentation model trained by the method of any embodiment of this specification. Referring to Figure 6, in one embodiment of this specification the image recognition method includes: Step 601: Obtain an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range.
Step 603: Input the depth image to be processed into network model 1 of the segmentation model to obtain the depth feature extraction result output by network model 1.
Step 605: Input a combined image of the depth image to be processed and the RGB image to be processed into network model 2 of the segmentation model to obtain the edge features of the target object output by network model 2.
Step 607: Input the edge features of the target object and the depth feature extraction result into network model 3 of the segmentation model to obtain the segmentation result of the target object output by network model 3.
After the segmentation information of the target object (for example, the information of the portrait currently awaiting payment, or the information of the product currently awaiting payment) is obtained from the image using the method shown in Figure 6, the segmentation information of the target object can be used for subsequent business processing. For example, because the information of the target portrait has been segmented out from multiple portraits, liveness detection and face payment can be carried out specifically on the face of the target portrait. In this way, the problem of the faces of people who do not want to scan their faces appearing on the screen during face recognition can be avoided, and the feeling of privacy insecurity caused by queuing people appearing on the screen is also resolved; the overall process is friendlier to user privacy.
In one embodiment of this specification, a training device for a segmentation model is provided. Referring to Figure 7, the device includes: a sample image acquisition module 701 configured to obtain a sample image pair, where the sample image pair includes an RGB image and a depth image obtained by photographing the same visual range; a first network model training module 702 configured to input the depth image into the first network model to obtain a first depth feature extraction result output by the first network model; a second network model training module 703 configured to combine the depth image with the RGB image and input the combined image into the second network model to obtain edge features of the target object output by the second network model; a third network model training module 704 configured to input the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object output by the third network model; and an adjustment module 705 configured to adjust the parameters of the first network model, the second network model and the third network model according to the label of the sample image pair and the segmentation result of the target object.
In one embodiment of the training device of this specification shown in Figure 7, the labels of the sample image pair include a first label and a second label, where the first label is a manually produced segmentation result of the RGB image or the depth image formed in advance, and the second label is obtained by performing Gaussian blur processing on the first label. The adjustment module 705 is configured to: adjust the parameters of the second network model and the third network model according to the difference between the first label and the segmentation result of the target object; and adjust the parameters of the first network model according to the difference between the second label and the first depth feature extraction result.
In one embodiment of the training device of this specification shown in Figure 7, the first network model training module 702 is further configured to: obtain contour information of the target object extracted by the intermediate-layer neural network included in the first network model, and output this contour information to the second network model as a second depth feature extraction result. The second network model training module 703 is then further configured to: control the front-end neural network layers included in the second network model to perform feature extraction on the combined image to obtain primary edge features, and control the back-end neural network layers included in the second network model to process the primary edge features and the second depth feature extraction result, so that the second network model outputs the edge features of the target object.
In one embodiment of the training device of this specification shown in Figure 7, the adjustment module 705 is configured to adjust the convolution kernels and convolution strides of the first network model and the second network model so that the primary edge features and the second depth feature extraction result correspond to the same image size.
In one embodiment of the training device of this specification shown in Figure 7, the second network model training module 703 is further configured to: before the depth image and the RGB image are combined, normalize the pixel values of the RGB image and of the depth image, and normalize the pixel values of pixels with null values in the depth image to 0.
In one embodiment of this specification, an image recognition device is provided. Referring to Figure 8, the device includes: an image input module 801 configured to obtain an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range; a first network model 802 configured to receive the depth image to be processed and obtain a depth feature extraction result; a second network model 803 configured to receive a combined image of the depth image to be processed and the RGB image to be processed and obtain edge features of the target object; and a third network model 804 configured to receive the edge features of the target object and the depth feature extraction result and obtain a segmentation result of the target object.
One embodiment of this specification provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the method of any embodiment of this specification.
One embodiment of this specification provides a computing device, including a memory and a processor. The memory stores executable code, and when the processor executes the executable code, the method of any embodiment of this specification is implemented.
It can be understood that the structures illustrated in the embodiments of this specification do not constitute specific limitations on the devices of the embodiments of this specification. In other embodiments of this specification, the above devices may include more or fewer components than shown in the figures, some components may be combined, some components may be split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Since the information interaction and execution processes between the modules in the above devices and systems are based on the same concept as the method embodiments of this specification, the specific content can be found in the description of the method embodiments of this specification and will not be repeated here.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts can be found in the description of the method embodiments.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in this disclosure can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations above further describe the purpose, technical solutions and beneficial effects of this disclosure in detail. It should be understood that the above are only specific implementations of this disclosure and are not intended to limit the protection scope of this disclosure. Any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of this disclosure shall be included within the protection scope of this disclosure.

Claims (10)

  1. A training method for a segmentation model, the segmentation model comprising a first network model, a second network model and a third network model, wherein the method comprises:
    obtaining a sample image pair, wherein the sample image pair comprises an RGB image and a depth image obtained by photographing the same visual range;
    inputting the depth image into the first network model to obtain a first depth feature extraction result output by the first network model;
    inputting a combined image of the depth image and the RGB image into the second network model to obtain edge features of a target object output by the second network model;
    inputting the edge features of the target object and the first depth feature extraction result into the third network model to obtain a segmentation result of the target object output by the third network model;
    adjusting parameters of the first network model, the second network model and the third network model according to a label of the sample image pair and the segmentation result of the target object.
  2. The method according to claim 1, wherein
    the labels of the sample image pair comprise a first label and a second label, wherein the first label is a manually produced segmentation result of the RGB image or the depth image formed in advance, and the second label is obtained by performing Gaussian blur processing on the first label;
    the adjusting of the parameters of the first network model, the second network model and the third network model comprises:
    adjusting parameters of the second network model and the third network model according to a difference between the first label and the segmentation result of the target object;
    adjusting parameters of the first network model according to a difference between the second label and the first depth feature extraction result.
  3. The method according to claim 1, wherein
    after the depth image is input into the first network model, the method further comprises: obtaining contour information of the target object extracted by an intermediate-layer neural network comprised in the first network model, and outputting the contour information of the target object extracted by the intermediate-layer neural network to the second network model as a second depth feature extraction result;
    after the combined image of the depth image and the RGB image is input into the second network model, and before the edge features of the target object output by the second network model are obtained, the method further comprises:
    performing feature extraction on the combined image by front-end neural network layers comprised in the second network model to obtain primary edge features;
    processing the primary edge features and the second depth feature extraction result by back-end neural network layers comprised in the second network model, so as to obtain and output the edge features of the target object.
  4. The method according to claim 3, wherein convolution kernels and convolution strides of the first network model and the second network model are adjusted so that the primary edge features and the second depth feature extraction result correspond to the same image size.
  5. The method according to claim 1, wherein, before the depth image and the RGB image are combined, the method further comprises:
    normalizing pixel values of the RGB image and pixel values of the depth image, and normalizing pixel values of pixels with null values in the depth image to 0.
  6. An image recognition method, comprising:
    obtaining an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range;
    inputting the depth image to be processed into a first network model to obtain a depth feature extraction result output by the first network model;
    inputting a combined image of the depth image to be processed and the RGB image to be processed into a second network model to obtain edge features of a target object output by the second network model;
    inputting the edge features of the target object and the depth feature extraction result into a third network model to obtain a segmentation result of the target object output by the third network model.
  7. A training device for a segmentation model, comprising:
    a sample image acquisition module configured to obtain a sample image pair, wherein the sample image pair comprises an RGB image and a depth image obtained by photographing the same visual range;
    a first network model training module configured to input the depth image into a first network model to obtain a first depth feature extraction result output by the first network model;
    a second network model training module configured to combine the depth image with the RGB image and input the combined image into a second network model to obtain edge features of a target object output by the second network model;
    a third network model training module configured to input the edge features of the target object and the first depth feature extraction result into a third network model to obtain a segmentation result of the target object output by the third network model;
    an adjustment module configured to adjust parameters of the first network model, the second network model and the third network model according to a label of the sample image pair and the segmentation result of the target object.
  8. The device according to claim 7, wherein
    the first network model training module is further configured to: obtain contour information of the target object extracted by an intermediate-layer neural network comprised in the first network model, and output the contour information of the target object extracted by the intermediate-layer neural network to the second network model as a second depth feature extraction result;
    the second network model training module is further configured to: control front-end neural network layers comprised in the second network model to perform feature extraction on the combined image to obtain primary edge features, and control back-end neural network layers comprised in the second network model to process the primary edge features and the second depth feature extraction result, so that the second network model outputs the edge features of the target object.
  9. An image recognition device, comprising:
    an image input module configured to obtain an RGB image to be processed and a depth image to be processed, obtained by photographing the same visual range;
    a first network model configured to receive the depth image to be processed and obtain a depth feature extraction result;
    a second network model configured to receive a combined image of the depth image to be processed and the RGB image to be processed and obtain edge features of a target object;
    a third network model configured to receive the edge features of the target object and the depth feature extraction result and obtain a segmentation result of the target object.
  10. A computing device, comprising a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method according to any one of claims 1 to 6 is implemented.
PCT/CN2023/087270 2022-04-19 2023-04-10 Training method and device for segmentation model, and image recognition method and device WO2023202400A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210407639.6A CN114913338A (zh) 2022-04-19 2022-04-19 Training method and device for segmentation model, and image recognition method and device
CN202210407639.6 2022-04-19

Publications (1)

Publication Number Publication Date
WO2023202400A1 true WO2023202400A1 (zh) 2023-10-26

Family

ID=82765109

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087270 WO2023202400A1 (zh) 2022-04-19 2023-04-10 分割模型的训练方法及装置、图像识别方法及装置

Country Status (2)

Country Link
CN (1) CN114913338A (zh)
WO (1) WO2023202400A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913338A (zh) * 2022-04-19 2022-08-16 支付宝(杭州)信息技术有限公司 分割模型的训练方法及装置、图像识别方法及装置
CN115797817B (zh) * 2023-02-07 2023-05-30 科大讯飞股份有限公司 一种障碍物识别方法、障碍物显示方法、相关设备和系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215236A1 (zh) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112634296A (zh) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal with gate-mechanism-guided edge information distillation
CN113743417A (zh) * 2021-09-03 2021-12-03 北京航空航天大学 Semantic segmentation method and semantic segmentation device
US20220066456A1 (en) * 2016-02-29 2022-03-03 AI Incorporated Obstacle recognition method for autonomous robots
CN114913338A (zh) * 2022-04-19 2022-08-16 支付宝(杭州)信息技术有限公司 Training method and device for segmentation model, and image recognition method and device


Also Published As

Publication number Publication date
CN114913338A (zh) 2022-08-16

Similar Documents

Publication Publication Date Title
WO2023202400A1 (zh) Training method and device for segmentation model, and image recognition method and device
KR102299847B1 (ko) 얼굴 인증 방법 및 장치
EP3525165B1 (en) Method and apparatus with image fusion
US20230260321A1 (en) System And Method For Scalable Cloud-Robotics Based Face Recognition And Face Analysis
US11321575B2 (en) Method, apparatus and system for liveness detection, electronic device, and storage medium
WO2022001509A1 (zh) 图像优化方法、装置、计算机存储介质以及电子设备
CN108389224B (zh) 图像处理方法及装置、电子设备和存储介质
CN113343826B (zh) 人脸活体检测模型的训练方法、人脸活体检测方法及装置
CN110188670B (zh) 一种虹膜识别中的人脸图像处理方法、装置和计算设备
CN108388889B (zh) 用于分析人脸图像的方法和装置
CN111160202A (zh) 基于ar设备的身份核验方法、装置、设备及存储介质
CN111222433B (zh) 自动人脸稽核方法、系统、设备及可读存储介质
US20230036338A1 (en) Method and apparatus for generating image restoration model, medium and program product
JP2020013553A (ja) 端末装置に適用される情報生成方法および装置
CN112396050B (zh) 图像的处理方法、设备以及存储介质
CN112991180A (zh) 图像拼接方法、装置、设备以及存储介质
EP3017399B1 (en) Payment card ocr with relaxed alignment
CN113221767B (zh) 训练活体人脸识别模型、识别活体人脸的方法及相关装置
CN113642639A (zh) 活体检测方法、装置、设备和存储介质
CN113255512B (zh) 用于活体识别的方法、装置、设备以及存储介质
CN114764839A (zh) 动态视频生成方法、装置、可读存储介质及终端设备
CN115205939B (zh) 人脸活体检测模型训练方法、装置、电子设备及存储介质
CN113642428B (zh) 人脸活体检测方法、装置、电子设备及存储介质
US10282602B1 (en) Systems and methods for capturing electronic signatures
US20230061009A1 (en) Document detection in digital images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791056

Country of ref document: EP

Kind code of ref document: A1