WO2020177651A1 - Image segmentation method and image processing apparatus - Google Patents

Image segmentation method and image processing apparatus

Info

Publication number
WO2020177651A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
image
matrix
feature
elements
Application number
PCT/CN2020/077366
Other languages
English (en)
French (fr)
Inventor
田值
贺通
沈春华
颜友亮
许松岑
周一韧
吴小飞
刘健庄
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2020177651A1
Priority to US 17/383,181 (published as US 12008797 B2)

Classifications

    • G06V10/457: Local feature extraction by analysing connectivity, e.g. edge linking, connected component analysis or slices
    • G06F18/2135: Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F18/2137: Feature extraction based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/253: Fusion techniques of extracted features
    • G06N20/00: Machine learning
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/7715: Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
    • G06V10/82: Arrangements for image or video recognition or understanding using neural networks

Definitions

  • This application relates to the field of computer vision, and in particular to an image segmentation method and image processing device.
  • Computer vision is an integral part of intelligent systems in many application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military applications. It studies how to use cameras/video cameras and computers to acquire the data and information about a subject that we need. Figuratively speaking, it equips the computer with eyes (a camera/camcorder) and a brain (algorithms) so that the computer can identify, track, and measure targets in place of human eyes and thereby perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data.
  • In general, computer vision uses various imaging systems in place of visual organs to obtain input images, and the computer then replaces the brain to process and interpret these input images.
  • The ultimate goal of computer vision research is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • An important part of image understanding in computer vision is image semantic segmentation (semantic segmentation). More and more application scenarios require accurate and efficient image semantic segmentation, such as photography, video playback, autonomous driving, indoor navigation, and even virtual reality and augmented reality.
  • Image semantic segmentation accurately segments the parts of the input image that need to be processed, so that the different segmented parts can then be processed accordingly. For example, a user can take an image with a mobile terminal, and the portrait area in the captured image is automatically segmented so that special effects can be added, such as adjusting the depth of field, changing the background, retaining only the color of the portrait area, or blurring the image area outside the portrait.
  • CNN: convolutional neural network
  • The embodiments of the present application provide an image segmentation method and an image processing apparatus that perform information fusion on high-level feature maps, which can improve segmentation accuracy while reducing the amount of computation and the memory overhead.
  • An embodiment of the present application provides an image segmentation method, which includes: obtaining an input image and a processing requirement, where the processing requirement is used to instruct that target processing be performed on a target feature map group obtained by performing image segmentation on the input image; performing multi-level feature extraction on the input image to obtain multiple feature maps; down-sampling the multiple feature maps to obtain multiple feature maps with a reference resolution, the reference resolution being lower than the resolution of the input image; fusing the multiple feature maps with the reference resolution to obtain at least one feature map group; up-sampling the feature map group by using a transformation matrix W to obtain the target feature map group, where the target feature map group and the input image have the same resolution, the transformation matrix W is obtained by modeling the training data of the image segmentation task, and one of the dimensions of the transformation matrix W is the same as the number of channels of the feature map group; and performing, according to the processing requirement, the target processing on the target feature map group to obtain a target image.
  • In this way, a transformation matrix is used to up-sample the feature map group obtained by fusing lower-resolution feature maps to obtain the target feature map group, which effectively reduces memory occupation and the amount of computation while keeping the image segmentation accuracy high.
  • The up-sampling of the feature map group by using the transformation matrix W to obtain the target feature map group includes: respectively calculating the products of (H×W) one-dimensional matrices each including C elements and the transformation matrix W to obtain (H×W) one-dimensional matrices each including P elements, where the (H×W) one-dimensional matrices each including C elements consist of the elements at the same positions in the C (H×W) two-dimensional matrices included in the feature map group, H and W are the two spatial dimensions of the feature map group, and C is the number of channels of the feature map group; and respectively performing feature arrangement on the (H×W) one-dimensional matrices each including P elements to obtain the target feature map group, where the target feature map group includes at least one (A×B×N) sub-matrix, each (A×B×N) sub-matrix is obtained from one of the (H×W) one-dimensional matrices each including P elements, and H, W, C, N, P, M, A, and B are all integers greater than 0.
  • In this implementation, the transformation matrix is used to up-sample the feature map group obtained by fusing lower-resolution feature maps, so the image segmentation result of the input image can be obtained quickly and the operation is simple.
  • The respectively performing feature arrangement on the (H×W) one-dimensional matrices each including P elements to obtain the target feature map group includes: determining, according to any one of the (H×W) one-dimensional matrices each including P elements, (A×B) one-dimensional matrices each including N elements; and using an (A×B×N) three-dimensional matrix obtained from the (A×B) one-dimensional matrices each including N elements as a sub-matrix included in the target feature map group.
  • In this implementation, feature arrangement is performed on the (H×W) one-dimensional matrices each including P elements to obtain the target feature map group, which is simple to implement.
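  • For illustration only, the following NumPy sketch shows one way such an up-sampling step could be realized: each C-element vector is multiplied by the transformation matrix W and the resulting P-element vectors are rearranged into (A×B×N) sub-matrices. The shapes, variable names, and random data are assumptions, not part of the claimed method.

```python
import numpy as np

# Assumed shapes (illustrative only): feature map group F of shape (H, W, C),
# transformation matrix Wm of shape (C, P) with P = A * B * N.
H, W, C = 32, 32, 64          # fused feature map group: spatial size and channels
A, B, N = 4, 4, 8             # per-position block size and output channels
P = A * B * N

F = np.random.randn(H, W, C).astype(np.float32)    # fused feature map group
Wm = np.random.randn(C, P).astype(np.float32)      # transformation matrix W

# 1) For each of the H*W element positions, multiply its C-element vector by W.
proj = F.reshape(H * W, C) @ Wm                     # (H*W, P)

# 2) Feature arrangement: each P-element vector becomes an (A, B, N) sub-matrix,
#    and the sub-matrices are tiled to form the target feature map group.
sub = proj.reshape(H, W, A, B, N)
target = sub.transpose(0, 2, 1, 3, 4).reshape(H * A, W * B, N)

print(target.shape)   # (H*A, W*B, N): the resolution of the input image
```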
  • The training data includes M annotated images, and any one of the M annotated images is an (H×W×N) three-dimensional matrix. The transformation matrix W is obtained by the following operations: respectively obtaining at least one (A×B×N) sub-matrix corresponding to each of the M annotated images to obtain multiple (A×B×N) sub-matrices; obtaining multiple vectors each including P elements from the multiple (A×B×N) sub-matrices, where one vector including P elements is obtained from each of the multiple (A×B×N) sub-matrices; performing principal component analysis on the multiple vectors each including P elements to obtain a (P×P) two-dimensional matrix; and using a (C×P) sub-matrix included in the (P×P) two-dimensional matrix as the transformation matrix W.
  • In this implementation, the transformation matrix is obtained from the annotated images, so that it can be used to up-sample the feature map group obtained by fusing lower-resolution feature maps.
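  • As a hedged illustration of how such a transformation matrix might be derived by principal component analysis of annotated data, the following NumPy sketch extracts (A×B×N) blocks from toy label maps, flattens them into P-element vectors, and keeps the first C principal components as W. All shapes and the random stand-in data are assumptions.

```python
import numpy as np

A, B, N, C = 4, 4, 8, 64          # block size, label channels, and kept components
P = A * B * N

def extract_blocks(label_img, A, B):
    """Split an annotated image into non-overlapping (A, B, N) blocks, flattened to P elements."""
    Hi, Wi, Ni = label_img.shape
    blocks = label_img.reshape(Hi // A, A, Wi // B, B, Ni)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, A * B * Ni)     # (num_blocks, P)

# Toy annotated images standing in for the real training labels.
labels = [np.random.rand(128, 128, N) for _ in range(8)]
vectors = np.concatenate([extract_blocks(img, A, B) for img in labels], axis=0)

# Principal component analysis: eigenvectors of the covariance give a (P, P) basis;
# the first C components form the (C, P) transformation matrix W.
centered = vectors - vectors.mean(axis=0, keepdims=True)
cov = centered.T @ centered / (len(centered) - 1)                      # (P, P)
eigvals, eigvecs = np.linalg.eigh(cov)
basis = eigvecs[:, ::-1].T                                             # rows sorted by decreasing variance
Wm = basis[:C]                                                         # (C, P) transformation matrix W
```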
  • The performing multi-level feature extraction on the input image to obtain multiple feature maps includes: performing a convolution operation on the input image to obtain a first feature map, and performing a convolution operation on the (K-1)th feature map to obtain a Kth feature map, where the Kth feature map is a feature map of the reference resolution, the resolution of the Kth feature map is not higher than the resolution of the (K-1)th feature map, K is an integer greater than 1, and the multiple feature maps include the K feature maps.
  • The down-sampling of the multiple feature maps to obtain multiple feature maps with the reference resolution includes: down-sampling the first feature map to obtain a feature map of the reference resolution, and down-sampling the (K-1)th feature map to obtain a feature map of the reference resolution.
  • The fusing of the plurality of feature maps with the reference resolution to obtain at least one feature map group includes: splicing the plurality of feature maps with the reference resolution in the channel dimension to obtain the at least one feature map group, where the feature map group is an (H×W×C) three-dimensional matrix and corresponds to the C (H×W) two-dimensional matrices. The respectively calculating the products of the (H×W) one-dimensional matrices each including C elements and the transformation matrix W to obtain the (H×W) one-dimensional matrices each including P elements includes: respectively calculating the product of the one-dimensional matrix corresponding to each element position in the feature map group and the transformation matrix to obtain the (H×W) one-dimensional matrices each including P elements, where the elements included in the one-dimensional matrix corresponding to one element position in the feature map group are the elements at the same element position in each of the C (H×W) two-dimensional matrices.
  • In this implementation, the product of the one-dimensional matrix corresponding to each element position in the feature map group and the transformation matrix is calculated separately to obtain the (H×W) one-dimensional matrices each including P elements, which are then arranged to obtain the target feature map group; the operation is simple.
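  • A minimal sketch of the fusion step, assuming naive strided downsampling and illustrative shapes (neither is specified by the application), is as follows.

```python
import numpy as np

def downsample(fm, factor):
    """Naive strided downsampling of an (h, w, c) feature map; a stand-in only."""
    return fm[::factor, ::factor, :]

f1 = np.random.randn(128, 128, 16)             # low-level, high-resolution feature map
f2 = np.random.randn(64, 64, 24)
f3 = np.random.randn(32, 32, 24)               # deepest map, already at the reference resolution

# Splice all maps along the channel dimension at the reference resolution (32 x 32).
fused = np.concatenate([downsample(f1, 4), downsample(f2, 2), f3], axis=-1)

print(fused.shape)   # (32, 32, 64): feature map group with C = 16 + 24 + 24 channels
```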
  • The method further includes: obtaining the transformation matrix W; processing a training sample by using a convolutional neural network to obtain an image segmentation result of the training sample, where the training sample is included in the training data; determining, according to the image segmentation result of the training sample and the standard result corresponding to the training sample, the loss corresponding to the training sample, where the standard result is the expected result of processing the training sample by using the convolutional neural network; and updating the parameters of the convolutional neural network with the loss corresponding to the training sample through an optimization algorithm. The performing multi-level feature extraction on the input image to obtain multiple feature maps includes: inputting the input image into the convolutional neural network to perform multi-level feature extraction to obtain the multiple feature maps.
  • In this implementation, the convolutional neural network is trained so that it can be used to perform multi-level feature extraction on the input image to obtain the multiple feature maps.
  • An embodiment of the present application provides an image processing apparatus, which includes: an acquisition unit, configured to acquire an input image and a processing requirement, where the processing requirement is used to instruct that target processing be performed on a target feature map group obtained by performing image segmentation on the input image; and a processing unit, configured to perform multi-level feature extraction on the input image to obtain multiple feature maps; down-sample the multiple feature maps to obtain multiple feature maps with a reference resolution, the reference resolution being lower than the resolution of the input image; fuse the multiple feature maps with the reference resolution to obtain at least one feature map group; up-sample the feature map group by using a transformation matrix W to obtain the target feature map group, where the target feature map group and the input image have the same resolution, the transformation matrix W is obtained by modeling the training data of the image segmentation task, and one of the dimensions of the transformation matrix W is the same as the number of channels of the feature map group; and perform, according to the processing requirement, the target processing on the target feature map group to obtain a target image.
  • The processing unit is specifically configured to: determine, according to any one of the (H×W) one-dimensional matrices each including P elements, (A×B) one-dimensional matrices each including N elements; and use an (A×B×N) three-dimensional matrix obtained from the (A×B) one-dimensional matrices each including N elements as a sub-matrix included in the target feature map group.
  • Any one of the M annotated images is an (H×W×N) three-dimensional matrix. The processing unit is configured to: respectively obtain at least one (A×B×N) sub-matrix corresponding to each of the M annotated images to obtain multiple (A×B×N) sub-matrices; obtain multiple vectors each including P elements from the multiple (A×B×N) sub-matrices, where one vector including P elements is obtained from each of the multiple (A×B×N) sub-matrices; perform principal component analysis on the multiple vectors each including P elements to obtain a (P×P) two-dimensional matrix; and use a (C×P) sub-matrix included in the (P×P) two-dimensional matrix as the transformation matrix W.
  • The processing unit is specifically configured to: perform a convolution operation on the input image to obtain a first feature map, and perform a convolution operation on the (K-1)th feature map to obtain a Kth feature map, where the Kth feature map is a feature map of the reference resolution, the resolution of the Kth feature map is not higher than the resolution of the (K-1)th feature map, K is an integer greater than 1, and the multiple feature maps include the K feature maps; and down-sample the first feature map to obtain a feature map of the reference resolution, and down-sample the (K-1)th feature map to obtain a feature map of the reference resolution.
  • The processing unit is specifically configured to: splice the multiple feature maps with the reference resolution in the channel dimension to obtain the at least one feature map group, where the feature map group is an (H×W×C) three-dimensional matrix corresponding to the C (H×W) two-dimensional matrices; and respectively calculate the product of the one-dimensional matrix corresponding to each element position in the feature map group and the transformation matrix to obtain the (H×W) one-dimensional matrices each including P elements, where the elements included in the one-dimensional matrix corresponding to one element position in the feature map group are the elements at the same element position in each of the C (H×W) two-dimensional matrices.
  • The processing unit is further configured to: obtain the transformation matrix W; process a training sample by using a convolutional neural network to obtain an image segmentation result of the training sample, where the training sample is included in the training data; determine, according to the image segmentation result of the training sample and the standard result corresponding to the training sample, the loss corresponding to the training sample, where the standard result is the expected result of processing the training sample by using the convolutional neural network; and update the parameters of the convolutional neural network with the loss corresponding to the training sample through an optimization algorithm. The processing unit is specifically configured to input the input image into the convolutional neural network to perform multi-level feature extraction to obtain the multiple feature maps.
  • An embodiment of the present application provides another image processing apparatus, including a processor and a memory that are connected to each other, where the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect described above.
  • An embodiment of the present application provides a computer-readable storage medium, where the computer storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect described above.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of a convolutional neural network provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • FIG. 4 is a flowchart of a method for training a convolutional neural network provided by an embodiment of the application
  • FIG. 5 is a flowchart of a method for generating a transformation matrix from training samples according to an embodiment of the application
  • FIG. 6 is a schematic diagram of a calculation process of a transformation matrix provided by an embodiment of the application.
  • FIG. 7 is a flowchart of an image segmentation method provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of using a convolutional neural network to process an input image to obtain feature maps with K reference resolutions according to an embodiment of the application;
  • FIG. 9 is a schematic diagram of an upsampling process provided by an embodiment of this application.
  • FIG. 10 is a schematic diagram of a feature map fusion process and an upsampling process provided by an embodiment of this application;
  • FIG. 11 is a schematic diagram of a feature map fusion process provided by an embodiment of this application.
  • FIG. 12 is a flowchart of another image segmentation method provided by an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of an image processing apparatus provided by this application.
  • FIG. 14 is a schematic structural diagram of a processing unit provided by this application.
  • FIG. 15 is a schematic structural diagram of a training device for a convolutional neural network provided by this application.
  • FIG. 16 is a schematic diagram of the hardware structure of a training device for a convolutional neural network provided by an embodiment of the present application;
  • FIG. 17 is a schematic diagram of the hardware structure of an image processing apparatus provided by an embodiment of the present application.
  • Image semantic segmentation accurately segments the parts of the image to be processed that need processing, so that the different segmented parts can then be processed accordingly.
  • The image segmentation method provided in the embodiments of the present application can be applied to scenarios such as photographing, video shooting, and autonomous driving. The following briefly introduces the application of the method in the photographing scenario, the video shooting scenario, and the autonomous driving scenario.
  • Photographing scenario: The user takes an image with a mobile terminal (such as a mobile phone).
  • The mobile terminal automatically segments the target object (such as a portrait) in the captured image to facilitate adding special effects, such as adjusting the depth of field, changing the background, retaining only the color of the area where the target object is located, or blurring the image area outside the area where the target object is located.
  • For example, when the user uses the camera function of the mobile terminal, image semantic segmentation is performed on the collected images in real time, so that the foreground subject is kept sharp and the background is blurred, achieving the large-aperture effect of a single-lens reflex camera.
  • For another example, after a user takes an image with a mobile terminal, the user can select a portrait whose color needs to be retained, and the mobile terminal retains only the color of the region where the portrait is located in the image. Alternatively, after a user captures an image with a mobile terminal, the mobile terminal automatically segments the target object (such as a portrait) in the captured image, so that the user can adjust the area of the image outside the area where the target object is located, for example by adjusting the depth of field or changing the background.
  • Video shooting scenario 1: The user turns on the video shooting function of the mobile terminal. During video shooting, image semantic segmentation is performed in real time; after the portrait area is segmented, only the color of the portrait area is retained, realizing portrait color retention in the video.
  • Video shooting scenario 2: The user turns on the video shooting function of the mobile terminal. When there are multiple people in the scene, all portraits are segmented. The user can freely select the target portrait that needs to be kept sharp, and the mobile terminal blurs all parts of the image except the area where the target portrait is located, achieving a movie-mode effect.
  • Autonomous driving scenario: An autonomous driving device (such as a car) performs image semantic segmentation on the collected images in real time. After each object in the image is segmented, object detection is performed on each segmented object, so as to identify pedestrians, obstacles, vehicles, and the like more accurately.
  • The mobile terminal uses lower-resolution feature maps to perform feature map fusion, which improves the accuracy of image semantic segmentation while greatly reducing the amount of computation and the memory consumption.
  • The training method of the convolutional neural network provided by the embodiments of this application involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning. It performs symbolized and formalized intelligent information modeling, extraction, preprocessing, training, and the like on training data (such as the input image in this application), and finally obtains a trained convolutional neural network. The image segmentation method provided in the embodiments of this application can use the above-mentioned trained convolutional neural network: input data (such as the input image in this application) is fed into the trained convolutional neural network to obtain output data (such as the image segmentation result in this application).
  • The training method of the convolutional neural network and the image segmentation method provided by the embodiments of this application are based on the same idea, and can also be understood as two parts of a system or two stages of an overall process, namely the model training stage and the model application stage.
  • A convolutional neural network is a deep neural network with a convolutional structure.
  • The convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve an input image or convolution feature map, and output a convolution feature plane.
  • the convolution feature plane can also be called a feature map.
  • The convolutional layer refers to a neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in the adjacent layer.
  • A convolutional layer usually contains several feature planes, and each feature plane can be composed of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the weight matrix corresponding to the shared weights is the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part; therefore, the same learned representation can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • The convolution kernel can be initialized in the form of a matrix of random size. During training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of weight sharing is that it reduces the number of connections between the layers of the convolutional neural network while reducing the risk of overfitting.
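  • For readers unfamiliar with the operation, the following unoptimized NumPy sketch shows a convolution with shared weights and multiple kernels (stride 1, no padding); it is illustrative only and not the network used in the embodiments.

```python
import numpy as np

def conv2d(x, kernels):
    """Convolve an (h, w, c_in) input with n_k kernels of shape (kh, kw, c_in)."""
    h, w, c_in = x.shape
    n_k, kh, kw, _ = kernels.shape
    out = np.zeros((h - kh + 1, w - kw + 1, n_k), dtype=x.dtype)
    for k in range(n_k):                          # one output channel per kernel (shared weights)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(x[i:i + kh, j:j + kw, :] * kernels[k])
    return out

x = np.random.randn(8, 8, 3).astype(np.float32)
kernels = np.random.randn(4, 3, 3, 3).astype(np.float32)   # 4 kernels -> 4 output channels
print(conv2d(x, kernels).shape)   # (6, 6, 4)
```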
  • The convolutional neural network can use the backpropagation (BP) algorithm to correct the values of its parameters during training, so that the error loss between the predicted value output by the convolutional neural network and the actually desired target value becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters in the initial convolutional neural network are updated by backpropagating the error loss information, so that the error loss converges.
  • The backpropagation algorithm is a backward pass dominated by the error loss, and aims to obtain the optimal parameters of the convolutional neural network, such as the weight matrices, that is, the convolution kernels of the convolutional layers.
  • an embodiment of the present invention provides a system architecture 100.
  • the data collection device 160 is used to collect training data.
  • The training data includes one or more labeled images (i.e., training samples) and the real data corresponding to the one or more labeled images, where the real data is the result expected to be obtained by processing the one or more labeled images with the convolutional neural network.
  • The training data can be stored in the database 130, and the training device 120 can perform training based on the training data maintained in the database 130 to obtain the target model/rule 101 (101 is the model trained in the training phase introduced earlier, and may be a convolutional neural network used to implement the image semantic segmentation operation).
  • An annotated image corresponds to a real result, that is, ground truth.
  • The target model/rule 101 can be used to implement the image semantic segmentation method provided by the embodiments of this application; that is, the input image (or the image information obtained after relevant preprocessing) is input into the target model/rule 101 to obtain the image segmentation result.
  • the target model/rule 101 in the embodiment of this application may specifically be a convolutional neural network obtained by training.
  • the convolutional neural network is obtained by training an initialized convolutional neural network.
  • the training data maintained in the database 130 may not all come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of this application.
  • the target model/rule 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1.
  • The execution device 110 may be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR) device, a virtual reality (VR) device, or an in-vehicle terminal, or it may be a server.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • In the embodiment of the present application, the input data may include an input image, which may be an image collected by the execution device 110 through the data collection device 160, an image in the database 130, or an image from the client device 140.
  • the preprocessing module 113 is configured to perform preprocessing according to the input data (such as the input image) received by the I/O interface 112.
  • The preprocessing module 113 may be used to implement one or more of the operations of image filtering, image preprocessing enhancement, image preprocessing smoothing, image preprocessing restoration, and the like, and may also be used to implement other preprocessing operations, which are not limited in this application.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, code, and the like in the data storage system 150 to implement the corresponding processing.
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the image processing result obtained above, to the client device 140, so as to provide it to the user.
  • the training device 120 can train for different goals or tasks and obtain corresponding target models/rules 101 based on different training data.
  • The corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, so as to provide users with the desired results.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
  • The client device 140 can automatically send input data to the I/O interface 112. If automatic sending of the input data by the client device 140 requires the user's authorization, the user can set the corresponding permission in the client device 140.
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be a specific manner such as display, sound, and action.
  • The client device 140 can also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output result of the I/O interface 112 as new sample data and storing them in the database 130 as shown in the figure. Alternatively, the collection may not be performed by the client device 140; instead, the I/O interface 112 directly stores the input data of the I/O interface 112 and the output result of the I/O interface 112 as new sample data in the database 130 as shown in the figure.
  • It should be noted that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships among the devices, components, modules, and the like shown in FIG. 1 do not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 may also be placed in the execution device 110.
  • The target model/rule 101 obtained by training based on the training data may be a convolutional neural network for the image semantic segmentation task.
  • A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction by means of machine learning algorithms.
  • A CNN is a feed-forward artificial neural network, and each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a convolutional neural network layer 230.
  • The convolutional layer/pooling layer 220 shown in FIG. 2 may include layers 221 to 226, as in the following examples. In one example, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another example, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 can include many convolution operators.
  • The convolution operator is also called a convolution kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can essentially be a weight matrix, which is usually predefined. During convolution on the image, the weight matrix is moved across the input image in the horizontal direction, one pixel at a time or two pixels at a time depending on the value of the stride, to complete the work of extracting specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image.
  • The depth dimension is also the channel dimension, which corresponds to the number of channels. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; however, in most cases a single weight matrix is not used, and multiple weight matrices of the same size (rows × columns), that is, multiple homogeneous matrices, are applied instead.
  • The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image.
  • For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur out unwanted noise in the image, and so on.
  • The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by these weight matrices also have the same size, and the multiple extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through extensive training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • The initial convolutional layers (such as 221) often extract more general features, which can also be called low-level features and correspond to high-resolution feature maps. As the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (such as 226) become more and more complex, such as high-level semantic features, which correspond to low-resolution feature maps. Features with higher-level semantics are more suitable for the problem to be solved.
  • A convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the only purpose of the pooling layer is to reduce the spatial size of the image.
  • The pooling layer may include an average pooling operator and/or a maximum pooling operator, which can be used to sample the input image to obtain an image of smaller size, and can likewise be used to sample the feature map output by a convolutional layer to obtain a feature map of smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
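  • A minimal NumPy sketch of average and maximum pooling over non-overlapping 2×2 windows (the window size and shapes are illustrative assumptions) is as follows.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Pool an (h, w, c) map over non-overlapping size x size windows."""
    h, w, c = x.shape
    blocks = x[:h - h % size, :w - w % size, :].reshape(h // size, size, w // size, size, c)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.random.randn(6, 6, 3)
print(pool2d(x, 2, "max").shape, pool2d(x, 2, "avg").shape)   # (3, 3, 3) (3, 3, 3)
```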
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the image segmentation result or other related information), the convolutional neural network 200 needs the convolutional neural network layer 230 to generate the image segmentation result. Therefore, the convolutional neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240, and the parameters contained in the multiple hidden layers can be obtained by pre-training based on the relevant training data of a specific task type.
  • the task type may include image semantic segmentation, image classification, image super-resolution reconstruction, and so on.
  • the hidden layer may perform a series of processing on the feature map output by the convolutional layer/pooling layer 220 to obtain the image segmentation result. The process of how to obtain the image segmentation result from the feature map output by the convolutional layer/pooling layer 220 will be detailed later, and will not be described in detail here.
  • The output layer 240 has a loss function similar to classification cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 is completed (as shown in FIG. 2, propagation in the direction from 210 to 240 is forward propagation), back propagation (as shown in FIG. 2, propagation in the direction from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer (that is, the above-mentioned image processing result) and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • FIG. 3 is a hardware structure of a chip provided by an embodiment of the present invention.
  • the chip includes a convolutional neural network processor 30.
  • the chip may be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111.
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101.
  • the algorithms of each layer in the convolutional neural network as shown in Figure 2 can be implemented in the chip as shown in Figure 3.
  • The neural network processor 30 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale exclusive-OR operation processing.
  • the NPU can be mounted as a coprocessor to a central processing unit (CPU), that is, a main CPU (Host CPU), and the main CPU allocates tasks to it, such as image processing tasks.
  • the core part of the NPU is the arithmetic circuit 303.
  • the arithmetic circuit 303 is controlled by the controller 304 to extract matrix data in the memory (301 and 302) and perform multiplication and addition operations.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 obtains the weight value of the matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit 303.
  • the arithmetic circuit 303 fetches the input data of matrix A from the input memory 301, and performs matrix operations based on the input data of matrix A and the weight value of matrix B, and the partial result or final result of the obtained matrix is stored in an accumulator 308 .
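  • As a rough software illustration of the multiply-accumulate behaviour described above (not the actual circuit), the following sketch accumulates partial products of an input matrix A and a weight matrix B into an accumulator.

```python
import numpy as np

A = np.random.randn(4, 8)            # input data matrix A
B = np.random.randn(8, 6)            # weight matrix B
acc = np.zeros((4, 6))               # accumulator for partial results
for k in range(A.shape[1]):          # accumulate one rank-1 partial product per step
    acc += np.outer(A[:, k], B[k, :])
assert np.allclose(acc, A @ B)       # equals the full matrix product
```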
  • the input data can be an input image, and the weight matrix is the convolution kernel.
  • the weight data can also be called a weight matrix.
  • the unified memory 306 is used to store input data and output data.
  • the weight matrix is directly transferred to the weight memory 302 through the direct memory access controller (DMAC) 305 of the storage unit.
  • the input data is also transferred to the unified memory 306 through the DMAC.
  • the output data is the result of image segmentation.
  • The bus interface unit (BIU) 310 is used for the interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used for the instruction fetch buffer 309 to obtain instructions from the external memory, and for the storage unit access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 306, or to transfer the weight data to the weight memory 302, or to transfer the input data to the input memory 301.
  • the vector calculation unit 307 may include multiple arithmetic processing units, and if necessary, further process the output of the arithmetic circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 307 is mainly used for the calculation of non-convolutional layers or fully connected layers (FC) in the convolutional neural network. Specifically, it can handle: pooling, normalization, etc. .
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 307 generates a normalized value, a combined value, or both.
  • the vector calculation unit 307 stores the processed vector to the unified memory 306.
  • The vector processed by the vector calculation unit 307 can be used as the activation input of the arithmetic circuit 303, for example for use in subsequent layers of the convolutional neural network. As shown in FIG. 2, if the current processing layer is hidden layer 1 (231), the vector processed by the vector calculation unit 307 can also be used for the calculation in hidden layer 2 (232).
  • the fetch memory 309 connected to the controller 304 is used to store instructions used by the controller 304.
  • the unified memory 306, the input memory 301, the weight memory 302, and the fetch memory 309 are all On-Chip memories.
  • the external memory can be independent of the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 2 can be executed by the arithmetic circuit 303 or the vector calculation unit 307.
  • The following uses Embodiment 1 to describe in more detail how the training device 120 obtains the target model/rule 101 based on training data, that is, how a convolutional neural network for implementing the image segmentation method provided in the embodiments of the present application is trained based on the training data.
  • FIG. 4 is a training method 400 of a convolutional neural network provided by Embodiment 1 of the present invention.
  • the method may include:
  • the training device obtains a transformation matrix.
  • the training device needs to use a transformation matrix in S403, so the training device needs to obtain the transformation matrix.
  • the training device may obtain the transformation matrix from the database 130, may also obtain the transformation matrix from other devices, and may also obtain the transformation matrix from training samples. The method for obtaining the transformation matrix from the training sample will be detailed later.
  • the training device initializes the convolutional neural network.
  • Initializing the convolutional neural network includes initializing the convolution kernel of each layer of the convolutional neural network and the parameters of other layers (such as the pooling layer, the convolutional neural network layer, and the fully connected layer).
  • the training device can use any initialization method, for example, using Gaussian distribution random number sampling, uniform distribution random number sampling and other methods to initialize the convolutional neural network.
  • the training device uses the convolutional neural network to process the training sample to obtain an image segmentation result of the training sample.
  • the image segmentation result of the training sample is used to indicate the region where the target object in the training sample is located.
  • the training sample can be understood as an input image, and the processing performed by the training device using the convolutional neural network on the training sample is the same as the processing performed by the executing device using the convolutional neural network on the input image.
  • the training device may also preprocess the training samples before using the convolutional neural network to process the training samples. For example, image filtering, image preprocessing enhancement, image preprocessing smoothing, and image preprocessing restoration are performed on training samples.
  • the image processing device may also perform other image preprocessing operations on the training sample, which is not limited in this application.
  • Image filtering mainly includes adjusting the image size and performing denoising and smoothing processing on the noise in the zoomed image.
  • Image preprocessing enhancement is to selectively enhance and suppress the information in the image to improve the visual effect of the image, or to transform the image into a form more suitable for machine processing to facilitate data extraction or recognition.
  • Image preprocessing and smoothing is to eliminate random noise in the image.
  • Image preprocessing restoration is to correct image degradation caused by various reasons, so that the reconstructed or estimated image is as close as possible to the ideal image without degradation.
  • the training device determines the loss value corresponding to the training sample according to the image segmentation result of the training sample and the standard result corresponding to the training sample.
  • the standard result (also called the real result) corresponding to the training sample is the expected result of processing the training sample by using the convolutional neural network.
  • the training device may use the loss function corresponding to the image semantic segmentation task performed by the convolutional neural network to calculate the loss value corresponding to the training sample.
  • the loss function defines "how to compare the difference between the predicted value and the target value", that is, the loss function is an important equation used to measure the difference between the predicted value and the target value.
  • In this application, the image segmentation result of the training sample corresponds to the predicted value, and the standard result of the training sample corresponds to the target value. The higher the output value (loss) of the loss function, the greater the difference between the image segmentation result and the standard result, and training the convolutional neural network becomes a process of reducing this loss as much as possible.
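  • As one illustration of such a loss, the following sketch computes a per-pixel cross-entropy between a predicted segmentation map and a stand-in standard result; the embodiments are not limited to this particular loss function, and all shapes and data here are assumptions.

```python
import numpy as np

def segmentation_loss(pred_probs, target):
    """Mean per-pixel cross-entropy; pred_probs is (h, w, n_classes), target is (h, w) class indices."""
    h, w, _ = pred_probs.shape
    eps = 1e-12
    picked = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    return -np.mean(np.log(picked + eps))

pred = np.random.rand(32, 32, 8)
pred /= pred.sum(axis=-1, keepdims=True)          # normalise to per-pixel probabilities
target = np.random.randint(0, 8, size=(32, 32))   # stand-in standard result (ground truth)
print(segmentation_loss(pred, target))
```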
  • the training device judges whether the convolutional neural network converges.
  • the training device judging whether the convolutional neural network converges can be judging whether the number of times of updating the parameters of the convolutional neural network reaches the iteration threshold, that is, the number of executions of S406; it can also be judging whether the loss value of the convolutional neural network is lower than the loss threshold.
  • the loss value of the convolutional neural network is the error between the image segmentation result output by the convolutional neural network and the standard result calculated by the training device using the loss function of the convolutional neural network.
  • the loss function of the convolutional neural network differs with the training task performed by the training device.
  • the iteration threshold may be the number of iterations preset by the training device, such as 10,000 times, 20,000 times, and so on.
  • the loss threshold may be preset by the training device. If the difference between the image processing result output by the convolutional neural network and the standard result is less than the loss threshold, the training ends.
  • the training device uses the loss value corresponding to the training sample to update the parameters of the convolutional neural network through an optimization algorithm.
  • the training device can use the obtained loss value to update the parameters of the convolutional neural network through a back propagation algorithm.
  • the stochastic gradient descent algorithm is used to update the parameters of the convolutional neural network with the loss value corresponding to the training sample.
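  • The training steps described above can be sketched as follows, assuming a PyTorch implementation; SegmentationNet, train_loader, the per-pixel cross-entropy loss, and the threshold values are hypothetical placeholders rather than elements defined by this application:

```python
import torch
import torch.nn as nn

net = SegmentationNet()                        # hypothetical convolutional neural network
criterion = nn.CrossEntropyLoss()              # a common loss for image semantic segmentation
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

iteration_threshold = 20000                    # iteration threshold
loss_threshold = 0.05                          # loss threshold

for step, (sample, standard_result) in enumerate(train_loader):
    # Process the training sample to obtain its image segmentation result (S403).
    prediction = net(sample)

    # Determine the loss from the segmentation result and the standard result.
    loss = criterion(prediction, standard_result)

    # Update the parameters through back propagation and stochastic gradient descent (S406).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Judge whether the convolutional neural network converges.
    if step + 1 >= iteration_threshold or loss.item() < loss_threshold:
        break
```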
  • the method 400 may be specifically executed by the training device 120 shown in FIG. 1, and the input image (ie, training sample) in the method 400 may be training data maintained in the database 130 shown in FIG. 1.
  • image preprocessing may be performed on the training samples, and the training samples processed in S403 are training samples after image preprocessing.
  • the image preprocessing operation of the training samples can be performed in the training device 120, or it can be pre-executed by other functional modules before being input to the training device 120, that is, image preprocessing is performed on the training samples received or acquired from the database 130 first.
  • in S401, a training sample after image preprocessing is obtained as the input of the training device 120, and the training device 120 then executes S401 to S407.
  • the method 400 may be processed by a CPU, jointly processed by a CPU and a graphics processing unit (GPU), or processed without a GPU by other processors suitable for convolutional neural network calculations, which is not limited in this application.
  • in this way, a convolutional neural network that uses lower-resolution feature maps to obtain accurate image segmentation results can be trained, and the execution device uses the trained convolutional neural network for image semantic segmentation, which can greatly reduce the amount of calculation and the memory consumption.
  • FIG. 5 is a flowchart of a method for generating a transformation matrix from training samples according to an embodiment of the application, and the method may include:
  • the training device divides each training sample (that is, annotated image) in the training data into a plurality of sub-samples corresponding to the (A ⁇ B ⁇ N) three-dimensional matrix.
  • Each sub-sample can be understood as a small block in the training sample, that is, a part of the training sample.
  • Each training sample in the training data is a (H ⁇ W ⁇ N) three-dimensional matrix.
  • the training device can perform block operations on each training sample, that is, divide the (H ⁇ W ⁇ N) three-dimensional matrix corresponding to each training sample into multiple (A ⁇ B ⁇ N) sub-matrices (sub-samples).
  • FIG. 6 is a schematic diagram of a calculation process of a transformation matrix provided by an embodiment of the application.
  • As shown in FIG. 6, a (H×W×N) three-dimensional matrix, that is, a training sample, can be divided into multiple sub-samples, and each sub-sample corresponds to a (A×B×N) three-dimensional matrix.
  • A, B, and N are all integers greater than zero.
  • N is the number of semantically segmented categories of images in each training sample.
  • the training device arranges each sub-sample into a vector including (A ⁇ B ⁇ N) elements.
  • in the example of FIG. 6, where A and B are both 2, the training device rearranges each sub-sample (a small block in FIG. 6) to obtain a vector including (4×N) elements.
  • the training device can obtain a vector including (A ⁇ B ⁇ N) elements from a sub-sample.
  • the training device performs principal component analysis on all obtained vectors including (A ⁇ B ⁇ N) elements to obtain an intermediate matrix of (A ⁇ B ⁇ N) ⁇ (A ⁇ B ⁇ N).
  • the intermediate matrix is a two-dimensional matrix.
  • Principal component analysis is a statistical method that transforms a group of potentially correlated variables into a group of linearly uncorrelated variables through orthogonal transformation.
  • the transformed group of variables is called principal component.
  • the steps for the training device to implement S503 may be as follows: (1) The training device merges all obtained vectors including (A×B×N) elements into a Q×(A×B×N) two-dimensional matrix X'. (2) Normalize X' (to mean 0 and standard deviation 1) to obtain a normalized two-dimensional matrix X. (3) Perform singular value decomposition on the two-dimensional matrix X to obtain the (P×P) intermediate matrix, where P=A×B×N.
  • Q is the number of all vectors including (A ⁇ B ⁇ N) elements.
  • the formula for the singular value decomposition of X is as follows: X = U S V^T.
  • the columns of U and V are called the left-singular vectors and right-singular vectors of X, respectively, and the values on the diagonal of S are called the singular values of X.
  • the columns of U are the normalized eigenvectors of XX^T; the columns of V are the normalized eigenvectors of X^TX; the diagonal elements of S are the square roots of the eigenvalues of X^TX (or, equivalently, XX^T), and they are arranged in descending order.
  • Singular value decomposition (SVD) is a commonly used matrix decomposition method and will not be detailed here.
  • the training device extracts the first C principal components from the intermediate matrix to obtain the final transformation matrix.
  • the transformation matrix is a (C ⁇ (A ⁇ B ⁇ N)) two-dimensional matrix.
  • for example, when A and B are both 2, the transformation matrix is a (C×4N) two-dimensional matrix.
  • the transformation matrix may be a sub-matrix corresponding to the first C rows of the intermediate matrix.
  • the transformation matrix is generated from the training sample, so that a lower resolution feature map can be used to obtain an accurate image segmentation result.
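  • A minimal sketch of how the transformation matrix can be computed from annotated training samples (following the method of FIG. 5) is given below, assuming NumPy and assuming the annotated images are available as (H×W×N) arrays; the function name and the small numerical constant are illustrative assumptions:

```python
import numpy as np

def compute_transformation_matrix(samples, A, B, C):
    """samples: list of (H x W x N) annotated images (N segmentation categories).
       Returns the (C x (A*B*N)) transformation matrix."""
    vectors = []
    for sample in samples:
        H, W, N = sample.shape
        # Divide each training sample into (A x B x N) sub-samples (blocks).
        for i in range(0, H - A + 1, A):
            for j in range(0, W - B + 1, B):
                # Rearrange each sub-sample into a vector of A*B*N elements.
                vectors.append(sample[i:i + A, j:j + B, :].reshape(-1))

    # Merge all vectors into a Q x (A*B*N) matrix X' and normalize it
    # (mean 0, standard deviation 1 per column) to obtain X.
    X = np.stack(vectors, axis=0).astype(np.float64)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    # Principal component analysis via singular value decomposition X = U S V^T;
    # the rows of V^T span the principal directions (the P x P intermediate matrix).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep the sub-matrix formed by the first C rows as the transformation matrix.
    return Vt[:C]
```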
  • FIG. 7 is a flowchart of an image segmentation method provided by an embodiment of the application, and the method may include:
  • the image processing device obtains an input image and processing requirements.
  • the two-dimensional matrix of the input image on the channel is a matrix of (H ⁇ A) ⁇ (W ⁇ B).
  • the image processing device is the aforementioned execution device. H, W, A, and B are all integers greater than zero.
  • the input image may be captured by the image processing device using a camera, obtained from a client device or a database, or obtained by other means, which is not limited in this application.
  • the processing requirement may be input by the user or pre-configured in the image processing device. The processing requirement is used to indicate the target processing to be performed on the target feature map group (i.e., the image segmentation result) obtained by performing image segmentation on the input image, so as to obtain a target image.
  • the processing requirement may indicate that the area in the input image other than the area where the target object is located is to be adjusted, such as adjusting the depth of field or changing the background; it may indicate determining the area where the portrait in the input image is located and retaining only the color of that area; or it may indicate other processing of the input image, which is not limited in this application.
  • the target feature map group and the image segmentation result are the same concept.
  • the image processing device can determine, according to the target feature map group, the areas where different objects in the input image are located, such as the area where the portrait is located.
  • the image processing device performs multi-level feature extraction on the input image to obtain multiple feature maps.
  • the image processing device down-samples the multiple feature maps to obtain multiple feature maps with reference resolutions.
  • the reference resolution is lower than the resolution of the input image.
  • the image processing device merges the multiple feature maps with reference resolution to obtain at least one feature map group.
  • the image processing device uses the transformation matrix W to up-sample the feature map group to obtain the target feature map group.
  • the target feature map group and the input image have the same resolution. The transformation matrix W is obtained by modeling the training data of the image segmentation task, and one of the dimensions of the transformation matrix W is the same as the number of channels of the feature map group.
  • the above-mentioned target feature map group is the image segmentation result obtained by image segmentation of the input image.
  • the target feature map group is used to indicate the area where the target object in the input image is located.
  • the target object can be a portrait in the input image, a preset detection object (such as a cat or a dog), or an object in the input image selected by the user.
  • the image processing device performs target processing on the target feature map group according to the processing requirement to obtain a target image.
  • the foregoing target processing on the target feature map group may be to determine the areas where different objects in the input image are located according to the target feature map group, and then perform the target processing on a certain area. For example, after the image processing device determines the area where the subject is located in the input image according to the target feature map group, it keeps the foreground of the subject clear and blurs the background, so as to achieve the effect of a large SLR aperture. For another example, after a user takes an image with the image processing device, the user can select a portrait whose color needs to be preserved (that is, the processing requirement); the image processing device performs image semantic segmentation on the image, determines the area where the portrait is located based on the obtained image segmentation result, and retains only the color of that area.
  • for another example, the image processing device performs image semantic segmentation on the captured image and determines the area where the target object (such as a portrait) in the image is located according to the image segmentation result, so that the user can adjust the area of the image other than the area where the target object is located, for example by adjusting the depth of field or changing the background.
  • the steps of the image processing device performing S702-S703 may be as follows: the image processing device uses a convolutional neural network to process the input image to obtain K feature maps with the reference resolution.
  • FIG. 8 is a schematic diagram of using a convolutional neural network to process an input image to obtain K feature maps with the reference resolution according to an embodiment of the application.
  • in one implementation, the image processing device uses the convolutional neural network to obtain the K feature maps with the reference resolution as follows: convolve the input image to obtain the first feature map, convolve the first feature map to obtain the second feature map, and so on, until the (K-1)th feature map is convolved to obtain the Kth feature map; down-sample the first feature map to obtain a feature map with the reference resolution, down-sample the second feature map to obtain a feature map with the reference resolution, and so on, until the (K-1)th feature map is down-sampled to obtain a feature map with the reference resolution.
  • the Kth feature map is a reference resolution feature map; the resolution of the (K-1)th feature map is not higher than the resolution of the Kth feature map, and K is an integer greater than 1.
  • the feature maps in the dashed box in FIG. 8 are the K feature maps with the reference resolution obtained by processing the input image. In this implementation manner, the resolutions of the first feature map to the Kth feature map decrease successively.
  • the convolutional neural network may include multiple convolutional layers (corresponding to the convolution module) and down-sampling layer (corresponding to the down-sampling module), and the feature map output by the previous convolutional layer is the input of the next convolutional layer.
  • the image processing device can use a convolutional layer to perform a convolution operation on the input image to obtain a feature map, continue to use the next convolutional layer to perform a convolution operation on the obtained feature map to obtain a new feature map, and continue in this way until K feature maps with different resolutions are obtained.
  • the image processing device uses different convolution kernels to perform convolution operations on the same feature map to obtain different feature maps.
  • different feature maps in the K feature maps can be obtained from the same feature map. This application does not limit the way the image processing device obtains K feature maps from the input image.
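  • A minimal sketch of this multi-level feature extraction and down-sampling, assuming a PyTorch implementation (the number of layers, channel widths, stride-2 convolutions, and ReLU activations are illustrative assumptions, not prescriptions of this application):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Produce K feature maps and bring them all to the reference resolution."""

    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        # Each convolutional layer takes the previous layer's output; stride 2 lowers resolution.
        self.layers = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, image):
        feature_maps = []
        x = image
        for layer in self.layers:
            x = F.relu(layer(x))           # the first to the Kth feature map
            feature_maps.append(x)

        # The Kth (last) feature map already has the reference resolution;
        # down-sample the other feature maps to that same resolution.
        ref_size = feature_maps[-1].shape[-2:]
        return [f if f.shape[-2:] == ref_size else
                F.interpolate(f, size=ref_size, mode="bilinear", align_corners=False)
                for f in feature_maps]

# Example: K = 4 feature maps, each at the reference resolution.
maps = Encoder()(torch.randn(1, 3, 256, 256))
```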
  • in the lth convolutional layer, the image processing device convolves the input of the layer (that is, the feature maps output by the (l-1)th convolutional layer) with the convolution kernels, adds a bias, and then applies the activation function f to obtain the output feature map, as shown in formula (1):
  • x_j^l = f( Σ_{i∈M_j} ( x_i^{l-1} * k_{ij}^l ) + b_j^l )    (1)
  • in formula (1), M_j represents the set of input feature maps connected with the jth neuron, x_i^{l-1} represents the ith input feature map, k_{ij}^l represents the convolution kernel, b_j^l represents the bias, (*) represents the convolution operation, and Σ(·) represents the summation operation.
  • the activation function f can be a sigmoid function, a tanh function, a ReLU function or other types of activation functions, which are not limited in this application.
  • the image processing device can use down-sampling methods such as bilinear interpolation, nearest neighbor interpolation, median interpolation, or mean interpolation to down-sample each feature map output by a convolutional layer (that is, a convolution module), reducing the resolution of each feature map so that it is consistent with the resolution of the feature map output by the last convolutional layer.
  • taking bilinear interpolation for image reduction as an example, the process may be as follows: (1) Set the zoom factor (also called the scale factor) of the original image to t (0<t<1), that is, the original image is reduced by a factor of 1/t. (2) For a pixel P(x, y) of the new image, map it back to the corresponding position (x1, y1) in the original image, where x1=x/t and y1=y/t; since (x1, y1) generally does not fall exactly on an integer pixel position, the gray level of the pixel P'(x1, y1) is obtained by the linear interpolation algorithm and assigned to P(x, y). (3) Repeat step (2) until the values of all pixels of the new image are determined.
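  • The following NumPy sketch illustrates steps (1) to (3) above for a single-channel (grayscale) image using bilinear interpolation of the four neighbouring pixels; it is only one possible realization of the interpolation described here, and the function name is hypothetical:

```python
import numpy as np

def bilinear_downsample(img, t):
    """Reduce a 2-D grayscale image to t times its original size (0 < t < 1)."""
    h, w = img.shape
    new_h, new_w = int(h * t), int(w * t)
    out = np.empty((new_h, new_w), dtype=np.float64)

    for y in range(new_h):
        for x in range(new_w):
            # Step (2): map the new pixel P(x, y) back to P'(x1, y1) = (x/t, y/t).
            y1, x1 = min(y / t, h - 1.0), min(x / t, w - 1.0)
            y0, x0 = int(np.floor(y1)), int(np.floor(x1))
            y2, x2 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y1 - y0, x1 - x0
            # Interpolate the gray level of P'(x1, y1) from its four neighbours.
            out[y, x] = (img[y0, x0] * (1 - dy) * (1 - dx)
                         + img[y0, x2] * (1 - dy) * dx
                         + img[y2, x0] * dy * (1 - dx)
                         + img[y2, x2] * dy * dx)
    # Step (3): all pixels of the new image have been determined.
    return out
```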
  • the steps for the image processing device to implement S704-S705 can be as follows: the image processing device uses the convolutional neural network to calculate the product of each of the (H×W) one-dimensional matrices, each including C elements, and the transformation matrix, to obtain (H×W) one-dimensional matrices each including P elements; the image processing device then uses the convolutional neural network to perform feature arrangement on the (H×W) one-dimensional matrices each including P elements to obtain the target feature map group.
  • Each of the (H×W) one-dimensional matrices including C elements consists of the elements at the same position in each of the C (H×W) two-dimensional matrices corresponding to the K feature maps with the reference resolution.
  • the C (H ⁇ W) two-dimensional matrices correspond to one (H ⁇ W ⁇ C) three-dimensional matrix.
  • FIG. 9 is a schematic diagram of an upsampling process provided by an embodiment of the application.
  • the (H ⁇ W ⁇ C) three-dimensional matrix in FIG. 9 is the three-dimensional matrix corresponding to the C (H ⁇ W) two-dimensional matrices. As shown in Figure 9, the position of each element in the (H ⁇ W ⁇ C) three-dimensional matrix corresponds to a one-dimensional matrix including C elements in the channel dimension.
  • the black columnar area in the three-dimensional matrix corresponds to a one-dimensional matrix consisting of C elements.
  • the C (H×W) two-dimensional matrices correspond to (H×W) one-dimensional matrices each including C elements, and each one-dimensional matrix of C elements is multiplied by the transformation matrix to obtain a one-dimensional matrix including P elements.
  • H, W, C, N, P, M, K, A, and B are all integers greater than zero.
  • the image processing device uses a convolutional neural network to perform feature arrangement on the (H ⁇ W) one-dimensional matrices each including P elements to obtain the image segmentation results in the following manner:
  • the image processing device uses the convolutional neural network to determine, from each one-dimensional matrix including P elements, (A×B) one-dimensional matrices each including N elements; the (A×B) one-dimensional matrices each including N elements obtained from one one-dimensional matrix including P elements form one (A×B×N) three-dimensional matrix; and each (A×B×N) three-dimensional matrix is used as a sub-matrix included in the image segmentation result.
  • each one-dimensional matrix including P elements can obtain a (A ⁇ B ⁇ N) three-dimensional matrix after feature arrangement, which is used as a sub-matrix included in the image segmentation result.
  • the ((H ⁇ A) ⁇ (W ⁇ B) ⁇ N) three-dimensional matrix in Fig. 9 is the image segmentation result.
  • the image processing device may use each one-dimensional matrix including P elements to obtain a (A ⁇ B ⁇ N) three-dimensional matrix as part of the image segmentation result.
  • the image processing device can sequentially process each one-dimensional matrix including P elements to obtain a (A ⁇ B ⁇ N) three-dimensional matrix, and use it as a sub-matrix included in the image segmentation result to finally obtain the image segmentation result.
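  • A minimal NumPy sketch of this up-sampling by the transformation matrix followed by feature arrangement; the channels-last array layout and the function name are illustrative assumptions:

```python
import numpy as np

def upsample_and_arrange(fused, W_matrix, A, B, N):
    """fused:    (H, W, C) fusion feature map at the reference resolution.
       W_matrix: (C, A*B*N) transformation matrix.
       Returns the (H*A, W*B, N) image segmentation result."""
    H, W, C = fused.shape
    result = np.empty((H * A, W * B, N), dtype=np.float64)

    for i in range(H):
        for j in range(W):
            # Multiply the one-dimensional matrix of C elements at this position
            # by the transformation matrix to obtain P = A*B*N elements.
            p_vector = fused[i, j, :] @ W_matrix
            # Feature arrangement: reshape the P elements into an (A x B x N)
            # sub-matrix and place it at the corresponding output position.
            result[i * A:(i + 1) * A, j * B:(j + 1) * B, :] = p_vector.reshape(A, B, N)
    return result
```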
  • in this way, the image processing device uses the convolutional neural network to perform convolution operations and down-sampling on the input image to obtain multiple lower-resolution feature maps, and uses these lower-resolution feature maps to perform feature arrangement to obtain the image segmentation result.
  • the steps of the image processing apparatus to implement S704 may be as follows: the image processing apparatus splices the K feature maps with the reference resolution in the channel dimension to obtain a fusion feature map, that is, a (H×W×C) three-dimensional matrix.
  • the image processing device respectively calculates the product of the one-dimensional matrix corresponding to each element position in the fusion feature map and the transformation matrix to obtain the (H ⁇ W) one-dimensional matrices each including P elements.
  • the one-dimensional matrix corresponding to an element position in the fusion feature map includes elements at the same element position in each of the C (H ⁇ W) two-dimensional matrices.
  • the fusion feature map is a (H ⁇ W ⁇ C) three-dimensional matrix and corresponds to C (H ⁇ W) two-dimensional matrices.
  • the fusion feature map is the at least one feature map group obtained by fusing the multiple feature maps with the reference resolution in step S704.
  • FIG. 10 is a schematic diagram of a feature map fusion process and an upsampling process provided by an embodiment of this application.
  • in FIG. 10, the feature maps in the rectangular frame formed by the dashed line are the above K feature maps with the reference resolution;
  • the (H×W×C) three-dimensional matrix is the fusion feature map obtained by fusing the K feature maps with the reference resolution;
  • the three-dimensional matrix of ((H ⁇ A) ⁇ (W ⁇ B) ⁇ N) is the image segmentation result obtained by up-sampling the fusion feature map.
  • the up-sampling in FIG. 10 is the up-sampling in FIG. 9.
  • a hidden layer of the convolutional neural network (corresponding to the feature map fusion module) is used to fuse the K feature maps with the reference resolution to obtain the fusion feature map.
  • a hidden layer of the convolutional neural network (corresponding to the up-sampling module) is used to up-sample the fused feature map to obtain the image segmentation result.
  • the image processing device may splice the K feature maps with the reference resolution along the channel dimension.
  • any feature map can be described by the following dimensions: n*Channel*H*W, where H and W represent the length and width of the feature map respectively, n represents the number of images input to the entire convolutional neural network, and Channel represents the number of channels.
  • the image processing device may splice two or more feature maps in the Channel dimension (i.e., the channel dimension) or the n dimension.
  • the function of the feature map fusion module is to splice two or more feature maps in the Channel dimension (i.e., the channel dimension) or the n dimension.
  • when feature maps are spliced in the channel dimension, their channel dimensions can be different, but the other dimensions must be the same (that is, n, H, and W are the same).
  • for example, when n is set to 1, the operation of the image processing device is simply to concatenate the Channel 1 channels of feature map 1 and the Channel 2 channels of feature map 2, and the dimensions of the resulting fusion feature map are n*(Channel 1+Channel 2)*H*W.
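  • A short sketch of this channel-dimension splicing, assuming PyTorch tensors in the n*Channel*H*W layout described above (the concrete shapes are illustrative):

```python
import torch

# Two feature maps with the same n, H and W but different channel counts.
feature_map_1 = torch.randn(1, 64, 32, 32)     # n * Channel 1 * H * W
feature_map_2 = torch.randn(1, 128, 32, 32)    # n * Channel 2 * H * W

# Splice along the channel dimension (dim=1); the result has
# dimensions n * (Channel 1 + Channel 2) * H * W.
fusion_feature_map = torch.cat([feature_map_1, feature_map_2], dim=1)
assert fusion_feature_map.shape == (1, 64 + 128, 32, 32)
```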
  • FIG. 12 is another image segmentation method provided by an embodiment of this application, and the method may include:
  • the image processing device acquires an input image.
  • the image processing device uses each convolutional layer of the convolutional neural network to perform a convolution operation on the input image to obtain K feature maps.
  • the K feature maps correspond to the aforementioned first feature map to Kth feature map.
  • the image processing device can use the first convolutional layer to perform a convolution operation on the input image to obtain a feature map, continue to use the second convolutional layer to perform a convolution operation on the obtained feature map to obtain a new feature map, and continue in this way until the specified number of convolution operations is reached, obtaining K feature maps with different resolutions. That is to say, the feature map output by the previous convolutional layer is the input of the subsequent convolutional layer, and the feature maps output by the convolutional layers of the convolutional neural network constitute the K feature maps.
  • for the convolution operation performed by each convolutional layer, refer to formula (1).
  • the image processing device down-samples (K-1) feature maps among the K feature maps to obtain (K-1) feature maps with the reference resolution.
  • the (K-1) feature maps are the feature maps among the K feature maps other than the feature map output by the last convolutional layer of the convolutional neural network. Refer to FIG. 8 for the down-sampling process.
  • the image processing device fuses the (K-1) reference resolution feature maps and the feature maps output by the last layer of the convolutional neural network to obtain a fused feature map.
  • S1204 corresponds to the fusion operation in FIG. 10 and FIG. 11.
  • the image processing device up-samples the fusion feature map to obtain an image segmentation result.
  • S1205 corresponds to the up-sampling operation in FIG. 10.
  • the hidden layer of the convolutional neural network can implement the down-sampling operation in S1203, the fusion operation in S1204, and the up-sampling operation in S1205.
  • the image processing device can perform further processing according to the image segmentation result.
  • the image processing device may be a mobile terminal, such as a mobile phone.
  • the user uses the camera function of the mobile terminal to perform real-time image semantic segmentation on the captured image to obtain the image segmentation result.
  • the mobile terminal determines the area where the subject is located in the image according to the image segmentation result, keeps the foreground of the subject clear and blurs the background, realizing the effect of a large SLR aperture.
  • after a user takes an image with a mobile terminal, the user can select a portrait whose color needs to be preserved.
  • the mobile terminal performs image semantic segmentation on the image, determines the area where the portrait is located according to the obtained image segmentation result, and then retains only the color of the area in the image where the portrait is located. For another example, after a user uses a mobile terminal to capture an image, the mobile terminal performs image semantic segmentation on the captured image, and determines the area where the target object (such as a portrait) in the image is located according to the image segmentation result, so that the user can adjust the area of the image other than the area where the target object is located, such as adjusting the depth of field or changing the background. For another example, the user turns on the video shooting function of the mobile terminal, and during the video shooting process, the mobile terminal performs image semantic segmentation in real time.
  • after the portrait area is determined according to the image segmentation result, only the color of the portrait area is retained, so that the portrait retains its color in the video.
  • the user turns on the video shooting function of the mobile terminal, and the mobile terminal performs image semantic segmentation in real time.
  • the mobile terminal segments all the portraits according to the image segmentation result; the user can tap to select a target portrait to keep clear, and the mobile terminal blurs all parts of the image except the area where the target portrait is located to achieve the effect of a movie mode.
  • an automatic driving device (such as a car) performs image semantic segmentation on the collected image in real time.
  • object detection is performed on each segmented object so as to identify pedestrians, obstacles, and vehicles more accurately. It can be understood that the image processing apparatus can use the image segmentation result to accurately determine the area where each object in the input image is located, so as to perform different processing on different objects or different areas in the image.
  • FIG. 13 is a schematic structural diagram of an image processing apparatus provided by this application. As shown in FIG. 13, the image processing apparatus 1300 may include:
  • the obtaining unit 1301 is used to obtain an input image and a processing requirement;
  • the two-dimensional matrix of the input image on the channel is a matrix of (H ⁇ A) ⁇ (W ⁇ B);
  • the processing requirement is used to indicate that target processing is to be performed on the target feature map group (i.e., the image segmentation result) obtained by performing image segmentation on the input image, so as to obtain the target image;
  • the processing unit 1302 is configured to: perform multi-level feature extraction on the input image to obtain multiple feature maps; down-sample the multiple feature maps to obtain multiple feature maps with the reference resolution, where the reference resolution is lower than the resolution of the input image; fuse the multiple feature maps with the reference resolution to obtain at least one feature map group; up-sample the feature map group by using the transformation matrix W to obtain the target feature map group, where the target feature map group and the input image have the same resolution, the transformation matrix W is obtained by modeling the training data of the image segmentation task, and one of the dimensions of the transformation matrix W is the same as the number of channels of the feature map group; and perform the target processing on the target feature map group according to the processing requirement to obtain a target image.
  • the function of the acquiring unit 1301 may be implemented by a camera or an I/O interface in the image processing device.
  • the function of the processing unit 1302 may be implemented by the CPU in the image processing device, or may be implemented by the CPU in cooperation with other processors (for example, NPU, TPU, GPU, etc.).
  • the processing unit 1302 may include:
  • the convolution module 1401 is configured to perform a convolution operation on the input image and/or feature map to obtain a feature map, and output the obtained feature map to the next convolution layer;
  • the down-sampling module 1402 is used to down-sample the feature maps output by each convolution module to obtain a feature map with a reference resolution;
  • the feature map fusion module 1403 is used to fuse feature maps of various reference resolutions to obtain a fusion feature map
  • the up-sampling module 1404 is used to perform feature arrangement on the fusion feature map to obtain the image segmentation result.
  • the convolution module 1401 is used to implement the convolution operation of each convolution layer in the convolutional neural network. Refer to the convolution operation in FIG. 8.
  • the image processing device includes a convolution module, and the convolution module implements the convolution operation of each convolution layer.
  • the image processing device includes K convolution modules, and each convolution module is used to implement a convolution operation of a convolution layer.
  • the down-sampling module 1402 is used to implement the down-sampling in FIG. 8, that is, down-sampling each feature map except the feature map output by the last layer of the convolutional layer to obtain a feature map with a reference resolution.
  • the feature map fusion module 1403 is used to implement the feature map fusion operation in FIG. 10 and FIG. 11.
  • the up-sampling module 1404 is used to implement the up-sampling operation in FIG. 10.
  • the convolution module 1401, the down-sampling module 1402, the feature map fusion module 1403, and the up-sampling module 1404 can all be implemented by software, or all can be implemented by hardware, and one part can be implemented by software and the other part can be implemented by hardware.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the above-mentioned computer program product includes one or more computer instructions.
  • the foregoing computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the above-mentioned usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive (SSD).
  • the image processing device runs the software code stored in the memory of the image processing device to realize the functions of the convolution module 1401, the down-sampling module 1402, the feature map fusion module 1403, and the up-sampling module 1404, that is, the function of the processing unit 1302.
  • the image processing device runs the hardware code solidified in the processor of the image processing device to implement the aforementioned image segmentation method.
  • the architecture of encoding and then decoding the image is a commonly used image processing method in computer vision tasks, and many computer vision technologies use this framework.
  • the image processing device also uses an architecture that encodes and then decodes an image, that is, a convolutional neural network with an encoder-decoder architecture is used to process image semantic segmentation tasks.
  • the convolutional neural network can be divided into two parts: the encoder and the decoder.
  • the encoder includes the convolution module 1401 and the down-sampling module 1402 in FIG. 14, and the decoder includes the feature map fusion module 1403 and the up-sampling module 1404 in FIG. 14.
  • the solution provided by this application has at least two specific advantages:
  • in the prior art, the decoder module can only select high-resolution low-layer feature maps for feature map aggregation.
  • in the prior art, the high-level low-resolution feature map is up-sampled and then merged with the low-level high-resolution feature map.
  • in this application, the low-level high-resolution feature map is down-sampled and then directly merged with the high-level low-resolution feature map, as shown in FIG. 9 and FIG. 10.
  • in this application, a data-dependent up-sampling module is used to retain the original structure information of the input picture, so that the segmentation accuracy is improved.
  • in the prior art, the decoder module selects high-resolution low-layer feature maps for feature map fusion. Since the calculation amount of the convolutional neural network depends on the resolution of the feature maps, using low-layer feature maps for feature map fusion significantly increases the calculation amount of the convolutional neural network. Therefore, the existing technical solutions involve a large amount of calculation and cannot run in real time on a mobile terminal. In this solution, lower-resolution feature maps are selected for feature map fusion, which ensures the improvement of segmentation accuracy while greatly reducing the amount of calculation and memory consumption.
  • FIG. 15 is a schematic structural diagram of a training device for a convolutional neural network provided by this application.
  • the training device 1500 of the convolutional neural network may include:
  • the obtaining unit 1501 is configured to obtain the above transformation matrix
  • the processing unit 1502 is used to process the training sample using the convolutional neural network to obtain the image segmentation result of the training sample; determine the loss corresponding to the training sample according to the image segmentation result of the training sample and the standard result corresponding to the training sample; and use the loss corresponding to the training sample to update the parameters of the convolutional neural network through an optimization algorithm.
  • the training sample includes at least one of the above-mentioned N labeled images; the standard result is an expected result obtained by using the neural network to process the training sample.
  • the training device uses training samples to train the convolutional neural network, which can quickly train a convolutional neural network that can be used to process image semantic segmentation tasks.
  • FIG. 16 is a schematic diagram of the hardware structure of a training device for a convolutional neural network provided by an embodiment of the present application.
  • the training device 1600 of the convolutional neural network shown in FIG. 16 includes a memory 1601, a processor 1602, a communication interface 1603, and a bus 1604.
  • the memory 1601, the processor 1602, and the communication interface 1603 implement communication connections between each other through the bus 1604.
  • the memory 1601 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1601 may store a program. When the program stored in the memory 1601 is executed by the processor 1602, the processor 1602 and the communication interface 1603 are used to execute each step of the training method of the convolutional neural network in the embodiment of the present application.
  • the processor 1602 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a GPU, or one or more integrated circuits for executing related programs, so as to realize the functions required by the units in the training device of the convolutional neural network of the embodiment of the present application, or to execute the training method of the convolutional neural network of the method embodiment of the present application.
  • the processor 1602 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the training method of the convolutional neural network of the present application can be completed by the integrated logic circuit of hardware in the processor 1602 or instructions in the form of software.
  • the aforementioned processor 1602 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory 1601, and the processor 1602 reads the information in the memory 1601 and, in combination with its hardware, completes the functions required by the units included in the training device of the convolutional neural network of the embodiment of this application, or executes the training method of the convolutional neural network of the method embodiment of this application.
  • the communication interface 1603 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 1600 and other devices or communication networks.
  • training data (such as the training samples described in Embodiment 1 of the present application) can be obtained through the communication interface 1603.
  • the bus 1604 may include a path for transferring information between various components of the device 1600 (for example, the memory 1601, the processor 1602, and the communication interface 1603).
  • the acquisition unit 1501 in the training device 1500 of the convolutional neural network is equivalent to the communication interface 1603 in the training device 1600 of the convolutional neural network, and the processing unit 1502 may be equivalent to the processor 1602.
  • FIG. 17 is a schematic diagram of the hardware structure of an image processing apparatus provided by an embodiment of the present application.
  • the image processing apparatus 1700 shown in FIG. 17 includes a memory 1701, a processor 1702, a communication interface 1703, and a bus 1704.
  • the memory 1701, the processor 1702, and the communication interface 1703 implement communication connections between each other through the bus 1704.
  • the memory 1701 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory.
  • the memory 1701 may store a program. When the program stored in the memory 1701 is executed by the processor 1702, the processor 1702 and the communication interface 1703 are used to execute each step of the image segmentation method of the embodiment of the present application.
  • the processor 1702 may be a general-purpose central processing unit, a microprocessor, an application specific integrated circuit, a graphics processing unit (GPU), or one or more integrated circuits for executing related programs, so as to realize the functions required by the units in the image processing apparatus of the embodiment of the present application, or to execute the image segmentation method of the method embodiment of the present application.
  • the processor 1702 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the image segmentation method of the present application can be completed by hardware integrated logic circuits in the processor 1702 or instructions in the form of software.
  • the aforementioned processor 1702 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory 1701, and the processor 1702 reads the information in the memory 1701 and, in combination with its hardware, completes the functions required by the units included in the image processing apparatus of the embodiment of the present application, or executes the image segmentation method of the method embodiment of the present application.
  • the communication interface 1703 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 1700 and other devices or communication networks. For example, training data (such as the input image described in the second embodiment of the present application) can be obtained through the communication interface 1703.
  • the bus 1704 may include a path for transferring information between various components of the device 1700 (for example, the memory 1701, the processor 1702, and the communication interface 1703).
  • the acquiring unit 1301 in the image processing device 1300 is equivalent to the communication interface 1703 in the image processing device 1700; the processing unit 1302 in the image processing device 1300 may be equivalent to the processor 1702.
  • although the devices 1600 and 1700 shown in FIG. 16 and FIG. 17 only show a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the devices 1600 and 1700 also include other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the apparatuses 1600 and 1700 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatuses 1600 and 1700 may also include only the devices necessary for implementing the embodiments of the present application, and not necessarily all the devices shown in FIG. 16 or FIG. 17.
  • the apparatus 1600 is equivalent to the training device 120 in FIG. 1.
  • the apparatus 1700 is equivalent to the execution device 110 in FIG. 1.
  • a person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

This application discloses an image segmentation method in the field of artificial intelligence. The method includes: obtaining an input image and a processing requirement; performing multi-level feature extraction on the input image to obtain multiple feature maps; down-sampling the multiple feature maps to obtain multiple feature maps with a reference resolution, where the reference resolution is lower than the resolution of the input image; fusing the multiple feature maps with the reference resolution to obtain at least one feature map group; up-sampling the feature map group by using a transformation matrix W to obtain a target feature map group; and performing target processing on the target feature map group according to the processing requirement to obtain a target image. In this application, the feature map group obtained by fusing lower-resolution feature maps is up-sampled by using the transformation matrix to obtain the target feature map group, which can effectively reduce memory usage and the amount of calculation while achieving high image segmentation accuracy.

Description

图像分割方法和图像处理装置
本申请要求于2019年03月01日提交中国国家知识产权局、申请号为201910157603.5、申请名称为“图像分割方法和图像处理装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机视觉领域,尤其涉及一种图像分割方法和图像处理装置。
背景技术
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断和军事等领域中各种智能系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成像系统代替视觉器官获取输入图像,再由计算机来代替大脑对这些输入图像完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。
计算机视觉技术中关于图像理解的重要一环是图像语义分割(semantic segmentation)。越来越多的应用场景需要采用精确且高效的图像语义分割技术,例如拍照、视频播放、自动驾驶、室内导航、甚至虚拟现实与增强现实等应用场景。图像语义分割是将输入图像中需要处理的部分精确地分割出来,进而对分割出的不同部分执行相应的处理。举例来说,用户可以利用移动终端拍摄图像,然后将拍摄的图像中的人像区域自动分割出来,以便于添加特效,例如调整景深、换背景、仅保留人像区域的颜色、对人像区域之外的图像区域进行虚化等。
目前,利用卷积神经网络(convolutional neuron network,CNN)来处理图像语义分割任务是业界比较普遍的方案。在该方案中,利用CNN先对输入图像进行编码(下采样),再进行解码(上采样)和融合操作,得到最终的图像分割结果。然而,在该方案中,需要使用分辨率较高的特征图进行融合,计算量高、内存开销大。分辨率越高的特征图包含的参数越多。因此,需要研究计算量较少以及内存开销较少的图像语义分割方案。
发明内容
本申请实施例提供了一种图像分割方法和图像处理装置,对高层特征图进行信息融合,可以提高分割精度提升,并减少计算量和内存开销。
第一方面,本申请实施例提供了一种图像分割方法,该方法包括:获得输入图像和处理需求,所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组进行目 标处理;对所述输入图像进行多层次的特征提取,得到多个特征图;对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图;所述参考分辨率低于所述输入图像的分辨率;对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组;利用变换矩阵W对所述特征图组进行上采样,得到所述目标特征图组,所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同;根据所述处理需求,对所述目标特征图组进行所述目标处理,得到目标图像。
本申请实施例中,使用变换矩阵对较低分辨率的特征图融合得到的特征图组进行上采样得到目标特征图组,能够有效减少内存占用以及计算量,图像分割的精度较高。
在一个可选的实现方式中,所述利用变换矩阵W对所述特征图组进行上采样,得到目标特征图组包括:分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵;所述(H×W)个均包括C个元素的一维矩阵中任一矩阵包括的元素为所述特征图组包括的C个(H×W)的二维矩阵中的每个二维矩阵中同一位置的元素,H和W为所述特征图组的两个维度,C为所述特征图组的通道数;所述变换矩阵为由所述训练数据包括的M个标注图像得到的(C×P)的二维矩阵,P=A×B×N,N为所述M个标注图像中的图像语义被分割的类别数;分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组;所述目标特征图组包括的至少一个(A×B×N)的子矩阵为由所述(H×W)个均包括P个元素的一维矩阵中的一个矩阵得到的;其中,H、W、C、N、P、M、A以及B均为大于0的整数。
在该实现方式中,使用变换矩阵对较低分辨率的特征图融合得到的特征图组进行上采样,可以快速得到输入图像的图像分割结果,操作简单。
在一个可选的实现方式中,所述分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组包括:根据所述(H×W)个均包括P个元素的一维矩阵中的任一矩阵,确定(A×B)个均包括N个元素的一维矩阵;将由所述(A×B)个均包括N个元素的一维矩阵得到的一个(A×B×N)的三维矩阵作为所述目标特征图组包括的一个子矩阵。
在该实现方式中,利用(H×W)个均包括P个元素的一维矩阵进行特征排列得到目标特征图组,实现简单。
在一个可选的实现方式中,所述M个标注图像中任一标注图像为一个(H×W×N)的三维矩阵,所述变换矩阵W为采用如下操作得到的:分别获取所述M个标注图像中的每个标注图像对应的至少一个(A×B×N)的子矩阵以得到多个(A×B×N)的子矩阵;由所述多个(A×B×N)的子矩阵得到多个包括P个元素的向量;其中,由所述多个(A×B×N)的子矩阵中的每一个子矩阵得到一个包括P个元素的向量;将所述多个包括P个元素的向量进行主成分分析以得到一个(P×P)的二维矩阵;将所述(P×P)的二维矩阵包括的一个(C×P)的子矩阵作为所述变换矩阵W。
在该实现方式中,使用标注图像得到变换矩阵,以便于利用该变换矩阵对对较低分辨率的特征图融合得到的特征图组进行上采样。
在一个可选的实现方式中,所述对所述输入图像进行多层次的特征提取,得到多个特 征图包括:对所述输入图像进行卷积操作得到第一特征图,对第(K-1)特征图进行卷积操作得到第K特征图;所述第K特征图为一个所述参考分辨率的特征图,所述第(K-1)特征图的分辨率不高于所述第K特征图的分辨率,K为大于1的整数,所述多个特征图包括K个特征图;所述对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图包括:对所述第一特征图进行下采样得到一个所述参考分辨率的特征图,以及对所述第(K-1)特征图进行下采样得到一个所述参考分辨率的特征图。
在该实现方式中,可以快速得到多个参考分辨率的特征图,实现简单。
在一个可选的实现方式中,所述对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组包括:将所述多个具有参考分辨率的特征图在通道维度上进行拼接以得到所述至少一个特征图组;所述特征图组为一个(H×W×C)的三维矩阵且对应所述C个(H×W)的二维矩阵;所述分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵包括:分别计算所述特征图组中的每个元素位置对应的一维矩阵与所述变换矩阵的乘积,得到所述(H×W)个均包括P个元素的一维矩阵;所述特征图组中的一个元素位置对应的一维矩阵包括的元素为所述C个(H×W)的二维矩阵中的每个二维矩阵中同一元素位置的元素。
在该实现方式中,分别计算特征图组中的每个元素位置对应的一维矩阵与变换矩阵的乘积,得到(H×W)个均包括P个元素的一维矩阵,以便于利用该(H×W)个均包括P个元素的一维矩阵进行特征排列以得到目标特征图组,操作简单。
在一个可选的实现方式中,所述方法还包括:获得所述变换矩阵W;使用卷积神经网络对训练样本做处理,得到所述训练样本的图像分割结果;所述训练样本包含于所述训练数据;根据所述训练样本的图像分割结果和所述训练样本对应的标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述卷积神经网络处理所述训练样本期望得到的结果;利用所述训练样本对应的损失,通过优化算法更新所述卷积神经网络的参数;所述对所述输入图像进行多层次的特征提取,得到多个特征图包括:将所述输入图像输入到所述卷积神经网络进行多层次的特征提取,得到所述多个特征图。
在该实现方式中,训练得到卷积神经网络以便于利用该卷积神经网络对输入图像进行多层次的特征提取,得到多个特征图。
第二方面,本申请实施例提供了一种图像处理装置,该图像处理装置包括:获取单元,用于获得输入图像和处理需求,所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组进行目标处理;处理单元,用于对所述输入图像进行多层次的特征提取,得到多个特征图;对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图;所述参考分辨率低于所述输入图像的分辨率;对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组;利用变换矩阵W对所述特征图组进行上采样,得到所述目标特征图组,所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同;根据所述处理需求,对所述目标特征图组进行所述目标处理,得到目标图像。
在一个可选的实现方式中,所述处理单元,具体用于分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵; 所述(H×W)个均包括C个元素的一维矩阵中任一矩阵包括的元素为所述特征图组包括的C个(H×W)的二维矩阵中的每个二维矩阵中同一位置的元素,H和W为所述特征图组的两个维度,C为所述特征图组的通道数;所述变换矩阵为由所述训练数据包括的M个标注图像得到的(C×P)的二维矩阵,P=A×B×N,N为所述M个标注图像中的图像语义被分割的类别数;分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组;所述目标特征图组包括的至少一个(A×B×N)的子矩阵为由所述(H×W)个均包括P个元素的一维矩阵中的一个矩阵得到的;其中,H、W、C、N、P、M、A以及B均为大于0的整数。
在一个可选的实现方式中,所述处理单元,具体用于根据所述(H×W)个均包括P个元素的一维矩阵中的任一矩阵,确定(A×B)个均包括N个元素的一维矩阵;将由所述(A×B)个均包括N个元素的一维矩阵得到的一个(A×B×N)的三维矩阵作为所述目标特征图组包括的一个子矩阵。
在一个可选的实现方式中,所述M个标注图像中任一标注图像为一个(H×W×N)的三维矩阵;所述处理单元,用于分别获取所述M个标注图像中的每个标注图像对应的至少一个(A×B×N)的子矩阵以得到多个(A×B×N)的子矩阵;由所述多个(A×B×N)的子矩阵得到多个包括P个元素的向量;其中,由所述多个(A×B×N)的子矩阵中的每一个子矩阵得到一个包括P个元素的向量;将所述多个包括P个元素的向量进行主成分分析以得到一个(P×P)的二维矩阵;将所述(P×P)的二维矩阵包括的一个(C×P)的子矩阵作为所述变换矩阵W。
在一个可选的实现方式中,所述处理单元,具体用于对所述输入图像进行卷积操作得到第一特征图,对第(K-1)特征图进行卷积操作得到第K特征图;所述第K特征图为一个所述参考分辨率的特征图,所述第(K-1)特征图的分辨率不高于所述第K特征图的分辨率,K为大于1的整数,所述多个特征图包括K个特征图;对所述第一特征图进行下采样得到一个所述参考分辨率的特征图,以及对所述第(K-1)特征图进行下采样得到一个所述参考分辨率的特征图。
在一个可选的实现方式中,所述处理单元,具体用于将所述多个具有参考分辨率的特征图在通道维度上进行拼接以得到所述至少一个特征图组;所述特征图组为一个(H×W×C)的三维矩阵且对应所述C个(H×W)的二维矩阵;分别计算所述特征图组中的每个元素位置对应的一维矩阵与所述变换矩阵的乘积,得到所述(H×W)个均包括P个元素的一维矩阵;所述特征图组中的一个元素位置对应的一维矩阵包括的元素为所述C个(H×W)的二维矩阵中的每个二维矩阵中同一元素位置的元素。
在一个可选的实现方式中,所述处理单元,还用于获得所述变换矩阵W;使用卷积神经网络对训练样本做处理,得到所述训练样本的图像分割结果;所述训练样本包含于所述训练数据;根据所述训练样本的图像分割结果和所述训练样本对应的标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述卷积神经网络处理所述训练样本期望得到的结果;利用所述训练样本对应的损失,通过优化算法更新所述卷积神经网络的参数;所述处理单元,具体用于将所述输入图像输入到所述卷积神经网络进行多层次的特征提取,得到所述多个特征图。
第三方面,本申请实施例提供了另一种图像处理装置,包括处理器和存储器,所述处理器和存储器相互连接,其中,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行上述第一方面的方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行上述第一方面的方法。
附图说明
图1为本申请实施例提供的一种系统架构示意图;
图2为本申请实施例提供的一种卷积神经网络的示意图;
图3为本申请实施例提供的一种芯片硬件结构;
图4为本申请实施例提供的一种卷积神经网络的训练方法流程图;
图5为本申请实施例提供的一种由训练样本生成变换矩阵的方法流程图;
图6为本申请实施例提供的一种变换矩阵的计算过程示意图;
图7为本申请实施例提供的一种图像分割方法流程图;
图8为本申请实施例提供的一种利用卷积神经网络处理输入图像以得到K个参考分辨率的特征图的示意图;
图9为本申请实施例提供的一种上采样过程示意图;
图10为本申请实施例提供的一种特征图融合过程以及上采样过程的示意图;
图11为本申请实施例提供的一种特征图融合过程示意图;
图12为本申请实施例提供的另一种图像分割方法流程图;
图13为本申请提供的一种图像处理装置的结构示意图;
图14为本申请提供的一种处理单元的结构示意图;
图15为本申请提供的一种卷积神经网络的训练装置的结构示意图;
图16是本申请实施例提供的一种卷积神经网络的训练装置的硬件结构示意图;
图17是本申请实施例提供的图像处理装置的硬件结构示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
图像语义分割是将待处理图像中需要处理的部分精确地分割出来,进而对分割出的不同部分执行相应的处理。本申请实施例提供的图像分割方法能够应用在拍照、视频拍摄、自动驾驶等场景。下面分别对申请实施例提供的图像分割方法在拍照场景、视频拍摄场景以及自动驾驶场景中的应用进行简单的介绍。
拍照场景:用户利用移动终端(例如手机)拍摄图像,该移动终端将拍摄的图像中的目标对象(例如人像)自动分割出来,以便于添加特效,例如调整景深、换背景、仅保留该目标对象所处区域的颜色、对该目标对象所处区域之外的图像区域进行虚化等。举例来说,用户利用移动终端的相机功能对采集的图像进行实时图像语义分割,让被拍摄对象的 前景清晰,背景虚化,实现单反大光圈的效果。又举例来说,用户利用移动终端拍摄图像后,可以选择需要保留颜色的人像,该移动终端仅保留图像中该人像所处区域的颜色。又举例来说,用户利用移动终端拍摄图像后,该移动终端将拍摄的图像中的目标对象(例如人像)自动分割出来,以便该用户对该图像中除该目标对象所处的区域之外的区域进行调整,例如调整景深、换背景等。
视频拍摄场景1:用户开启移动终端的视频拍摄功能,在拍摄视频的过程中,实时进行图像语义分割,分割出人像区域后,仅保留该人像区域的颜色,实现视频人像留色。
视频拍摄场景2:用户开启移动终端的视频拍摄功能,在被拍摄者有多人的情况下,对所有人像进行分割,用户可以任意点选需要保留清晰的目标人像,该移动终端将该图像中除该目标人像所处区域之外的部分全部虚化,以实现电影模式的效果。
自动驾驶场景:自动驾驶装置(例如汽车)对采集的图像实时进行图像语义分割,在分割出该图像中各个对象后,对分割出的各对象进行物体检测,以便于更准确的识别出行人、障碍物以及车辆等。
在上述场景中,移动终端使用更低分辨率的特征图进行特征图融合,即保证了图像语义分割精度的提升,同时大幅减少了计算量与内存消耗。
下面从模型训练侧和模型应用侧对本申请提供的方法进行描述:
本申请实施例提供的卷积神经网络的训练方法,涉及计算机视觉的处理,具体可以应用于数据训练、机器学习、深度学习等图像处理方法,对训练数据(如本申请中的输入图像)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的卷积神经网络;并且,本申请实施例提供的图像分割方法可以运用上述训练好的卷积神经网络,将输入数据(如本申请中的输入图像)输入到所述训练好的卷积神经网络中,得到输出数据(如本申请中的图像分割结果)。需要说明的是,本申请实施例提供的卷积神经网络的训练方法和图像分割方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。
由于本申请实施例涉及大量卷积神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及卷积神经网络等相关概念进行介绍。
(1)卷积神经网络是一种带有卷积结构的深度卷积神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器对一个输入的图像或者卷积特征平面(feature map)做卷积,输出一个卷积特征平面,卷积特征平面还可以称为特征图。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重对应的权重矩阵就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的,即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(2)损失函数
在训练卷积神经网络的过程中,因为希望卷积神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层卷积神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为卷积神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到卷积神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么卷积神经网络的训练就变成了尽可能缩小这个loss的过程。
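为便于理解上述损失函数在图像语义分割任务中的作用,下面给出一个计算逐像素交叉熵损失的示意性代码片段(该片段为基于PyTorch的示例性草图,其中的类别数、张量形状等均为假设值,并非对本申请实施例的限定):

    import torch
    import torch.nn.functional as F

    # 假设:pred为网络输出的分割结果,形状为(批大小, 类别数N, 高, 宽)
    # label为标注图像(标准结果),形状为(批大小, 高, 宽),每个元素为类别索引
    pred = torch.randn(1, 5, 64, 64)            # 示例:N=5个类别
    label = torch.randint(0, 5, (1, 64, 64))    # 示例标注

    # 逐像素交叉熵:预测值与目标值差异越大,loss越高
    loss = F.cross_entropy(pred, label)
    print(loss.item())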
(3)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正卷积神经网络中参数的大小,使得卷积神经网络输出的预测值与真正想要的目标值之间的误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的卷积神经网络中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的卷积神经网络的参数,例如权重矩阵,也就是卷积层的卷积核。
下面介绍本申请实施例提供的系统架构。
参见图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:一个或多个标注图像(即训练样本)以及该一个或多个标注图像对应的真实结果,即利用卷积神经网络处理该一个或多个标注图像期望得到的理想结果;并可将训练数据存入数据库130,训练设备120可基于数据库130中维护的训练数据训练得到目标模型/规则101(101就是前面介绍的经训练阶段训练得到的模型,可以是用于实现图像语义分割操作的卷积神经网络)。一个标注图像对应一个真实结果,也即ground truth。下面将以实施例一更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的图像语义分割方法,即,将输入图像通过相关预处理后得到的图像信息输入该目标模型/规则101,即可得到图像分割结果。本申请实施例中的目标模型/规则101具体可以为训练得到的卷积神经网络,在本申请提供的实施例中,该卷积神经网络是通过训练初始化的卷积神经网络得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如 应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实设备(augmented reality,AR),虚拟现实设备(virtual reality,VR),车载终端等,还可以是服务器等。在图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:输入图像,可以是执行设备110通过数据采集设备160采集的图像,还可以是数据库130中图像,还可以是来自客户设备140的图像。
预处理模块113用于根据I/O接口112接收到的输入数据(如所述输入图像)进行预处理,在本申请实施例中,预处理模块113可以用于实现图像滤波、图像预处理增强、图像预处理平滑、图像预处理复原等中的一项或多项操作,还用于实现其他预处理操作,本申请不做限定。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以实现相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的图像处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据训练得到相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图1仅是本发明实施例提供的一种系统架构的示意图,图1中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。本申请中,基于训练数据训练得到的目标模型/规则101可以是一个用于图像语义分割任务的卷积神经网络。
如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其 中池化层为可选的),以及卷积神经网络层230。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为卷积核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,取决于步长stride的取值,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素或者两个像素接着两个像素的进行处理从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。纵深维度也即是通道维度,对应于通道数。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
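下面用一个示意性代码片段说明多个卷积核提取特征并在纵深维度(即通道维度)上堆叠输出的过程(该片段为基于PyTorch的示例性草图,卷积核数量、尺寸等均为假设值,并非对本申请实施例的限定):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 64, 64)        # 假设输入图像:3个通道(纵深维度为3)
    # 16个3×3卷积核,每个卷积核的纵深维度与输入图像的纵深维度一致(均为3)
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
    y = conv(x)                           # 每个卷积核输出一个二维特征平面
    print(y.shape)                        # torch.Size([1, 16, 64, 64]):16个输出堆叠形成纵深维度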
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征,对应于高分辨率的特征图;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,对应于低分辨率的特征图,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,可以用于对输入图像进行采样得到较小尺寸的图像,还可以用于对卷积层输入的特征图进行采样得到较小尺寸的特征图。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
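下面给出平均池化与最大池化的一个示意性代码片段(该片段为基于PyTorch的示例性草图,池化核大小等参数均为假设值):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 8, 32, 32)           # 假设的输入特征图:8个通道,32×32
    avg = F.avg_pool2d(x, kernel_size=2)     # 平均池化:每个2×2子区域取平均值
    mx = F.max_pool2d(x, kernel_size=2)      # 最大池化:每个2×2子区域取最大值
    print(avg.shape, mx.shape)               # 输出尺寸均变为16×16,小于输入尺寸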
卷积神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(图像分割结果或其他相关信息),卷积神经网络200需要利用卷积神经网络层230来生成一个图像分割结果。因此,在卷积神经网络层230中可以包括多层隐含层(如图2所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像语义分割,图像分类,图像超分辨率重建等等。隐含层可以对卷积层/池化层220输出的特征图执行一系列的处理以得到图像分割结果。后续会详述如何由卷积层/池化层220输出的特征图得到图像分割结果的过程,这里不作详述。
在卷积神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果(即上述图像处理结果)和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
下面介绍本申请实施例提供的一种芯片硬件结构。
图3为本发明实施例提供的一种芯片硬件结构,该芯片包括卷积神经网络处理器30。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图3所示的芯片中得以实现。
卷积神经网络处理器30可以是卷积神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:NPU可以作为协处理器挂载到中央处理器(central processing unit,CPU),也即主CPU(Host CPU)上,由主CPU为其分配任务,例如图像处理任务。NPU的核心部分为运算电路303,通过控制器304控制运算电路303提取存储器(301和302)中的矩阵数据并进行乘加运算。
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行 例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路303从权重存储器302中取矩阵B的权重值,并缓存在运算电路303中的每一个PE上。运算电路303从输入存储器301中取矩阵A的输入数据,根据矩阵A的输入数据与矩阵B的权重值进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。输入数据可以为输入图像,权重矩阵即为卷积核。权重数据也可以称为权重矩阵。
统一存储器306用于存放输入数据以及输出数据。权重矩阵直接通过存储单元访问控制器(direct memory access controller,DMAC)305,被搬运到权重存储器302中。输入数据也通过DMAC被搬运到统一存储器306中。输出数据即为图像分割结果。
总线接口单元(bus interface unit,BIU)310,用于DMAC和取指存储器(instruction fetch buffer)309的交互;总线接口单元310还用于取指存储器309从外部存储器获取指令;总线接口单元310还用于存储单元访问控制器305从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器306中,或将权重数据搬运到权重存储器302中,或将输入数据搬运到输入存储器301中。
向量计算单元307可以包括多个运算处理单元,在需要的情况下,对运算电路303的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。向量计算单元307主要用于卷积神经网络中非卷积层,或全连接层(fully connected layers,FC)的计算,具体可以处理:池化(pooling),归一化(normalization)等的计算。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。
在一些实现中,向量计算单元307将经处理的向量存储到统一存储器306。在一些实现中,经向量计算单元307处理过的向量能够用作运算电路303的激活输入,例如用于卷积神经网络中后续层中的使用,如图2所示,若当前处理层是隐含层1(231),则经向量计算单元307处理过的向量还可以被用到隐含层2(232)中的计算。
控制器304连接的取指存储器309,用于存储控制器304使用的指令。
统一存储器306,输入存储器301,权重存储器302以及取指存储器309均为On-Chip存储器。外部存储器可以独立于该NPU硬件架构。
其中,图2所示的卷积神经网络中各层的运算可以由运算电路303或向量计算单元307执行。
下面将以实施例一更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,即如何基于训练数据训练得到用于实现本申请实施例提供的图像分割方法的卷积神经网络。
实施例一、
图4为本发明实施例一提供的一种卷积神经网络的训练方法400,该方法可包括:
S401、训练设备获得变换矩阵。
训练设备在S403中需要用到变换矩阵,因此该训练设备需要获得变换矩阵。训练设备 可以从数据库130获得该变换矩阵,也可以从其他设备获得该变化矩阵,还可以由训练样本得到变换矩阵。后续会详述如何由训练样本得到变换矩阵的方法。
S402、训练设备初始化卷积神经网络。
初始化卷积神经网络包括初始化卷积神经网络的各层卷积层的卷积核以及其他层(例如池化层、卷积神经网络层和全连接层)的参数。训练设备可以采用任意初始化方法,例如采用高斯分布随机数采样、均匀分布随机数采样等方法来初始化卷积神经网络。
S403、训练设备使用卷积神经网络对训练样本做处理,得到该训练样本的图像分割结果。
该训练样本的图像分割结果用于指示该训练样本中目标对象所处的区域。训练样本可以理解为一个输入图像,训练设备利用卷积神经网络对训练样本所做的处理与执行设备使用卷积神经网络对输入图像所做的处理相同。下面会详述执行设备使用卷积神经网络对输入图像做处理的过程,这里不再详述训练设备使用卷积神经网络对训练样本做处理的过程。可选的,训练设备在使用卷积神经网络对训练样本做处理之前,还可以对该训练样本做预处理。例如对训练样本进行图像滤波、图像预处理增强、图像预处理平滑、图像预处理复原。训练设备还可以对该训练样本进行其他图像预处理操作,本申请不做限定。图像滤波主要包括调整图像尺寸,并对缩放后的图像中的噪声进行去噪平滑处理。图像预处理增强是对图像中的信息有选择地加强和抑制,以改善图像的视觉效果,或将图像转变为更适合于机器处理的形式,以便于数据抽取或识别。图像预处理平滑是消除图像中的随机噪声。图像预处理复原是校正各种原因所造成的图像退化,使重建或估计得到的图像尽可能逼近于理想无退化的图像。
S404、训练设备根据该训练样本的图像分割结果和所述训练样本对应的标准结果,确定该训练样本对应的损失值。
该训练样本对应的标准结果(也称真实结果)为利用卷积神经网络处理该训练样本期望得到的结果。训练设备可以利用卷积神经网络所做的图像语义分割任务对应的损失函数来计算该训练样本对应的损失值。如前文的基础概念介绍所述,损失函数定义“如何比较预测值和目标值之间的差异”,即损失函数是用于衡量预测值和目标值的差异的重要方程。本申请实施例中,训练样本的图像分割结果对应于预测值,训练样本的标准结果对应于目标值。损失函数的输出值(loss)越高表示图像分割结果与标准结果的差异越大,那么卷积神经网络的训练就变成了尽可能缩小这个loss的过程。
S405、训练设备判断卷积神经网络是否收敛。
若是,执行S407;否则,执行S406。训练设备判断卷积神经网络是否收敛可以是判断更新卷积神经网络的参数的次数是否到达迭代阈值,即S406执行的次数;也可以是判断卷积神经网络的损失值是否低于损失阈值。卷积神经网络的损失值是训练设备利用该卷积神经网络的损失函数计算得到的该卷积神经网络输出的图像分割结果和标准结果之间的误差。训练设备的训练任务不同,卷积神经网络的损失函数也不同。迭代阈值可以是训练设备预先设置的迭代次数,例如10000次、20000次等。损失阈值可以是训练设备预先设置的,若卷积神经网络输出的图像处理结果与标准结果之间的差值小于该损失阈值,则结束训练。
S406、训练设备利用该训练样本对应的损失值,通过优化算法更新该卷积神经网络的参数。
训练设备可以利用得到的损失值通过反向传播算法更新卷积神经网络的参数。例如使用随机梯度下降算法利用训练样本对应的损失值更新卷积神经网络的参数。
S407、结束训练。
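结合S401至S407,下面给出训练流程的一个示意性代码片段(该片段为基于PyTorch的示例性草图,其中的优化算法、学习率、迭代阈值与损失阈值等均为假设值,仅用于说明各步骤的先后关系,并非对本申请实施例的限定):

    import torch

    def train(model, loader, criterion, max_iter=10000, loss_thresh=1e-3):
        # S402:初始化后的卷积神经网络model;S406中假设使用随机梯度下降更新参数
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        it = 0
        for sample, target in loader:        # 训练样本及其对应的标准结果(ground truth)
            pred = model(sample)              # S403:得到训练样本的图像分割结果
            loss = criterion(pred, target)    # S404:计算训练样本对应的损失值
            optimizer.zero_grad()
            loss.backward()                   # 反向传播误差损失信息
            optimizer.step()                  # S406:更新卷积神经网络的参数
            it += 1
            # S405:判断是否收敛(迭代次数达到迭代阈值或损失值低于损失阈值)
            if it >= max_iter or loss.item() < loss_thresh:
                break                         # S407:结束训练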
所述方法400具体可以由如图1所示的训练设备120执行,所述方法400中的输入图像(即训练样本)可以是如图1所示的数据库130中维护的训练数据。可选的,在执行S403之前,可以对训练样本做图像预处理,S403处理的训练样本是图像预处理后的训练样本。对训练样本的图像预处理操作可以在训练设备120中执行,也可以在输入训练设备120之前由其他功能模块预先执行,即先对从所述数据库130中接收或者获取到的训练样本进行图像预处理,得到图像预处理后的训练样本,作为所述训练设备120的输入,并由所述训练设备120执行S401至S407。
可选的,所述方法400可以由CPU处理,也可以由CPU和图形处理器(graphics processing unit,GPU)共同处理,也可以不用GPU,而使用其他适合用于卷积神经网络计算的处理器,本申请不做限制。
本申请实施例中,可以训练得到一个使用较低分辨率的特征图得到准确的图像分割结果的卷积神经网络,执行设备使用训练得到的卷积神经网络做图像语义分割能够大幅减少其计算量与内存消耗。
实施例一中,训练设备执行S403的过程中需要使用变换矩阵。下面介绍一下训练设备如何由训练样本得到变换矩阵的方法。图5为本申请实施例提供的一种由训练样本生成变换矩阵的方法流程图,该方法可包括:
S501、训练设备将训练数据中的每个训练样本(即标注图像)分成多个对应(A×B×N)的三维矩阵的子样本。
每个子样本可以理解为训练样本中的一个小块,即训练样本的一部分。训练数据中的每个训练样本均为一个(H×W×N)的三维矩阵。训练设备可以对每个训练样本进行分块操作,即将每个训练样本对应的(H×W×N)的三维矩阵分成多个(A×B×N)的子矩阵(子样本)。图6为本申请实施例提供的一种变换矩阵的计算过程示意图。如图6所示,可以将一个(H×W×N)的三维矩阵,即一个训练样本,分成多个子样本,每个子样本对应一个(A×B×N)的三维矩阵。A、B以及N均为大于0的整数。N为每个训练样本中的图像语义被分割的类别数。
S502、训练设备将每个子样本排列为一个包括(A×B×N)个元素的向量。
如图6所示,训练设备将每个子样本(图6中的小块)重排列得到一个包括(4×N)个元素的向量。训练设备可以由一个子样本得到一个包括(A×B×N)个元素的向量。
S503、训练设备将获得的所有包括(A×B×N)个元素的向量进行主成分分析,得到(A×B×N)×(A×B×N)的中间矩阵。
该中间矩阵为一个二维矩阵。主成分分析(principal component analysis,PCA)是一种统计方法,通过正交变换将一组可能存在相关性的变量转换为一组线性不相关的变量,转换后的这组变量叫主成分。训练设备实现S503的步骤可以如下:(1)训练设备将获得的所有包括(A×B×N)个元素的向量,合并成Q×(A×B×N)的二维矩阵X’。(2)对X’ 进行归一化(均值为0,标准差为1),得到归一化的二维矩阵X。(3)对二维矩阵X进行奇异值分解得到(P×P)的中间矩阵。Q为所有包括(A×B×N)个元素的向量的个数。该中间矩阵为对X进行奇异值分解得到的本征矩阵U。P=A×B×N。对X进行奇异值分解的公式如下:
(U,S,V^T)=SVD(X);
其中,U和V的列分别叫做X的左奇异向量(left-singular vectors)和右奇异向量(right-singular vectors),S的对角线上的值叫做X的奇异值(singular values)。U的列由XX^T的单位化过的特征向量构成;V的列由X^TX的单位化过的特征向量构成;S的对角元素来源于X^TX或XX^T的特征值的平方根,并且是按从大到小的顺序排列的。奇异值分解(singular value decomposition,SVD)是一种矩阵分解的方法。SVD是一种常用的方法,这里不再详述。
S504、训练设备从该中间矩阵中取出前C维主成分,得到最终的变换矩阵。
变换矩阵为一个(C×(A×B×N))的二维矩阵。举例来说,A和B均为2,变换矩阵为一个(C×4N)的二维矩阵。变换矩阵可以为该中间矩阵的前C行对应的子矩阵。
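下面给出由标注图像计算变换矩阵的一个示意性代码片段(该片段为基于NumPy的示例性草图,标注图像的输入形式、归一化细节以及奇异向量的排布约定均为简化假设,并非对本申请实施例的限定):

    import numpy as np

    def build_transform_matrix(labels, A=2, B=2, C=64):
        # labels:M个标注图像的列表,此处假设每个标注图像为(高,宽,N)的one-hot三维矩阵
        vectors = []
        for y in labels:
            h, w, n = y.shape
            for i in range(0, h - h % A, A):
                for j in range(0, w - w % B, B):
                    block = y[i:i + A, j:j + B, :]        # S501:一个(A×B×N)的子样本
                    vectors.append(block.reshape(-1))      # S502:排列为包括A*B*N个元素的向量
        X = np.stack(vectors)                              # 合并成Q×(A*B*N)的二维矩阵X'
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # 归一化(均值为0,标准差为1)
        U, S, Vt = np.linalg.svd(X, full_matrices=False)   # S503:奇异值分解
        # S504:取前C维主成分,得到(C×(A*B*N))的变换矩阵
        # 文中将(P×P)的中间矩阵记为本征矩阵U;此处按X为Q×P的排布取右奇异向量,排布约定不同时对应关系相应调整
        return Vt[:C]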
本申请实施例中,由训练样本生成变换矩阵,以便于使用较低分辨率的特征图得到准确的图像分割结果。
前述实施例介绍了如何训练得到用于实现图像语义分割任务的卷积神经网络的训练方法,下面介绍如何利用训练得到的卷积神经网络执行图像语义分割任务的方法。图7为本申请实施例提供的一种图像分割方法,该方法可包括:
S701、图像处理装置获得输入图像和处理需求。
输入图像在通道上的二维矩阵为(H×A)×(W×B)的矩阵。图像处理装置即是前面提到的执行设备。H、W、A以及B均为大于0的整数。图像处理装置获得输入图像可以是图像处理装置利用摄像头获取该输入图像,也可以是从客户设备、数据库获得该输入图像,还可以通过其他方式获得该输入图像,本申请不作限定。所述处理需求可以是用户输入的,也可以是图像处理装置预先配置的。所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组(即图像分割结果)进行目标处理以得到目标图像。所述处理需求可以指示对输入图像中除该目标对象所处的区域之外的区域进行调整,例如调整景深、换背景等;可以指示确定输入图像中的人像所处的区域,并仅保留该人像所处的区域的颜色;还可以指示对该输入图像进行其他处理,本申请不作限定。本申请中,目标特征图组与图像分割结果是相同的概念。图像处理装置可以根据目标特征图组确定输入图像中不同对象所处的区域,例如人像所处的区域。
S702、图像处理装置对输入图像进行多层次的特征提取,得到多个特征图。
S703、图像处理装置对上述多个特征图进行下采样,得到多个具有参考分辨率的特征图。
所述参考分辨率低于所述输入图像的分辨率。
S704、图像处理装置对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组。
S705、图像处理装置利用变换矩阵W对所述特征图组进行上采样,得到目标特征图组。
所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同。上述目标特征图组即为对该输入图像进行图像分割得到的图像分割结果。该目标特征图组用于指示该输入图像中的目标对象所处区域,该目标对象可以是该输入图像中的人像,也可以是预先设置的检测对象(例如猫、狗等),还可以是用户选择的该输入图像中的对象。
S706、图像处理装置根据所述处理需求,对所述目标特征图组进行目标处理,得到目标图像。
上述对所述目标特征图组进行目标处理可以是根据目标特征图组确定上述输入图像中不同对象所处的区域,进而对某个区域做目标处理。举例来说,图像处理装置在根据目标特征图组确定输入图像中被拍摄对象所处的区域后,让被拍摄对象的前景清晰,背景虚化,实现单反大光圈的效果。又举例来说,用户利用图像处理装置拍摄图像后,可以选择需要保留颜色的人像(即处理需求),该图像处理装置对该图像进行图像语义分割,并根据得到的图像分割结果确定该人像所处的区域,进而仅保留该图像中该人像所处区域的颜色。又举例来说,用户利用图像处理装置拍摄图像后,该图像处理装置对拍摄的图像进行图像语义分割,并根据图像分割结果确定该图像中的目标对象(例如人像)所处的区域,以便该用户对该图像中除该目标对象所处的区域之外的区域进行调整,例如调整景深、换背景等。
下面详细描述一下S702-S703的实现方式。图像处理装置执行S702-S703的步骤可以如下:图像处理装置利用卷积神经网络处理输入图像以得到K个参考分辨率的特征图。
该K个参考分辨率的特征图对应C个(H×W)的二维矩阵,C和K均为大于0的整数。图8为本申请实施例提供的一种利用卷积神经网络处理输入图像以得到K个参考分辨率的特征图的示意图。参阅图8,图像处理装置利用卷积神经网络处理该输入图像以得到K个参考分辨率的特征图的实现方式可以如下:对该输入图像进行卷积操作得到第一特征图,对该第一特征图进行卷积操作得到第二特征图,依次类推,直到对第(K-1)特征图进行卷积操作得到第K特征图;对该第一特征图进行下采样处理得到一个参考分辨率的特征图,对该第二特征图进行下采样处理得到一个参考分辨率的特征图,依次类推直到对第(K-1)特征图进行下采样处理得到一个参考分辨率的特征图。其中,该第K特征图为一个参考分辨率的特征图;该第(K-1)特征图的分辨率不高于该第K特征图的分辨率,K为大于1的整数。图8的虚线框中的特征图为处理输入图像得到的K个参考分辨率的特征图。在该实现方式中,该第一特征图至该第K特征图的分辨率依次降低。该卷积神经网络可以包括多个卷积层(对应卷积模块)和下采样层(对应下采样模块),上一层卷积层输出的特征图为下一层卷积层的输入。也就是说,图像处理装置可以利用卷积层对输入图像进行卷积操作得到一个特征图,对得到的特征图继续使用卷积层进行卷积操作,得到新的特征图,持续操作,直到达到指定的卷积操作次数,得到K个分辨率不同的特征图。可选的,图像处理装置利用使用不同的卷积核对同一个特征图进行卷积操作得到不同的特征图。也就是说,这K个特征图中的不同特征图可以是由同一个特征图得到的。本申请不限定图像处理装置由输入图像得到K个特征图的方式。
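下面给出"多层次卷积得到K个特征图并分别下采样到参考分辨率"的一个示意性代码片段(该片段为基于PyTorch的示例性草图,卷积层数、通道数以及下采样方式均为假设,实际的卷积神经网络结构不限于此):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Backbone(nn.Module):
        def __init__(self, K=4):
            super().__init__()
            chs = [3, 32, 64, 128, 256]
            # K层卷积层:上一层卷积层输出的特征图为下一层卷积层的输入,分辨率逐层降低
            self.convs = nn.ModuleList(
                nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1) for i in range(K))

        def forward(self, x):
            feats = []
            for conv in self.convs:
                x = F.relu(conv(x))
                feats.append(x)
            ref_size = feats[-1].shape[-2:]      # 参考分辨率:第K特征图的分辨率
            # 对第一至第(K-1)特征图分别下采样(此处假设采用双线性插值)到参考分辨率
            return [F.interpolate(f, size=ref_size, mode='bilinear', align_corners=False)
                    for f in feats[:-1]] + [feats[-1]]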
下面举例说明图像处理装置实现卷积操作的过程。举例来说,图像处理装置计算第(l-1)层卷积层的输入图像(即该层卷积层输入的特征图)与卷积核进行卷积,添加偏置后,通过激活函数f,得到特征图,即:
x_j^l = f( ∑_{i∈M_j} x_i^(l-1) * k_ij^l + b_j^l )    (1)
公式(1)中的M_j代表了与第j个神经元连接的一系列输入图像,x_i^(l-1)为第(l-1)层输入的第i个特征图,k_ij^l为卷积核,b_j^l为偏置,(*)代表卷积运算,∑(·)代表了求和运算。激活函数f可以选择sigmoid函数、tanh函数、ReLU函数或其它类型的激活函数,本申请不作限定。
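作为公式(1)的一个示意性实现,下面给出基于PyTorch的示例性代码片段(其中激活函数假设选用ReLU,卷积核与偏置均为随机示例值,并非对本申请实施例的限定):

    import torch
    import torch.nn.functional as F

    x_prev = torch.randn(1, 4, 32, 32)    # 第(l-1)层卷积层输入的特征图,4个通道
    kernel = torch.randn(8, 4, 3, 3)      # 卷积核k,对应8个输出特征平面
    bias = torch.randn(8)                  # 偏置b
    # 公式(1):先卷积并求和,再加偏置,最后通过激活函数f
    x_curr = F.relu(F.conv2d(x_prev, kernel, bias=bias, padding=1))
    print(x_curr.shape)                    # torch.Size([1, 8, 32, 32])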
图像处理装置可以采用例如:双线性插值、最邻近插值、中值插值、均值插值等下采样的方式对卷积层(即卷积模块)输出的各个特征图进行下采样,以降低各特征图的分辨率,使得每个特征图的分辨率与最后一个卷积层输出的特征图的分辨率保持一致。下面以双线性下采样为例,说明下采样的过程。假设原始图像(即输入图像)的大小为size=m×n,其中,m与n分别是该原始图像的行数与列数。若原始图像的缩放因子(也称比例因子)是t(0<t<1),即将原始图像缩小1/t倍,则目标图像的大小size=(m×t)×(n×t)。对于该目标图像的某个像素点P(x,y)通过P/t可得到对应的原始图像的像素点P'的坐标(x1,y1)。其中,x1=x/t,y1=y/t,如果x1和y1都不是整数,可以找出与(x1,y1)相邻的四个点的灰度f1、f2、f3、f4,使用双线性插值算法就可以得到这个像素点P'(x1,y1)的灰度。
一个完整的双线性插值算法可描述如下:
(1)通过原始图像和比例因子得到新图像的大小,并创建新图像(即目标图像)。
(2)由新图像的某个像素(x,y)映射到原始图像(x’,y’)处。
(3)对x’和y’取整得到(xx,yy)并得到(xx,yy)、(xx+1,yy)、(xx,yy+1)和(xx+1,yy+1)这四个像素点的值。
(4)利用得到的四个像素点的值进行双线性插值得到像素点(x,y)的值并写回新图像。
(5)重复步骤(2)直到确定新图像的所有像素的值。
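下面给出上述双线性插值缩放过程的一个示意性代码片段(该片段为基于NumPy的示例性草图,仅处理单通道灰度图,边界处理做了简化假设,并非对本申请实施例的限定):

    import numpy as np

    def bilinear_resize(img, t):
        # img:原始图像(m×n的灰度矩阵);t:缩放因子(0<t<1时为缩小)
        m, n = img.shape
        out_h, out_w = int(m * t), int(n * t)            # 新图像(目标图像)的大小
        out = np.zeros((out_h, out_w), dtype=img.dtype)
        for x in range(out_h):
            for y in range(out_w):
                x1, y1 = x / t, y / t                     # 映射回原始图像的坐标
                xx, yy = int(x1), int(y1)                 # 对坐标取整
                xx2, yy2 = min(xx + 1, m - 1), min(yy + 1, n - 1)
                dx, dy = x1 - xx, y1 - yy
                # 利用四个相邻像素f1~f4做双线性插值,得到像素点(x,y)的值并写回新图像
                f1, f2 = img[xx, yy], img[xx, yy2]
                f3, f4 = img[xx2, yy], img[xx2, yy2]
                out[x, y] = (f1 * (1 - dx) * (1 - dy) + f2 * (1 - dx) * dy
                             + f3 * dx * (1 - dy) + f4 * dx * dy)
        return out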
下面详细描述一下S704-S705的实现方式。图像处理装置实现S704-S705的步骤可以如下:图像处理装置利用卷积神经网络分别计算(H×W)个均包括C个元素的一维矩阵与变换矩阵的乘积得到(H×W)个均包括P个元素的一维矩阵;图像处理装置利用卷积神经网络分别对该(H×W)个均包括P个元素的一维矩阵进行特征排列以得到目标特征图组。
该(H×W)个均包括C个元素的一维矩阵中任一矩阵包括的元素为该K个参考分辨率的特征图对应的C个(H×W)的二维矩阵中的每个二维矩阵中同一位置的元素。该C个(H×W)的二维矩阵对应一个(H×W×C)的三维矩阵。图9为本申请实施例提供的一种上采样过程示意图。图9中的(H×W×C)的三维矩阵即为该C个(H×W)的二维矩阵对应的三维矩阵。如图9所示,该(H×W×C)的三维矩阵中每个元素位置在通道维度上对应的一个包括C个元素的一维矩阵,例如该三维矩阵中的黑色柱型区域对应一个包括C个元素的一维矩阵。可以理解,该C个(H×W)的二维矩阵对应(H×W)个均包括C个元素的一维矩阵,每个C个元素的一维矩阵与变换矩阵相乘可以得到一个包括P个元素的一维矩阵。该变换矩阵为由M个标注图像得到的(C×P)的二维矩阵,P=A×B×N, N为该M个标注图像中的图像语义被分割的类别数,前述实施例已介绍得到该变换矩阵的方式,这里不再详述。其中,H、W、C、N、P、M、K、A以及B均为大于0的整数。
图像处理装置利用卷积神经网络分别对该(H×W)个均包括P个元素的一维矩阵进行特征排列以得到图像分割结果的方式如下:图像处理装置利用卷积神经网络根据每个包括P个元素的一维矩阵,确定(A×B)个均包括N个元素的一维矩阵;利用由一个包括P个元素的一维矩阵得到的(A×B)个均包括N个元素的一维矩阵,得到一个(A×B×N)的三维矩阵;将每个(A×B×N)的三维矩阵作为图像分割结果包括的一个子矩阵。可以理解,每个包括P个元素的一维矩阵经过特征排列可以得到一个(A×B×N)的三维矩阵,并作为图像分割结果包括的一个子矩阵。图9中的((H×A)×(W×B)×N)的三维矩阵即为图像分割结果。如图9所示,图像处理装置可以利用每个包括P个元素的一维矩阵得到一个(A×B×N)的三维矩阵,并作为图像分割结果的一部分。在实际应用中,图像处理装置可以依次处理每个包括P个元素的一维矩阵得到一个(A×B×N)的三维矩阵,并作为图像分割结果包括的一个子矩阵,最终得到该图像分割结果。
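下面给出利用变换矩阵W对融合特征图进行上采样(先矩阵相乘、再特征排列)的一个示意性代码片段(该片段为基于NumPy的示例性草图,各维度取值均为假设值,并非对本申请实施例的限定):

    import numpy as np

    def upsample_with_w(fused, W, A=2, B=2, N=5):
        # fused:融合特征图,(H×W×C)的三维矩阵;W:变换矩阵,(C×P)的二维矩阵,P=A*B*N
        H, Wd, C = fused.shape
        # 分别计算H*W个包括C个元素的一维矩阵与变换矩阵的乘积,得到H*W个包括P个元素的一维矩阵
        prod = fused.reshape(H * Wd, C) @ W                        # 形状为(H*W, P)
        # 特征排列:每个包括P个元素的一维矩阵重排为一个(A×B×N)的子矩阵
        out = prod.reshape(H, Wd, A, B, N)
        out = out.transpose(0, 2, 1, 3, 4).reshape(H * A, Wd * B, N)
        return out                                                  # ((H×A)×(W×B)×N)的目标特征图组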
本申请实施例中,图像处理装置利用卷积神经网络对输入图像进行卷积操作以及下采样得到多个较低分辨率的特征图,并使用较低分辨率的特征图进行特征排列得到图像分割结果,能够有效减少内存占用以及计算量,并保持图像语义分割的精度较高。
在一个可选的实现方式中,图像处理装置实现S704的步骤可以如下:图像处理装置将该K个参考分辨率的特征图在通道维度上进行拼接以得到融合特征图,即图9中的(H×W×C)的三维矩阵。
该图像处理装置分别计算该融合特征图中的每个元素位置对应的一维矩阵与所述变换矩阵的乘积,得到该(H×W)个均包括P个元素的一维矩阵。该融合特征图中的一个元素位置对应的一维矩阵包括的元素为该C个(H×W)的二维矩阵中的每个二维矩阵中同一元素位置的元素。该融合特征图为一个(H×W×C)的三维矩阵且对应该C个(H×W)的二维矩阵。本申请中,融合特征图即为步骤S704中对多个具有参考分辨率的特征图进行融合得到的至少一个特征图组。图10为本申请实施例提供的一种特征图融合过程以及上采样过程的示意图。如图10所示,虚线构成的矩形框中的特征图即为图像处理装置上述K个参考分辨率的特征图;(H×W×C)的三维矩阵即为融合该K个参考分辨率的特征图得到的融合特征图;((H×A)×(W×B)×N)的三维矩阵为上采样该融合特征图得到的图像分割结果。其中,图10中的上采样即为图9中的上采样。卷积神经网络的一个隐含层(对应特征图融合模块)用于融合该K个参考分辨率的特征图得到的该融合特征图。卷积神经网络的一个隐含层(对应上采样模块)用于上采样该融合特征图以得到图像分割结果。
图像处理装置可以按照通道维度对该K个参考分辨率的特征图进行拼接。对于任意特征图,有如下维度的描述:n*Channel*H*W。其中,H、W分别代表特征图长和宽;n代表输入整个卷积神经网络的图像数量,Channel代表通道数。图像处理装置可以将两个及两个以上的特征图按照在Channel维度(即通道维度)或n维度上进行拼接。特征图融合模块的作用就是将两个及两个以上的特征图按照在Channel维度(即通道维度)或n维度上进行拼接。举个例子,如果是在Channel维度上拼接特征图1和特征图2的话,首先除了channel维度可以不一样,其余维度必须一致(也就是n、H、W一致)。如图11所示(为了便于画图说明,n设置为1),图像处理装置所做的操作仅仅是在特征图1的Channel 1加上特征图2的Channel 2,得到的融合特征图的维度为:n*(Channel 1+Channel 2)*H*W。
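下面给出在Channel维度上拼接两个特征图的一个示意性代码片段(该片段为基于PyTorch的示例性草图,特征图的通道数等均为假设值):

    import torch

    # n=1,除Channel维度外其余维度(n、H、W)必须一致
    feat1 = torch.randn(1, 24, 32, 32)    # 特征图1:Channel 1=24
    feat2 = torch.randn(1, 40, 32, 32)    # 特征图2:Channel 2=40
    fused = torch.cat([feat1, feat2], dim=1)
    print(fused.shape)                     # torch.Size([1, 64, 32, 32]),即n*(Channel 1+Channel 2)*H*W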
下面结合图8至图10来介绍本申请提供的图像分割方法。图12为本申请实施例提供的另一种图像分割方法,该方法可包括:
S1201、图像处理装置获取输入图像。
S1202、图像处理装置利用卷积神经网络的各卷积层对该输入图像进行卷积操作,得到K个特征图。
该多个特征图对应于前面提到的第一特征图至第K特征图。图像处理装置可以利用第一层卷积层对输入图像进行卷积操作得到一个特征图,对得到的特征图继续使用第二层卷积层进行卷积操作,得到新的特征图,持续操作,直到达到指定的卷积操作次数,得到K个分辨率不同的特征图。也就是说,前一层卷积层输出的特征图为后一层卷积层的输入,卷积神经网络的各层卷积层输出的特征图组成该K个特征图。各卷积层执行的卷积操作可参阅公式(1)。
S1203、图像处理装置对该K个特征图中的(K-1)个特征图进行下采样得到(K-1)个参考分辨率的特征图。
该(K-1)个特征图为该K个特征图中除卷积神经网络的最后一层卷积层输出的特征图之外的特征图。下采样过程可参阅图8。
S1204、图像处理装置融合该(K-1)个参考分辨率的特征图以及卷积神经网络的最后一层卷积层输出的特征图得到融合特征图。
S1204对应图10和图11中的融合操作。
S1205、图像处理装置对该融合特征图进行上采样得到图像分割结果。
S1205对应图10中的上采样操作。卷积神经网络的隐含层可以实现S1203中的下采样操作、S1204中的融合操作以及S1205中的上采样操作。
在实际应用中,图像处理装置在得到输入图像的图像分割结果后,可以根据该图像分割结果做进一步的处理。图像处理装置可以是移动终端,例如手机。举例来说,用户利用移动终端的相机功能对采集的图像进行实时图像语义分割以得到图像分割结果,该移动终端在根据该图像分割结果确定该图像中的被拍摄对象所处的区域后,让被拍摄对象的前景清晰,背景虚化,实现单反大光圈的效果。又举例来说,用户利用移动终端拍摄图像后,可以选择需要保留颜色的人像,该移动终端对该图像进行图像语义分割,并根据得到的图像分割结果确定该人像所需的区域,进而仅保留该图像中该人像所处区域的颜色。又举例来说,用户利用移动终端拍摄图像后,该移动终端对拍摄的图像进行图像语义分割,并根据图像分割结果确定该图像中的目标对象(例如人像)所处的区域,以便该用户对该图像中除该目标对象所处的区域之外的区域进行调整,例如调整景深、换背景等。又举例来说,用户开启移动终端的视频拍摄功能,在拍摄视频的过程中,该移动终端实时进行图像语义分割,根据图像分割结果确定出人像区域后,仅保留该人像区域的颜色,实现视频人像留色。又举例来说,用户开启移动终端的视频拍摄功能,该移动终端实时进行图像语义分割,在被拍摄者有多人的情况下,该移动终端根据图像分割结果对所有人像进行分割,用户可以任意点选需要保留清晰的目标人像,该移动终端将该图像中除该目标人像所处区域之外 的部分全部虚化,以实现电影模式的效果。又举例来说,自动驾驶装置(例如汽车)对采集的图像实时进行图像语义分割,在根据图像分割结果分割出该图像中的各个对象后,对分割出的各对象进行物体检测,以便于更准确的识别出行人、障碍物以及车辆等。可以理解,图像处理装置可以利用图像分割结果准确地确定输入图像中的各对象所处的区域,以便于对图像中的不同对象或不同区域执行不同的处理。
前述实施例介绍了图像分割方法,下面介绍图像处理装置的结构,并结合其结构进一步介绍该图像处理装置实现图像语义分割任务所执行的操作。图像处理装置即为执行设备。图13为本申请提供的一种图像处理装置的结构示意图。如图13所示,该图像处理装置1300可包括:
获取单元1301,用于获得输入图像和处理需求;该输入图像在通道上的二维矩阵为(H×A)×(W×B)的矩阵;所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组(即图像分割结果)进行目标处理以得到目标图像;
处理单元1302,用于对所述输入图像进行多层次的特征提取,得到多个特征图;对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图;所述参考分辨率低于所述输入图像的分辨率;对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组;利用变换矩阵W对所述特征图组进行上采样,得到目标特征图组,所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同;根据所述处理需求,对所述目标特征图组进行目标处理,得到目标图像。
获取单元1301的功能可以由图像处理装置中的摄像头或者I/O接口实现。处理单元1302的功能可以由图像处理装置中的CPU实现,也可以由CPU配合其他处理器(例如NPU、TPU、GPU等)实现。
在一个可选的实现方式中,如图14所示,处理单元1302可包括:
卷积模块1401,用于对输入图像和/或特征图进行卷积操作以得到特征图,并向下一层卷积层输出得到的特征图;
下采样模块1402,用于对各卷积模块输出的特征图进行下采样以得到参考分辨率的特征图;
特征图融合模块1403,用于融合各参考分辨率的特征图以得到融合特征图;
上采样模块1404,用于对该融合特征图进行特征排列以得到图像分割结果。
卷积模块1401用于实现卷积神经网络中各卷积层的卷积操作,参阅图8中的卷积操作。可选的,该图像处理装置包括一个卷积模块,该卷积模块实现各卷积层的卷积操作。可选的,该图像处理装置包括K个卷积模块,每个卷积模块用于实现一个卷积层的卷积操作。下采样模块1402用于实现图8中的下采样,即对除最后一层卷积层输出的特征图之外的各特征图进行下采样以得到参考分辨率的特征图。特征图融合模块1403用于实现图10和图11中的特征图融合操作。上采样模块1404用于实现图10中的上采样操作。卷积模块1401、下采样模块1402、特征图融合模块1403、上采样模块1404可以均用软件实现,也可以均用硬件实现,还可以一个部分用软件实现另一部分用硬件实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。上述计算机程序产品包括一个或 多个计算机指令。在图像处理装置上加载或执行上述计算机程序指令时,全部或部分地产生按照本发明实施例上述的流程或功能。上述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。上述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state Drive,SSD)。可选的,图像处理装置运行存储于图像处理装置的存储器的软件代码实现卷积模块1401、下采样模块1402、特征图融合模块1403、上采样模块1404的功能,即实现处理单元1302的功能。可选的,图像处理装置运行固化于该图像处理装置的处理器的硬件代码实现前述图像分割方法。
目前,对图像先进行编码,再进行解码的架构是计算机视觉任务中常用的一种图像处理的方法,很多计算机视觉的技术都使用了这个框架。本申请实施例中,图像处理装置也是使用对图像先进行编码,再进行解码的架构,即采用编码器-解码器架构的卷积神经网络,来处理图像语义分割任务。卷积神经网络可以分为编码器和解码器两部分,其中,编码器包括图14中的卷积模块1401和下采样模块1402,解码器包括图14中的特征图融合模块1403和上采样模块1404。本申请提供的方案相比于现有技术方案至少具有以下两个优势:
1、对高层特征图进行信息融合,保留了原始的结构信息,分割精度提升。
现有技术方案中,为了获得高分辨率的预测,解码器模块只能选择具有高分辨率的低层特征图进行特征图聚合。也就是说,高层低分辨率的特征图,进行上采样之后,与低层高分辨率的特征图进行融合。本方案中,将低层高分辨率特征图进行下采样后,与高层低分辨率特征图直接融合,如图9和图10所示。同时在后续上采样过程中,采用了数据相关的上采样模块,保留了输入图片的原始结构信息,分割精度得到提升。
2、计算量降低、内存消耗减少。
现有技术方案中,解码器模块选择了高分辨率的低层特征图进行特征图融合。由于卷积神经网络的计算量取决于特征图的分辨率大小,使用低层特征图进行特征图融合,会显著提高卷积神经网络的计算量,因此现有技术方案计算量较大,无法在手机端实时运行。在本方案中,选用更低分辨率的特征图进行特征图融合,既保证了分割精度的提升,同时大幅减少了计算量与内存消耗。
图15为本申请提供的一种卷积神经网络的训练装置的结构示意图。如图15所示,该训练装置1500可包括:
获取单元1501,用于获得上述变换矩阵;
处理单元1502,用于使用卷积神经网络对训练样本做处理,得到该训练样本的图像分割结果;根据该训练样本的图像分割结果和该训练样本对应的标准结果,确定该训练样本对应的损失;利用该训练样本对应的损失,通过优化算法更新该卷积神经网络的参数。
该训练样本包括上述M个标注图像中的至少一个;该标准结果为利用该卷积神经网络处理所述训练样本期望得到的结果。
本申请实施例中,训练装置使用训练样本训练卷积神经网络,可以快速地训练得到一个可用于处理图像语义分割任务的卷积神经网络。
图16是本申请实施例提供的一种卷积神经网络的训练装置的硬件结构示意图。图16所示的卷积神经网络的训练装置1600(该装置1600具体可以是一种计算机设备)包括存储器1601、处理器1602、通信接口1603以及总线1604。其中,存储器1601、处理器1602、通信接口1603通过总线1604实现彼此之间的通信连接。
存储器1601可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1601可以存储程序,当存储器1601中存储的程序被处理器1602执行时,处理器1602和通信接口1603用于执行本申请实施例的卷积神经网络的训练方法的各个步骤。
处理器1602可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的卷积神经网络的训练装置中的单元所需执行的功能,或者执行本申请方法实施例的卷积神经网络的训练方法。
处理器1602还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的卷积神经网络的训练方法的各个步骤可以通过处理器1602中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1602还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1601,处理器1602读取存储器1601中的信息,结合其硬件完成本申请实施例的卷积神经网络的训练装置中包括的单元所需执行的功能,或者执行本申请方法实施例的卷积神经网络的训练方法。
通信接口1603使用例如但不限于收发器一类的收发装置,来实现装置1600与其他设备或通信网络之间的通信。例如,可以通过通信接口1603获取训练数据(如本申请实施例一所述的训练样本)。
总线1604可包括在装置1600各个部件(例如,存储器1601、处理器1602、通信接口1603)之间传送信息的通路。
应理解,卷积神经网络的训练装置1500中的获取单元1501相当于卷积神经网络的训练装置1600中的通信接口1603,处理单元1502可以相当于处理器1602。
图17是本申请实施例提供的图像处理装置的硬件结构示意图。图17所示的图像处理装置1700(该装置1700具体可以是一种计算机设备)包括存储器1701、处理器1702、通信接口1703以及总线1704。其中,存储器1701、处理器1702、通信接口1703通过总线1704实现彼此之间的通信连接。
存储器1701可以是只读存储器,静态存储设备,动态存储设备或者随机存取存储器。存储器1701可以存储程序,当存储器1701中存储的程序被处理器1702执行时,处理器1702和通信接口1703用于执行本申请实施例的图像分割方法的各个步骤。
处理器1702可以采用通用的中央处理器,微处理器,应用专用集成电路,图形处理器 (graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的图像处理装置1300中的单元所需执行的功能,或者执行本申请方法实施例的图像分割方法。
处理器1702还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的图像分割方法的各个步骤可以通过处理器1702中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1702还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1701,处理器1702读取存储器1701中的信息,结合其硬件完成本申请实施例的图像处理装置中包括的单元所需执行的功能,或者执行本申请方法实施例的图像分割方法。
通信接口1703使用例如但不限于收发器一类的收发装置,来实现装置1700与其他设备或通信网络之间的通信。例如,可以通过通信接口1703获取训练数据(如本申请实施例二所述的输入图像)。
总线1704可包括在装置1700各个部件(例如,存储器1701、处理器1702、通信接口1703)之间传送信息的通路。
应理解,图像处理装置1300中的获取单元1301,相当于图像处理装置1700中的通信接口1703;图像处理装置1300中的处理单元1302可以相当于处理器1702。
应注意,尽管图16和图17所示的装置1600和1700仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置1600和1700还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置1600和1700还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置1600和1700也可仅仅包括实现本申请实施例所必须的器件,而不必包括图16或图17中所示的全部器件。
可以理解,所述装置1600相当于图1中的所述训练设备120,所述装置1700相当于图1中的所述执行设备110。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的 划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (15)

  1. 一种图像分割方法,其特征在于,包括:
    获得输入图像和处理需求;所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组进行目标处理;
    对所述输入图像进行多层次的特征提取,得到多个特征图;
    对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图;所述参考分辨率低于所述输入图像的分辨率;
    对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组;
    利用变换矩阵W对所述特征图组进行上采样,得到所述目标特征图组,所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同;
    根据所述处理需求,对所述目标特征图组进行所述目标处理,得到目标图像。
  2. 根据权利要求1所述的方法,其特征在于,所述利用变换矩阵W对所述特征图组进行上采样,得到目标特征图组包括:
    分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵;所述(H×W)个均包括C个元素的一维矩阵中任一矩阵包括的元素为所述特征图组包括的C个(H×W)的二维矩阵中的每个二维矩阵中同一位置的元素,H和W为所述特征图组的两个维度,C为所述特征图组的通道数;所述变换矩阵为由所述训练数据包括的M个标注图像得到的(C×P)的二维矩阵,P=A×B×N,N为所述M个标注图像中的图像语义被分割的类别数;
    分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组;所述目标特征图组包括的至少一个(A×B×N)的子矩阵为由所述(H×W)个均包括P个元素的一维矩阵中的一个矩阵得到的;其中,H、W、C、N、P、M、A以及B均为大于0的整数。
  3. 根据权利要求2所述的方法,其特征在于,所述分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组包括:
    根据所述(H×W)个均包括P个元素的一维矩阵中的任一矩阵,确定(A×B)个均包括N个元素的一维矩阵;
    将由所述(A×B)个均包括N个元素的一维矩阵得到的一个(A×B×N)的三维矩阵作为所述目标特征图组包括的一个子矩阵。
  4. 根据权利要求2所述的方法,其特征在于,所述M个标注图像中任一标注图像为一个(H×W×N)的三维矩阵,所述变换矩阵W为采用如下操作得到的:
    分别获取所述M个标注图像中的每个标注图像对应的至少一个(A×B×N)的子矩阵以得到多个(A×B×N)的子矩阵;
    由所述多个(A×B×N)的子矩阵得到多个包括P个元素的向量;其中,由所述多个 (A×B×N)的子矩阵中的每一个子矩阵得到一个包括P个元素的向量;
    将所述多个包括P个元素的向量进行主成分分析以得到一个(P×P)的二维矩阵;
    将所述(P×P)的二维矩阵包括的一个(C×P)的子矩阵作为所述变换矩阵W。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述对所述输入图像进行多层次的特征提取,得到多个特征图包括:
    对所述输入图像进行卷积操作得到第一特征图,对第(K-1)特征图进行卷积操作得到第K特征图;所述第K特征图为一个所述参考分辨率的特征图,所述第(K-1)特征图的分辨率不高于所述第K特征图的分辨率,K为大于1的整数,所述多个特征图包括K个特征图;
    所述对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图包括:
    对所述第一特征图进行下采样得到一个所述参考分辨率的特征图,以及对所述第(K-1)特征图进行下采样得到一个所述参考分辨率的特征图。
  6. 根据权利要求5所述的方法,其特征在于,所述对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组包括:
    将所述多个具有参考分辨率的特征图在通道维度上进行拼接以得到所述至少一个特征图组;所述特征图组为一个(H×W×C)的三维矩阵且对应所述C个(H×W)的二维矩阵;
    所述分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵包括:
    分别计算所述特征图组中的每个元素位置对应的一维矩阵与所述变换矩阵的乘积,得到所述(H×W)个均包括P个元素的一维矩阵;所述特征图组中的一个元素位置对应的一维矩阵包括的元素为所述C个(H×W)的二维矩阵中的每个二维矩阵中同一元素位置的元素。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述方法还包括:
    获得所述变换矩阵W;
    使用卷积神经网络对训练样本做处理,得到所述训练样本的图像分割结果;所述训练样本包含于所述训练数据;
    根据所述训练样本的图像分割结果和所述训练样本对应的标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述卷积神经网络处理所述训练样本期望得到的结果;
    利用所述训练样本对应的损失,通过优化算法更新所述卷积神经网络的参数;
    所述对所述输入图像进行多层次的特征提取,得到多个特征图包括:
    将所述输入图像输入到所述卷积神经网络进行多层次的特征提取,得到所述多个特征图。
  8. 一种图像处理装置,其特征在于,包括:
    获取单元,用于获得输入图像和处理需求;所述处理需求用于指示对所述输入图像进行图像分割得到的目标特征图组进行目标处理;
    处理单元,用于对所述输入图像进行多层次的特征提取,得到多个特征图;对所述多个特征图进行下采样,得到多个具有参考分辨率的特征图;所述参考分辨率低于所述输入图像的分辨率;对所述多个具有参考分辨率的特征图进行融合,得到至少一个特征图组;利用变换矩阵W对所述特征图组进行上采样,得到所述目标特征图组,所述目标特征图组和所述输入图像具有相同分辨率;其中,所述变换矩阵W是通过对图像分割任务的训练数据建模得到;所述变换矩阵W的其中一个维度与所述特征组的通道数相同;根据所述处理需求,对所述目标特征图组进行所述目标处理,得到目标图像。
  9. 根据权利要求8所述的装置,其特征在于,所述处理单元,具体用于分别计算(H×W)个均包括C个元素的一维矩阵与所述变换矩阵W的乘积得到(H×W)个均包括P个元素的一维矩阵;所述(H×W)个均包括C个元素的一维矩阵中任一矩阵包括的元素为所述特征图组包括的C个(H×W)的二维矩阵中的每个二维矩阵中同一位置的元素,H和W为所述特征图组的两个维度,C为所述特征图组的通道数;所述变换矩阵为由所述训练数据包括的M个标注图像得到的(C×P)的二维矩阵,P=A×B×N,N为所述M个标注图像中的图像语义被分割的类别数;分别对所述(H×W)个均包括P个元素的一维矩阵进行特征排列以得到所述目标特征图组;所述目标特征图组包括的至少一个(A×B×N)的子矩阵为由所述(H×W)个均包括P个元素的一维矩阵中的一个矩阵得到的;其中,H、W、C、N、P、M、A以及B均为大于0的整数。
  10. 根据权利要求8所述的装置,其特征在于,所述处理单元,具体用于根据所述(H×W)个均包括P个元素的一维矩阵中的任一矩阵,确定(A×B)个均包括N个元素的一维矩阵;将由所述(A×B)个均包括N个元素的一维矩阵得到的一个(A×B×N)的三维矩阵作为所述目标特征图组包括的一个子矩阵。
  11. 根据权利要求9所述的装置,其特征在于,所述M个标注图像中任一标注图像为一个(H×W×N)的三维矩阵;
    所述处理单元,用于分别获取所述M个标注图像中的每个标注图像对应的至少一个(A×B×N)的子矩阵以得到多个(A×B×N)的子矩阵;由所述多个(A×B×N)的子矩阵得到多个包括P个元素的向量;其中,由所述多个(A×B×N)的子矩阵中的每一个子矩阵得到一个包括P个元素的向量;将所述多个包括P个元素的向量进行主成分分析以得到一个(P×P)的二维矩阵;将所述(P×P)的二维矩阵包括的一个(C×P)的子矩阵作为所述变换矩阵W。
  12. 根据权利要求8至11任一项所述的装置,其特征在于,所述处理单元,具体用于对所述输入图像进行卷积操作得到第一特征图,对第(K-1)特征图进行卷积操作得到第K特征图;所述第K特征图为一个所述参考分辨率的特征图,所述第(K-1)特征图的分辨 率不高于所述第K特征图的分辨率,K为大于1的整数,所述多个特征图包括K个特征图;对所述第一特征图进行下采样得到一个所述参考分辨率的特征图,以及对所述第(K-1)特征图进行下采样得到一个所述参考分辨率的特征图。
  13. 根据权利要求12所述的装置,其特征在于,所述处理单元,具体用于将所述多个具有参考分辨率的特征图在通道维度上进行拼接以得到所述至少一个特征图组;所述特征图组为一个(H×W×C)的三维矩阵且对应所述C个(H×W)的二维矩阵;分别计算所述特征图组中的每个元素位置对应的一维矩阵与所述变换矩阵的乘积,得到所述(H×W)个均包括P个元素的一维矩阵;所述特征图组中的一个元素位置对应的一维矩阵包括的元素为所述C个(H×W)的二维矩阵中的每个二维矩阵中同一元素位置的元素。
  14. 根据权利要求8至13任一项所述的装置,其特征在于,所述处理单元,还用于获得所述变换矩阵W;使用卷积神经网络对训练样本做处理,得到所述训练样本的图像分割结果;所述训练样本包含于所述训练数据;根据所述训练样本的图像分割结果和所述训练样本对应的标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述卷积神经网络处理所述训练样本期望得到的结果;利用所述训练样本对应的损失,通过优化算法更新所述卷积神经网络的参数;
    所述处理单元,具体用于将所述输入图像输入到所述卷积神经网络进行多层次的特征提取,得到所述多个特征图。
  15. 一种计算机可读存储介质,其特征在于,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如权利要求1-7任一项所述的方法。
PCT/CN2020/077366 2019-03-01 2020-03-01 图像分割方法和图像处理装置 WO2020177651A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/383,181 US12008797B2 (en) 2019-03-01 2021-07-22 Image segmentation method and image processing apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910157603.5A CN110033003B (zh) 2019-03-01 2019-03-01 图像分割方法和图像处理装置
CN201910157603.5 2019-03-01

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/383,181 Continuation US12008797B2 (en) 2019-03-01 2021-07-22 Image segmentation method and image processing apparatus

Publications (1)

Publication Number Publication Date
WO2020177651A1 true WO2020177651A1 (zh) 2020-09-10

Family

ID=67235047

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077366 WO2020177651A1 (zh) 2019-03-01 2020-03-01 图像分割方法和图像处理装置

Country Status (3)

Country Link
US (1) US12008797B2 (zh)
CN (1) CN110033003B (zh)
WO (1) WO2020177651A1 (zh)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116620A (zh) * 2020-09-16 2020-12-22 北京交通大学 一种室内图像语义分割与涂装展示的方法
CN112634282A (zh) * 2020-12-18 2021-04-09 北京百度网讯科技有限公司 图像处理方法、装置以及电子设备
CN112949651A (zh) * 2021-01-29 2021-06-11 Oppo广东移动通信有限公司 特征提取方法、装置、存储介质及电子设备
CN113159159A (zh) * 2021-04-15 2021-07-23 东北大学 一种基于改进cnn的小样本图像分类方法
CN113240611A (zh) * 2021-05-28 2021-08-10 中建材信息技术股份有限公司 一种基于图片序列的异物检测方法
CN113298709A (zh) * 2021-04-06 2021-08-24 广东省科学院智能制造研究所 一种基于几何变换原理的图像视角变换方法
CN113408571A (zh) * 2021-05-08 2021-09-17 浙江智慧视频安防创新中心有限公司 一种基于模型蒸馏的图像分类方法、装置、存储介质及终端
CN113793345A (zh) * 2021-09-07 2021-12-14 复旦大学附属华山医院 一种基于改进注意力模块的医疗影像分割方法及装置
CN113887542A (zh) * 2021-12-06 2022-01-04 深圳小木科技有限公司 目标检测方法、电子设备及存储介质
CN114022960A (zh) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质
CN114387357A (zh) * 2020-10-16 2022-04-22 北京迈格威科技有限公司 图像处理方法、装置、电子设备及存储介质

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866526A (zh) * 2018-08-28 2020-03-06 北京三星通信技术研究有限公司 图像分割方法、电子设备及计算机可读存储介质
US20200137380A1 (en) * 2018-10-31 2020-04-30 Intel Corporation Multi-plane display image synthesis mechanism
CN110033003B (zh) * 2019-03-01 2023-12-15 华为技术有限公司 图像分割方法和图像处理装置
CN110348537B (zh) * 2019-07-18 2022-11-29 北京市商汤科技开发有限公司 图像处理方法及装置、电子设备和存储介质
CN110472670B (zh) * 2019-07-24 2022-03-01 上海联影智能医疗科技有限公司 图像中线检测方法、计算机设备和存储介质
CN110415258B (zh) * 2019-07-29 2022-04-29 深圳市商汤科技有限公司 图像处理方法及装置、电子设备和存储介质
CN110443322A (zh) * 2019-08-16 2019-11-12 北京知道创宇信息技术股份有限公司 图像处理方法、装置、服务器及可读存储介质
CN111062964B (zh) * 2019-11-28 2023-07-14 深圳市华尊科技股份有限公司 图像分割方法及相关装置
CN112950462B (zh) * 2019-12-11 2024-03-08 北京金山云网络技术有限公司 一种图像处理方法、装置、电子设备及存储介质
US10902551B1 (en) * 2019-12-17 2021-01-26 X Development Llc True positive transplant
CN111210439B (zh) * 2019-12-26 2022-06-24 中国地质大学(武汉) 通过抑制非感兴趣信息的语义分割方法、设备及存储设备
US20210279565A1 (en) * 2020-03-04 2021-09-09 WootCloud Inc. Systems And Methods For Device Fingerprinting
CN113395441A (zh) * 2020-03-13 2021-09-14 华为技术有限公司 图像留色方法及设备
CN111402166A (zh) * 2020-03-18 2020-07-10 上海嘉沃光电科技有限公司 图像去噪方法及装置、服务终端及计算机可读存储介质
CN113496507A (zh) * 2020-03-20 2021-10-12 华为技术有限公司 一种人体三维模型重建方法
US11348336B2 (en) * 2020-05-13 2022-05-31 International Business Machines Corporation Systems and approaches for learning efficient representations for video understanding
CN113674146A (zh) * 2020-05-15 2021-11-19 微软技术许可有限责任公司 图像超分辨率
CN111652129A (zh) * 2020-06-02 2020-09-11 北京联合大学 一种基于语义分割和多特征融合的车辆前障碍物检测方法
CN111832568B (zh) * 2020-06-12 2024-01-12 北京百度网讯科技有限公司 车牌识别方法、车牌识别模型的训练方法和装置
CN111932563B (zh) * 2020-09-23 2021-07-06 平安科技(深圳)有限公司 图片区域分割方法、装置、电子设备及存储介质
CN112651364B (zh) * 2020-12-31 2023-06-20 北京市商汤科技开发有限公司 图像处理方法、装置、电子设备及存储介质
CN112954454B (zh) * 2021-02-08 2023-09-05 北京奇艺世纪科技有限公司 一种视频帧生成方法及装置
CN113065575A (zh) * 2021-02-27 2021-07-02 华为技术有限公司 一种图像处理方法及相关装置
CN115700771A (zh) * 2021-07-31 2023-02-07 华为技术有限公司 编解码方法及装置
CN113601306B (zh) * 2021-08-04 2022-07-08 上海电器科学研究所(集团)有限公司 基于一维分割网络的充电设施箱体焊缝打磨方法
CN113569873B (zh) * 2021-08-19 2024-03-29 支付宝(杭州)信息技术有限公司 一种图像的处理方法、装置及设备
CN114004973B (zh) * 2021-12-30 2022-12-27 深圳比特微电子科技有限公司 用于图像语义分割的解码器及其实现方法
CN114596620B (zh) * 2022-05-10 2022-08-05 深圳市海清视讯科技有限公司 人脸识别设备补光控制方法、装置、设备及存储介质
CN114677567B (zh) * 2022-05-27 2022-10-14 成都数联云算科技有限公司 模型训练方法、装置、存储介质及电子设备
CN114693830B (zh) * 2022-05-27 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 医学影像的多器官分割、模型训练方法、设备及介质
US20230394762A1 (en) * 2022-06-01 2023-12-07 Rovi Guides, Inc. Systems and methods for neural-network based video encoding
WO2024065536A1 (en) * 2022-09-29 2024-04-04 Intel Corporation Methods and apparatus for image segmentation on small datasets
CN115760986B (zh) * 2022-11-30 2023-07-25 北京中环高科环境治理有限公司 基于神经网络模型的图像处理方法及装置
CN116206114B (zh) * 2023-04-28 2023-08-01 成都云栈科技有限公司 一种复杂背景下人像提取方法及装置
CN117476509B (zh) * 2023-12-27 2024-03-19 联合富士半导体有限公司 一种用于半导体芯片产品的激光雕刻装置及控制方法
CN117788850B (zh) * 2024-02-21 2024-05-10 深圳欧税通技术有限公司 一种商标相似度评估方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794504A (zh) * 2015-04-28 2015-07-22 浙江大学 基于深度学习的图形图案文字检测方法
CN108876793A (zh) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 语义分割方法、装置和系统及存储介质
CN109034162A (zh) * 2018-07-13 2018-12-18 南京邮电大学 一种图像语义分割方法
CN110033003A (zh) * 2019-03-01 2019-07-19 华为技术有限公司 图像分割方法和图像处理装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120191753A1 (en) * 2011-01-20 2012-07-26 John Nicholas Gross System & Method For Assessing & Responding to Intellectual Property Rights Proceedings/Challenges
CN106651877B (zh) 2016-12-20 2020-06-02 北京旷视科技有限公司 实例分割方法及装置
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN108010031B (zh) 2017-12-15 2020-12-04 厦门美图之家科技有限公司 一种人像分割方法及移动终端
CN112967218B (zh) * 2021-03-15 2022-03-18 复旦大学 一种基于线框和边缘结构的多尺度图像修复系统

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116620B (zh) * 2020-09-16 2023-09-22 北京交通大学 一种室内图像语义分割与涂装展示的方法
CN112116620A (zh) * 2020-09-16 2020-12-22 北京交通大学 一种室内图像语义分割与涂装展示的方法
CN114387357A (zh) * 2020-10-16 2022-04-22 北京迈格威科技有限公司 图像处理方法、装置、电子设备及存储介质
CN112634282B (zh) * 2020-12-18 2024-02-13 北京百度网讯科技有限公司 图像处理方法、装置以及电子设备
CN112634282A (zh) * 2020-12-18 2021-04-09 北京百度网讯科技有限公司 图像处理方法、装置以及电子设备
CN112949651A (zh) * 2021-01-29 2021-06-11 Oppo广东移动通信有限公司 特征提取方法、装置、存储介质及电子设备
CN113298709A (zh) * 2021-04-06 2021-08-24 广东省科学院智能制造研究所 一种基于几何变换原理的图像视角变换方法
CN113159159A (zh) * 2021-04-15 2021-07-23 东北大学 一种基于改进cnn的小样本图像分类方法
CN113159159B (zh) * 2021-04-15 2023-09-29 东北大学 一种基于改进cnn的小样本图像分类方法
CN113408571A (zh) * 2021-05-08 2021-09-17 浙江智慧视频安防创新中心有限公司 一种基于模型蒸馏的图像分类方法、装置、存储介质及终端
CN113408571B (zh) * 2021-05-08 2022-07-19 浙江智慧视频安防创新中心有限公司 一种基于模型蒸馏的图像分类方法、装置、存储介质及终端
CN113240611B (zh) * 2021-05-28 2024-05-07 中建材信息技术股份有限公司 一种基于图片序列的异物检测方法
CN113240611A (zh) * 2021-05-28 2021-08-10 中建材信息技术股份有限公司 一种基于图片序列的异物检测方法
CN113793345A (zh) * 2021-09-07 2021-12-14 复旦大学附属华山医院 一种基于改进注意力模块的医疗影像分割方法及装置
CN113793345B (zh) * 2021-09-07 2023-10-31 复旦大学附属华山医院 一种基于改进注意力模块的医疗影像分割方法及装置
CN113887542A (zh) * 2021-12-06 2022-01-04 深圳小木科技有限公司 目标检测方法、电子设备及存储介质
CN113887542B (zh) * 2021-12-06 2022-04-05 孙晖 目标检测方法、电子设备及存储介质
CN114022960B (zh) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质
CN114022960A (zh) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质

Also Published As

Publication number Publication date
US12008797B2 (en) 2024-06-11
CN110033003B (zh) 2023-12-15
US20210350168A1 (en) 2021-11-11
CN110033003A (zh) 2019-07-19

Similar Documents

Publication Publication Date Title
WO2020177651A1 (zh) 图像分割方法和图像处理装置
CN110188795B (zh) 图像分类方法、数据处理方法和装置
US20210398252A1 (en) Image denoising method and apparatus
CN111914997B (zh) 训练神经网络的方法、图像处理方法及装置
CN111402130B (zh) 数据处理方法和数据处理装置
WO2021018163A1 (zh) 神经网络的搜索方法及装置
CN110473137B (zh) 图像处理方法和装置
CN112446270B (zh) 行人再识别网络的训练方法、行人再识别方法和装置
WO2021043273A1 (zh) 图像增强方法和装置
WO2020186703A1 (en) Convolutional neural network-based image processing method and image processing apparatus
CN112446380A (zh) 图像处理方法和装置
WO2021018106A1 (zh) 行人检测方法、装置、计算机可读存储介质和芯片
US12039440B2 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN111797881B (zh) 图像分类方法及装置
CN111695673B (zh) 训练神经网络预测器的方法、图像处理方法及装置
WO2022179606A1 (zh) 一种图像处理方法及相关装置
CN112529904A (zh) 图像语义分割方法、装置、计算机可读存储介质和芯片
US20220157041A1 (en) Image classification method and apparatus
CN113284055A (zh) 一种图像处理的方法以及装置
CN111833363B (zh) 图像边缘和显著性检测方法及装置
CN114693986A (zh) 主动学习模型的训练方法、图像处理方法及装置
CN113011562B (zh) 一种模型训练方法及装置
CN111797882B (zh) 图像分类方法及装置
CN117975029A (zh) 图像处理方法、模型训练方法及相关产品
CN113011562A (zh) 一种模型训练方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20765626

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20765626

Country of ref document: EP

Kind code of ref document: A1