CN114648762A - Semantic segmentation method and device, electronic equipment and computer-readable storage medium - Google Patents

Semantic segmentation method and device, electronic equipment and computer-readable storage medium

Info

Publication number
CN114648762A
Authority
CN
China
Prior art keywords
target
image
training
area
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210272072.6A
Other languages
Chinese (zh)
Inventor
聂聪冲 (Nie Congchong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210272072.6A
Publication of CN114648762A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose a semantic segmentation method and apparatus, an electronic device, and a computer-readable storage medium. An image to be segmented is acquired and divided to obtain initial region images corresponding to the image. Target pixel context information within each initial region image is determined, and a target region image corresponding to each initial region image is determined according to the target pixel context information and the initial region image. Target region context information between the target region images is then determined, and the target pixel features of the image to be segmented are determined according to the target region images and the target region context information. Finally, the image to be segmented is segmented according to the target pixel features to obtain a segmentation result. Embodiments of the present application can reduce computational complexity and can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and autonomous driving.

Description

Semantic segmentation method and device, electronic equipment and computer-readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a semantic segmentation method, apparatus, electronic device, and computer-readable storage medium.
Background
Semantic segmentation classifies each pixel in an image and groups pixels of the same class into the same region, thereby segmenting the image.
To improve segmentation precision, an attention mechanism is commonly added to semantic segmentation to compute context information over the image, but this approach has high computational complexity.
Disclosure of Invention
Embodiments of the present application provide a semantic segmentation method, an apparatus, an electronic device, and a computer-readable storage medium, which can solve the technical problem of high computational complexity in semantic segmentation.
A method of semantic segmentation, comprising:
acquiring an image to be segmented, and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
determining context information of target pixels in the initial area image, and determining a target area image corresponding to the initial area image according to the context information of the target pixels and the initial area image;
determining target area context information between the target area images, and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information;
and segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
Accordingly, an embodiment of the present application provides a semantic segmentation apparatus, including:
the image acquisition module is used for acquiring an image to be segmented and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
a first determining module, configured to determine context information of a target pixel in the initial area image, and determine a target area image corresponding to the initial area image according to the context information of the target pixel and the initial area image;
a second determining module, configured to determine context information of a target area between the target area images, and determine a target pixel feature of the image to be segmented according to the target area image and the context information of the target area;
and the image segmentation module is used for segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
Optionally, the image acquisition module is specifically configured to perform:
extracting the features of the image to be segmented to obtain a feature image;
acquiring a preset grid, and performing displacement prediction on the vertex of the preset grid according to the characteristic image and the preset grid to obtain a target deformation grid;
and dividing the characteristic image according to the target deformation grid to obtain an initial area image corresponding to the image to be segmented.
Optionally, the image acquisition module is specifically configured to perform:
determining the vertex coordinates in the preset grid;
according to the vertex coordinates, performing vertex feature search in the feature image to obtain initial features corresponding to the vertex coordinates;
determining a target feature corresponding to the vertex coordinate according to the initial feature and the context information of the vertex coordinate;
and predicting the target displacement of the vertex coordinates according to the target characteristics, and moving the vertex coordinates according to the target displacement to obtain a target deformation grid.
Optionally, the second determining module is specifically configured to perform:
determining initial area characteristics corresponding to the target area image according to the average value of the pixels of the target area image;
determining target area context information between the target area images according to the initial area characteristics;
determining the target area characteristics according to the initial area characteristics and the target area context information;
and mapping the target region characteristics to the target deformation grid to obtain the target pixel characteristics of the image to be segmented.
Optionally, the semantic segmentation apparatus further includes:
a training module to:
acquiring a training sample set, and determining a target weight corresponding to the category of a training pixel according to the number of the training pixels in a training image of the training sample set;
dividing the training image to obtain a training area image;
determining initial pixel context information in the training area image through a pixel context layer in a neural network model to be trained, and determining a first area image corresponding to the training image according to the initial pixel context information and the training area image;
determining initial region context information between the first region images through a region context layer in the neural network model to be trained, and determining initial pixel characteristics of the training images according to the first region images and the initial region context information;
determining the target class of the training pixel corresponding to the initial pixel characteristic through a segmentation layer in the neural network model to be trained;
determining a target loss value according to the target type, the label of the training pixel and the target weight;
acquiring the training times of the neural network model to be trained;
and training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model.
Optionally, the training module is specifically configured to perform:
extracting the features of the training images through a feature extraction layer in the neural network model to be trained to obtain training feature images;
through the deformation network layer in the neural network model to be trained, performing displacement prediction on the vertexes of the preset grids according to the training characteristic image and the preset grids to obtain initial deformation grids;
dividing the training characteristic image according to the initial deformation grid through the pixel context layer in the neural network model to be trained to obtain a training area image;
determining a first target loss value according to the target type, the label of the training pixel and the target weight;
determining a second target loss value according to the initial pixel characteristics and the average value of the initial pixel characteristics;
determining a target loss value according to the first target loss value and the second target loss value;
if the target loss value does not meet a preset condition and/or the training times of the neural network model to be trained are smaller than a preset threshold, adding 1 to the training times, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer, and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value, and returning to the step of extracting features of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image;
and if the target loss value meets a preset condition and/or the training times of the neural network model to be trained are equal to a preset threshold value, stopping training to obtain the trained neural network model.
Optionally, the training module is specifically configured to perform:
determining the sub-area of each grid in the initial deformation grid and the total area of the image to be segmented;
determining a third target loss value according to the sub-area and the total area;
determining a target loss value based on the first target loss value, the second target loss value, and the third target loss value;
and if the target loss value does not meet a preset condition and/or the training times of the neural network model to be trained are smaller than a preset threshold, adding 1 to the training times, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer, and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value and the third target loss value, and returning to the step of extracting features of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image.
In addition, an electronic device is further provided in an embodiment of the present application, and includes a processor and a memory, where the memory stores a computer program, and the processor is configured to run the computer program in the memory to implement the semantic segmentation method provided in the embodiment of the present application.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program is suitable for being loaded by a processor to perform any one of the semantic segmentation methods provided in the embodiment of the present application.
In addition, the present application also provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements any one of the semantic segmentation methods provided in the present application.
In the embodiment of the application, an image to be segmented is obtained first, and the image to be segmented is divided to obtain initial region images corresponding to the image to be segmented. Target pixel context information within each initial region image is then determined, and a target region image corresponding to each initial region image is determined according to the target pixel context information and the initial region image. Target region context information between the target region images is then determined, and the target pixel features of the image to be segmented are determined according to the target region images and the target region context information. Finally, the image to be segmented is segmented according to the target pixel features to obtain a segmentation result.
In other words, in the embodiment of the present application, target pixel context information within each initial region image is first determined, and a target region image corresponding to each initial region image is determined according to that information and the initial region image. Target region context information between the target region images is then determined, and the target pixel features of the image to be segmented are determined according to the target region images and the target region context information. Because context is computed between target region images, there is no need to compute context information between each pixel of a target region image and all pixels of the other target region images, that is, between each pixel in the image to be segmented and every other pixel in it, thereby reducing computational complexity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a scene schematic diagram of a semantic segmentation process provided by an embodiment of the present application;
FIG. 2 is a flow chart of a semantic segmentation method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of an initial region image provided by an embodiment of the present application;
FIG. 4 is a diagram of a preset mesh and a target deformed mesh provided in an embodiment of the present application;
FIG. 5 is a flowchart illustrating a training method of a neural network model to be trained according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a neural network model to be trained according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram illustrating a method for applying a trained neural network model provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a segmentation result of a trained neural network model provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a semantic segmentation apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a semantic segmentation method, a semantic segmentation device, electronic equipment and a computer-readable storage medium. The semantic segmentation apparatus may be integrated in an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
A plurality of servers may also form a blockchain, with each server serving as a node on the blockchain.
The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, and the like, but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
For example, as shown in fig. 1, the terminal may obtain an image to be segmented, and divide the image to be segmented to obtain an initial region image corresponding to the image to be segmented; determining target pixel context information in the initial area image, and determining a target area image corresponding to the initial area image according to the target pixel context information and the initial area image; determining target area context information between target area images, and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information; and segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
In addition, "a plurality" in the embodiments of the present application means two or more. "first" and "second" and the like in the embodiments of the present application are used for distinguishing the description, and are not to be construed as implying relative importance.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see": using cameras and computers in place of human eyes to identify and measure targets, and further processing images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of capturing information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, autonomous driving, and remote sensing. Autonomous driving typically involves technologies such as high-precision maps, environment perception, behavior decision-making, path planning, and motion control.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a semantic segmentation method according to an embodiment of the present disclosure. The semantic segmentation method can comprise the following steps:
s201, obtaining an image to be segmented, and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented.
The image to be segmented refers to an image containing a region of interest of the user. For example, the image a includes a vehicle, a tree, and a sky, and if the user wants to obtain the vehicle image in the image a, the image a is an image to be segmented.
The image to be segmented may be an image of an automatic driving scene, or may be a remote sensing image, and the type of the image to be segmented is not limited herein.
The terminal may capture the image to be segmented by using a camera thereof when receiving the acquisition instruction. Or, the terminal may locally acquire the image to be segmented when receiving the acquisition instruction.
Or, when receiving the acquisition instruction, the terminal may forward the acquisition instruction to another terminal, and the other terminal acquires the image to be segmented locally based on the acquisition instruction, or performs shooting based on the acquisition instruction, thereby acquiring the image to be segmented. And then other terminals send the image to be segmented to the terminal, and the terminal acquires the image to be segmented.
For the way of acquiring the image to be segmented by the terminal, the user may select the way according to the actual situation, which is not limited in this embodiment.
After the terminal acquires the image to be segmented, the image to be segmented can be divided based on the preset grid, and an initial area image corresponding to the image to be segmented is obtained.
Division means partitioning the image into individual region images, which nevertheless retain connection relationships with one another. For example, for the image to be segmented shown in fig. 3, the initial region images may be as shown in fig. 3; a target refers to an object in the image to be segmented, such as a vehicle, a person, or a tree.
Because the position of the target differs from one image to be segmented to another, directly dividing the image to be segmented based on the preset mesh would yield an inaccurate initial region image, that is, the edges of the preset mesh would not match the boundaries of the targets in the image to be segmented. Therefore, in order to improve the accuracy of the initial region image, in some embodiments, dividing the image to be segmented to obtain the initial region image corresponding to the image to be segmented includes:
performing feature extraction on an image to be segmented to obtain a feature image;
acquiring a preset grid, and performing displacement prediction on the vertex of the preset grid according to the characteristic image and the preset grid to obtain a target deformation grid;
and dividing the characteristic image according to the target deformation grid to obtain an initial area image corresponding to the image to be segmented.
In this embodiment, a mesh may be initialized to obtain the preset mesh, following three principles:
In the first aspect, the topological structure of the mesh remains unchanged before and after deformation, i.e., the preset mesh and the target deformation mesh share the same topology.
In the second aspect, the edges of the preset mesh are flexible and diverse, so that the target deformation mesh can be obtained from the preset mesh with as little vertex displacement as possible.
In the third aspect, the number of cells in the preset mesh is constant, which facilitates subsequent structured batch processing of images.
The mesh is initialized according to the three aspects described above, and the obtained preset mesh may be as shown in fig. 4.
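To make the three principles concrete, the following is a minimal sketch, not part of the patent text, of how a uniform preset mesh with a constant number of cells could be initialized; PyTorch and the normalized-coordinate convention are assumptions.

```python
import torch

def make_preset_mesh(rows: int, cols: int) -> torch.Tensor:
    """Create a uniform preset mesh of (rows+1) x (cols+1) vertex
    coordinates, normalized to [0, 1] over the image plane."""
    ys = torch.linspace(0.0, 1.0, rows + 1)
    xs = torch.linspace(0.0, 1.0, cols + 1)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    # Shape: (rows+1, cols+1, 2); each entry is an (x, y) vertex.
    return torch.stack([grid_x, grid_y], dim=-1)

mesh = make_preset_mesh(8, 8)  # 8 x 8 cells, hence a constant 81 vertices
```

Because the vertex count is fixed, batches of such meshes stack into a single tensor, which is what the third principle is after.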
After the preset mesh is obtained, the terminal can displace its vertices according to the feature image to obtain the target deformation mesh, so that the edges of the target deformation mesh match the boundaries of the targets in the feature image. In other words, the same target in the feature image falls within the same cell of the target deformation mesh, so that each cell of the target deformation mesh is homogeneous in color (same color) or in semantics (same category).
For example, after the vertices of the preset mesh in fig. 4 are displaced, the obtained target deformation mesh may be as shown in fig. 4.
And after the terminal obtains the target deformation grid, dividing the characteristic image according to the target deformation grid to obtain an initial area image corresponding to the image to be segmented.
In this embodiment, after the feature of the image to be segmented is extracted to obtain the feature image, the vertex of the preset mesh is subjected to displacement prediction according to the feature image and the preset mesh to obtain a target deformation mesh, and then the feature image is divided according to the target deformation mesh to obtain an initial region image corresponding to the image to be segmented, so that the mesh edge of the target deformation mesh is matched with the boundary of the target in the image.
In some embodiments, performing displacement prediction on vertices of a preset mesh according to the feature image and the preset mesh to obtain a target deformation mesh, includes:
determining vertex coordinates in a preset grid;
according to the vertex coordinates, performing vertex feature search in the feature image to obtain initial features corresponding to the vertex coordinates;
determining target characteristics corresponding to the vertex coordinates according to the initial characteristics and the context information of the vertex coordinates;
and predicting the target displacement of the vertex coordinates according to the target characteristics, and moving the vertex coordinates according to the target displacement to obtain the target deformation mesh.
In object detection, the positional offset of the center point, top-left vertex, or bottom-right vertex of a detection box is predicted by a fully connected or convolutional layer, and the vertices or center points of different detection boxes in the same image do not interfere with each other. Semantic segmentation, by contrast, is a dense prediction task: the vertices of the mesh are not isolated from each other but are connected by mesh edges, forming a set of points with context dependencies. Therefore, in this embodiment, after the initial feature corresponding to a vertex is obtained, the target feature corresponding to the vertex coordinates is determined from the initial feature and the vertex's context information, so that the target displacement, and thus the target deformation mesh, can be obtained more accurately.
Determining a target feature corresponding to the vertex coordinates according to the initial feature and the context information of the vertex, including:
and performing point-by-point feature transformation on the initial features to obtain intermediate features, and then determining target features according to the intermediate features and the context information of the vertexes.
It should be noted that the vertex context information may be obtained through a self-attention layer within the deformation network layer of the trained neural network; the point-by-point transformation of the initial features may be implemented by pointwise convolution layers within that deformation network layer; and the target feature corresponding to the vertex coordinates may then be determined from the initial feature and the vertex context information through a prediction convolution layer of the deformation network layer.
Moreover, the deformation network layer may include multiple pointwise convolution layers and self-attention layers, so that the initial features undergo several feature transformations and the vertex context information is aggregated several times. For example, the deformation network layer may include six pointwise convolution layers and a self-attention layer.
In order to obtain the intermediate features more accurately, after the initial features are obtained, the vertex coordinates and the initial features may be fused (for example, concatenated) to obtain candidate features, and the candidate features may then be transformed point by point to obtain the intermediate features.
The vertex coordinates can be fused with the initial features through a CoordConv layer of the deformation network layer in the trained neural network.
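As an illustrative sketch of the deformation network layer described above, assuming PyTorch, the following module combines vertex feature lookup, CoordConv-style coordinate fusion, self-attention among vertices, and displacement prediction; the dimensions and layer counts are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformationLayer(nn.Module):
    """Hypothetical sketch: look up vertex features from the feature
    image, fuse coordinates (CoordConv-style), aggregate vertex context
    with self-attention, then predict per-vertex displacements."""

    def __init__(self, feat_dim: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 2, hidden)   # +2 for (x, y)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.pred = nn.Linear(hidden, 2)              # (dx, dy) per vertex

    def forward(self, feat: torch.Tensor, verts: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); verts: (B, V, 2) with coords in [0, 1].
        sample = verts.unsqueeze(2) * 2.0 - 1.0       # grid_sample wants [-1, 1]
        init = F.grid_sample(feat, sample, align_corners=True)  # (B, C, V, 1)
        init = init.squeeze(-1).transpose(1, 2)       # initial features (B, V, C)
        x = self.proj(torch.cat([init, verts], dim=-1))  # CoordConv-style fusion
        ctx, _ = self.attn(x, x, x)                   # context among vertices
        delta = self.pred(ctx)                        # predicted target displacement
        return verts + delta                          # deformed vertex coordinates
```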
In other embodiments, extracting features of an image to be segmented to obtain a feature image includes: and performing feature extraction on the image to be segmented through a feature extraction layer in the trained neural network model to obtain a feature image.
The feature extraction layer in the trained neural network model may be a residual neural network (ResNet) or a convolutional network. Optionally, when the feature extraction layer is a residual neural network, part of the pooling layers in the residual neural network may be replaced with dilated convolution layers in order to reduce the loss of resolution while expanding the receptive field; in that case, the feature extraction layer is a residual neural network containing dilated convolution layers.
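For illustration only, torchvision's residual networks support exactly this kind of stride-to-dilation substitution; whether the patent's feature extraction layer is configured this way is an assumption.

```python
import torchvision

# Replace the strided downsampling of the last two stages with dilated
# convolutions, keeping an output stride of 8 instead of 32, so that
# resolution loss is reduced while the receptive field still grows.
backbone = torchvision.models.resnet50(
    weights=None,
    replace_stride_with_dilation=[False, True, True],
)
```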
S202, determining context information of target pixels in the initial area image, and determining a target area image corresponding to the initial area image according to the context information of the target pixels and the initial area image.
After obtaining the initial area image, the terminal acquires context information of target pixels among pixels in the initial area image, and then determines a target area image corresponding to the initial area image according to the context information of the target pixels and the initial area image.
For example, if there are pixel 1, pixel 2, and pixel 3 in the initial region image a, the target pixel context information between pixel 1 and pixel 2 is calculated, the target pixel context information between pixel 1 and pixel 3 is calculated, and the target pixel context information between pixel 2 and pixel 3 is calculated.
The target pixel context information in the initial region image can be determined through a pixel context layer in the trained neural network model, and the target region image corresponding to the initial region image is determined according to the target pixel context information and the initial region image, wherein the pixel context layer can be a self-attention mechanism layer.
Or, the context information of the target pixel in the initial area image can be determined through an Euclidean distance algorithm or a cosine similarity algorithm, and the target area image corresponding to the initial area image is determined according to the context information of the target pixel and the initial area image.
Context information refers to interaction information between different objects. The pixel context information refers to interaction information between different pixels in the same image.
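A minimal sketch of such a pixel context layer is given below, assuming PyTorch and a per-region list of pixel-feature tensors; the residual connection is an assumption.

```python
import torch
import torch.nn as nn

class PixelContextLayer(nn.Module):
    """Sketch: self-attention computed independently inside each region,
    so pixels only attend to pixels of the same initial region image."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: list[torch.Tensor]) -> list[torch.Tensor]:
        out = []
        for r in regions:            # r: (P, C) pixel features of one region
            r = r.unsqueeze(0)       # (1, P, C) as one attention batch
            ctx, _ = self.attn(r, r, r)
            out.append((r + ctx).squeeze(0))   # target region image features
        return out
```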
S203, determining the context information of the target area between the target area images, and determining the target pixel characteristics of the image to be segmented according to the target area images and the context information of the target area.
In the related art, by means of the self-attention mechanism layer, not only the context information of the target pixels among the pixels in the initial area image but also the context information among the pixels in different initial area images need to be calculated, which results in higher calculation complexity.
For example, if there are pixels 1 and 2 in the initial area image a and pixels 4 and 5 in the initial area image b, then through the attention mechanism layer, not only the target pixel context information between pixels 1 and 2 and the target pixel context information between pixels 4 and 5 are calculated, but also the target pixel context information between pixels 1 and 4, the target pixel context information between pixels 1 and 5, the target pixel context information between pixels 2 and 4, and the target pixel context information between pixels 2 and 5 are calculated.
In this embodiment, after the target pixel context between pixels within each initial region image is obtained, the target region context information between target region images is calculated, and the target pixel features of the image to be segmented are then determined from the target region images and that context information. Pixel-level context between pixels of different target region images never needs to be computed, which reduces computational complexity and redundancy while maintaining high semantic segmentation accuracy.
For example, if there are pixels 1 and 2 in the initial region image a, pixels 4 and 5 in the initial region image b, pixels 6 and 7 in the target region image a1, and pixels 8 and 9 in the target region image b1, in the present embodiment, the region context information between the target region image a1 and the target region image b1 is calculated, so that there is no need to calculate the target pixel context information between the pixels 1 and 4, the target pixel context information between the pixels 1 and 5, the target pixel context information between the pixels 2 and 4, and the target pixel context information between the pixels 2 and 5.
The context information of the target area between the images of the target area can be determined through the context layer of the area in the trained neural network model, and the target pixel characteristics of the image to be segmented are determined according to the image of the target area and the context information of the target area.
Or, the context information of the target areas between the target area images can be determined through an Euclidean distance algorithm or a cosine similarity algorithm, and the target pixel characteristics of the image to be segmented are determined according to the target area images and the context information of the target areas.
In some embodiments, determining target area context information between target area images, and determining target pixel characteristics of an image to be segmented according to the target area images and the target area context information includes:
determining initial area characteristics corresponding to the target area image according to the mean value of the pixels of the target area image;
determining target area context information between the target area images according to the initial area characteristics;
determining the target area characteristics according to the initial area characteristics and the target area context information;
and mapping the target region characteristics to a target deformation grid to obtain the target pixel characteristics of the image to be segmented.
In this embodiment, the mean value of the pixels of the target area image is used as the initial area feature corresponding to the target area image, and then the target area context information between the target area images is calculated according to the initial area feature corresponding to the target area image, so that the target area context information between two target area images only needs to be calculated once, and does not need to be calculated for many times, thereby greatly reducing the calculation complexity.
When the target region context information between the target region images is determined through the region context layer in the trained neural network model, the initial region features corresponding to each target region image may be determined by a region pooling layer within the region context layer, using the mean of the pixels of the target region image. A region-level self-attention layer within the region context layer may then determine the target region context information from the initial region features, and the target region features may be determined from the initial region features and the target region context information. Finally, a region unpooling layer within the region context layer may map the target region features back onto the target deformation mesh to obtain the target pixel features of the image to be segmented.
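The following sketch, under the same PyTorch assumptions as above, illustrates region pooling, region-level self-attention, and region unpooling; the residual connections are assumptions.

```python
import torch
import torch.nn as nn

class RegionContextLayer(nn.Module):
    """Sketch of the region context layer: pool each target region image
    to its pixel mean, run region-level self-attention, then broadcast
    (unpool) the resulting region features back to their pixels."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: list[torch.Tensor]) -> list[torch.Tensor]:
        # Region pooling: one feature per region (mean over its pixels).
        pooled = torch.stack([r.mean(dim=0) for r in regions]).unsqueeze(0)
        # Region-level self-attention: context between N regions only,
        # i.e. N x N interactions instead of pixel-by-pixel interactions.
        ctx, _ = self.attn(pooled, pooled, pooled)
        target = (pooled + ctx).squeeze(0)           # target region features
        # Region unpooling: map each region feature back onto its pixels.
        return [r + target[j].expand_as(r) for j, r in enumerate(regions)]
```

Because the attention here runs over N region features rather than all pixels, the cost of cross-region context drops from quadratic in the pixel count to quadratic in the region count, which is the complexity reduction the embodiment describes.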
And S204, segmenting the image to be segmented according to the target pixel characteristics to obtain segmentation results.
And after the terminal obtains the target pixel characteristics, segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
In some embodiments, segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result, including: and segmenting the image to be segmented according to the target pixel characteristics through the segmentation layer in the trained neural network model to obtain a segmentation result.
Optionally, since the target region image contains detail information while the feature image retains the full information of the image, the segmentation layer in the trained neural network model may segment the image to be segmented according to the feature image, the target region image, and the target pixel features together, so as to obtain a more accurate segmentation result.
In other embodiments, when the target pixel context information and the target region images are determined through the pixel context layer in the trained neural network model, the target region context information and the target pixel features are determined through the region context layer in the trained neural network model, and the segmentation result is determined through the segmentation layer in the trained neural network model, the method further includes:
acquiring a training sample set, and determining target weights corresponding to the categories of training pixels according to the number of the training pixels in training images of the training sample set;
dividing the training image to obtain a training area image;
determining initial pixel context information in a training area image through a pixel context layer in a neural network model to be trained, and determining a first area image corresponding to the training image according to the initial pixel context information and the training area image;
determining initial region context information between first region images through a region context layer in a neural network model to be trained, and determining initial pixel characteristics of a training image according to the first region images and the initial region context information;
determining a target class of a training pixel corresponding to the initial pixel characteristic through a segmentation layer in a neural network model to be trained;
determining a target loss value according to the target category, the label of the training pixel and the target weight;
acquiring the training times of a neural network model to be trained;
and training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model.
The number of training pixels in the training image of the training sample set can be substituted into the following formula to obtain the target weight corresponding to the category of the training pixels:
qi = ni / (n1 + n2 + … + nz)
wi = qm / qi
where i denotes the i-th class of training pixels, ni the number of pixels of the i-th class, z the number of classes, qi the frequency corresponding to pixels of the i-th class, qm the median of the frequencies, and wi the target weight corresponding to pixels of the i-th class.
Substituting the target category, the label of the training pixel and the target weight into the following formula to obtain a target loss value:
Lc = -Σi wi yi log(pi)
where yi denotes the label of the training pixel, pi the predicted probability of the target class, and Lc the target loss value.
Because the numbers of pixels of the different categories in the training sample set differ greatly, the target loss value is calculated with a weighted cross-entropy loss function whose target weights are determined from the numbers of training pixels; this avoids overfitting and gives the trained neural network model higher semantic segmentation accuracy.
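A hedged sketch of this weighting scheme, which matches median frequency balancing as reconstructed above, together with the weighted cross-entropy loss; the pixel counts below are illustrative, not from the patent.

```python
import torch
import torch.nn.functional as F

def class_weights(pixel_counts: torch.Tensor) -> torch.Tensor:
    """Median-frequency balancing: wi = median(q) / qi with
    qi = ni / sum_k(nk), matching the formulas above."""
    freq = pixel_counts.float() / pixel_counts.sum()
    return freq.median() / freq

# Counts of training pixels per class (illustrative numbers only).
counts = torch.tensor([5_000_000, 120_000, 30_000])
w = class_weights(counts)

# Weighted cross entropy over per-pixel logits and integer labels.
logits = torch.randn(4, 3, 64, 64)          # (batch, classes, H, W)
labels = torch.randint(0, 3, (4, 64, 64))   # per-pixel class labels
loss_c = F.cross_entropy(logits, labels, weight=w)
```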
Optionally, determining initial region context information between the first region images through a region context layer in the neural network model to be trained, and determining an initial pixel feature of the training image according to the first region images and the initial region context information, including:
determining a first area characteristic corresponding to the first area image according to the mean value of the pixels of the first area image;
determining initial area context information between the first area images according to the first area characteristics, and obtaining second area characteristics according to the initial area context information and the first area characteristics;
and mapping the second region characteristics to the initial deformation grid to obtain the initial pixel characteristics of the training image.
Optionally, training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model includes:
if the training times are smaller than a preset threshold, adding 1 to the training times, updating the network parameters of the neural network model to be trained according to the target loss value, and then returning to the step of dividing the training image to obtain a training region image; and if the training times are equal to the preset threshold, taking the neural network model to be trained as the trained neural network model.
Alternatively, training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model may include:
if the target loss value does not meet a preset condition, updating the network parameters of the neural network model to be trained according to the target loss value, and then returning to the step of dividing the training image to obtain a training region image; and if the target loss value meets the preset condition, taking the neural network model to be trained as the trained neural network model.
Alternatively, training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model may further include:
if the target loss value does not meet the preset condition and the training times are smaller than the preset threshold, adding 1 to the training times, updating the network parameters of the neural network model to be trained according to the target loss value, and then returning to the step of dividing the training image to obtain a training region image;
and if the target loss value meets the preset condition and the training times are equal to the preset threshold, taking the neural network model to be trained as the trained neural network model.
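For orientation only, the third stopping rule above could be organized as the following sketch; model, loader, optimizer, and loss_fn are placeholders rather than the patent's components.

```python
from itertools import cycle

def train(model, loader, optimizer, loss_fn, max_iters, loss_threshold):
    """Sketch of the combined stopping rule from the third variant:
    training stops once the target loss satisfies the preset condition
    and the training count reaches the preset threshold."""
    iters = 0
    for images, labels in cycle(loader):
        loss = loss_fn(model(images), labels)
        if loss.item() <= loss_threshold and iters >= max_iters:
            break                      # model is the trained network
        iters += 1                     # add 1 to the training times
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```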
When the target deformation mesh is obtained through the deformation network layer in the trained neural network model, dividing the training image to obtain a training region image includes:
extracting the features of the training images through a feature extraction layer in the neural network model to be trained to obtain training feature images;
performing displacement prediction on the vertex of a preset grid according to a training characteristic image and the preset grid through a deformation network layer in a neural network model to be trained to obtain an initial deformation grid;
dividing the training characteristic image according to the initial deformation grid through a pixel context layer in the neural network model to be trained to obtain a training area image;
Determining the target loss value according to the target class, the label of the training pixel, and the target weight includes:
determining a first target loss value according to the target category, the label of the training pixel and the target weight;
determining a second target loss value according to the initial pixel characteristics and the average value of the initial pixel characteristics;
determining a target loss value according to the first target loss value and the second target loss value;
training the neural network model to be trained based on the target loss value and the training times to obtain a trained neural network model, comprising:
if the target loss value does not meet the preset condition and/or the training times of the neural network model to be trained are smaller than the preset threshold, adding 1 to the training times, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer, and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value, and returning to the step of extracting features of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image;
and if the target loss value meets the preset condition and/or the training times of the neural network model to be trained are equal to the preset threshold value, stopping training to obtain the trained neural network model.
The positions to which vertices of the preset mesh should be displaced are mostly boundary points or abrupt-change points in the image, that is, the displaced vertices of the preset mesh should coincide with such points, and predicting the target displacement of the vertices directly is therefore of low accuracy. Moreover, optimizing the position of each vertex in the preset mesh independently ignores the topological relations among the vertices and the overall distribution of the preset mesh.
Therefore, in this embodiment, a second target loss value is introduced: when the target loss value does not satisfy the preset condition and/or the training times of the neural network model to be trained are smaller than the preset threshold, the deformation network layer in the neural network model to be trained is updated according to the second target loss value, so that the target displacement of the vertices in the preset mesh predicted through the deformation network layer becomes more accurate.
The second target loss value may be obtained by substituting the initial pixel feature and the average value of the initial pixel feature into the following formula to calculate:
Lvar = (1/N) Σj Σ(pm ∈ Rj) ‖fm - fj‖2
where Lvar denotes the second target loss value, N the number of first region images, j the index of the j-th first region image, pm the coordinates of the m-th initial pixel feature in the j-th first region image, fm the m-th initial pixel feature in the j-th first region image, fj the mean of the initial pixel features of the j-th first region image, and ‖·‖2 the 2-norm.
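A sketch of this loss under the reconstruction above, assuming each first region image is represented as a (pixels, channels) feature tensor; the exact normalization is an assumption.

```python
import torch

def variance_loss(regions: list[torch.Tensor]) -> torch.Tensor:
    """Second target loss: the mean, over the N first region images, of
    the summed 2-norm distance between each initial pixel feature and
    the mean feature of its region (reconstructed from the garbled
    formula; encourages feature homogeneity within each mesh cell)."""
    terms = []
    for feats in regions:             # feats: (P_j, C) pixel features
        mean = feats.mean(dim=0, keepdim=True)     # fj: region mean
        terms.append((feats - mean).norm(p=2, dim=1).sum())
    return torch.stack(terms).mean()  # average over the N regions
```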
Performing displacement prediction on the vertices of the preset mesh according to the training feature image and the preset mesh through the deformation network layer in the neural network model to be trained, to obtain the initial deformation mesh, includes:
determining vertex coordinates in a preset grid;
according to the vertex coordinates, performing vertex feature search in the training feature image to obtain first features corresponding to the vertex coordinates;
determining a second feature corresponding to the vertex coordinate according to the first feature and the context information of the vertex coordinate;
and predicting the initial displacement of the vertex coordinates according to the second characteristics, and moving the vertex coordinates according to the initial displacement to obtain the initial deformation mesh.
In other embodiments, determining the target loss value based on the first target loss value and the second target loss value includes:
determining the sub-area of each grid in the initial deformation grid and the total area of the image to be segmented;
determining a third target loss value according to the sub-area and the total area;
determining a target loss value according to the first target loss value, the second target loss value and the third target loss value;
If the target loss value does not meet the preset condition and/or the training times of the neural network model to be trained are smaller than the preset threshold, adding 1 to the training times, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer, and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer according to the second target loss value, and returning to the step of extracting features of the training image through the feature extraction layer to obtain a training feature image, includes:
if the target loss value does not meet the preset condition and/or the training times of the neural network model to be trained are smaller than the preset threshold, adding 1 to the training times, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer, and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer according to the second target loss value and the third target loss value, and returning to the step of extracting features of the training image through the feature extraction layer to obtain a training feature image.
Adjacent cells in the initial deformation mesh may cross, that is, cells of the initial deformation mesh may overlap, so that the sum of the sub-areas of all cells exceeds the total area of the image to be segmented. To avoid this, the present embodiment further updates the deformation network layer according to the third target loss value, which prevents the cells of the initial deformation mesh from crossing. As a result, when the target displacement of the vertices in the preset mesh is predicted through the deformation network layer, the predicted displacement is more accurate, and the target deformation mesh predicted by the deformation network layer does not self-intersect.
Wherein, the sub-areas and the total area may be substituted into the following formula to obtain the third target loss value:

L_area = max(0, (∑_j area_j − area_img) / area_img)

where L_area represents the third target loss value, area_j represents the sub-area of the jth first region image, i.e. the sub-area of the jth grid, and area_img represents the total area of the image to be segmented.
Substituting the first target loss value, the second target loss value and the third target loss value into the following formula to obtain a target loss value L:
L = L_c + α·L_var + β·L_area
where α represents the weight of the second target loss value L_var, and β represents the weight of the third target loss value L_area.
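Putting the pieces together, the combined target loss might be computed as follows. This is a minimal sketch: the hinge form of L_area is an assumption consistent with the stated purpose of penalizing overlapping grids, since the exact formula appears only as an image in the original publication:

```python
import torch

def total_loss(l_c, l_var, grid_areas, total_area, alpha=1.0, beta=1.0):
    """Combine the three target loss values: L = L_c + alpha*L_var + beta*L_area.

    l_c:        first target loss value (weighted pixel classification loss)
    l_var:      second target loss value (pixel-feature variance loss)
    grid_areas: (J,) tensor of sub-areas of the grids in the deformed mesh
    total_area: scalar total area of the image (H * W)
    """
    # Third target loss value: penalize the excess of the summed grid
    # sub-areas over the total image area (i.e. overlapping/crossed grids).
    l_area = torch.clamp((grid_areas.sum() - total_area) / total_area, min=0.0)
    return l_c + alpha * l_var + beta * l_area
```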
It should be noted that, when the number of training times is used as the condition for deciding whether to terminate training of the neural network model to be trained, determining the target loss value according to the first target loss value and the second target loss value may also mean taking the first target loss value and the second target loss value together as the target loss value, that is, the target loss value includes the first target loss value and the second target loss value.
Similarly, when the number of training times is used as the termination condition, determining the target loss value according to the first target loss value, the second target loss value and the third target loss value may also mean taking the three loss values together as the target loss value, that is, the target loss value includes the first target loss value, the second target loss value and the third target loss value.
As can be seen from the above, in the embodiment of the present application, an image to be segmented is obtained first, and the image to be segmented is divided to obtain an initial area image corresponding to the image to be segmented. Target pixel context information in the initial area image is then determined, and the target area image corresponding to the initial area image is determined according to the target pixel context information and the initial area image. Next, target area context information between the target area images is determined, and the target pixel features of the image to be segmented are determined according to the target area images and the target area context information. Finally, the image to be segmented is segmented according to the target pixel features to obtain the segmentation result.
In other words, in the embodiment of the present application, the target pixel context information in the initial area image is determined first, and the target area image corresponding to the initial area image is determined according to the target pixel context information and the initial area image. The target area context information between the target area images is then determined, and the target pixel features of the image to be segmented are determined according to the target area images and the target area context information. Because context is modeled between target area images, there is no need to compute context information between each pixel in a target area image and all pixels in the other target area images, that is, between each pixel in the image to be segmented and all other pixels in the image to be segmented, thereby reducing the computational complexity.
The method described in the above embodiments is further illustrated in detail by way of example.
In this embodiment, the semantic segmentation apparatus is described as being integrated in the terminal by way of example. The semantic segmentation method may include an application method of the trained neural network model and a training method of the trained neural network model. Specifically, fig. 5 is a schematic flow chart of the training method of the trained neural network model, and fig. 7 is a schematic flow chart of the application method of the trained neural network model.
Referring to fig. 5, the training method of the trained neural network model includes:
S501, the terminal obtains a training sample set, and determines target weights corresponding to the categories of training pixels according to the number of the training pixels in the training images of the training sample set.
S502, the terminal extracts the features of the training image through a feature extraction layer in the neural network model to be trained to obtain a training feature image.
In this embodiment, the structure of the neural network model to be trained may be as shown in fig. 6, where the neural network model to be trained includes a feature extraction layer, a pixel context layer, a region context layer, a deformation network layer, and a segmentation layer, where the feature extraction layer may be a residual neural network, and then a training feature image F is obtained through the feature extraction layer.
It should be understood that after the training feature image is obtained, the training feature image may be subjected to dimension reduction to obtain a training feature image X after dimension reduction, and then the training feature image X after dimension reduction is input to the deformation network layer and the pixel context layer.
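For example, the dimension reduction may be implemented as a 1×1 convolution; the channel counts in the following sketch are illustrative assumptions:

```python
import torch.nn as nn

# Reduce the training feature image F (e.g. 2048 channels from a residual
# backbone) to a lighter feature image X before the deformation network
# layer and the pixel context layer. The channel sizes are assumptions.
reduce_dim = nn.Conv2d(in_channels=2048, out_channels=256, kernel_size=1)
# x = reduce_dim(f)   # f: (B, 2048, H', W') -> x: (B, 256, H', W')
```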
S503, the terminal determines vertex coordinates in a preset grid through a deformation network layer in the neural network model to be trained, and performs vertex feature search in the training feature image according to the vertex coordinates to obtain first features corresponding to the vertex coordinates.
For example, as shown in fig. 6, the deformation network layer includes an extraction layer, a CoordConv layer, six point-by-point convolutional layers, a self-attention layer and a prediction convolutional layer. The terminal determines the vertex coordinates (t_r, s_r) in the preset mesh through the extraction layer in the deformation network layer, and searches for vertex features in the training feature image according to the vertex coordinates to obtain the first features corresponding to the vertex coordinates.
S504, the terminal determines a second feature corresponding to the vertex coordinate according to the first feature and context information of the vertex coordinate through a deformation network layer in the neural network model to be trained, predicts the initial displacement of the vertex coordinate according to the second feature, and moves the vertex coordinate according to the initial displacement to obtain an initial deformation grid.
The terminal may fuse the first features and the vertex coordinates through the CoordConv layer in the deformation network layer to obtain initial candidate features. The terminal then performs point-by-point convolution on the initial candidate features through the point-by-point convolutional layers in the deformation network layer to obtain initial intermediate features, obtains the context information of the vertex coordinates through the self-attention layer in the deformation network layer, and determines the second features corresponding to the vertex coordinates according to the initial intermediate features and the context information of the vertex coordinates; the point-by-point convolution process and the self-attention process are executed 6 times. Finally, the terminal predicts the initial displacement (Δt_r, Δs_r) of the vertex coordinates according to the second features through the prediction convolutional layer, and moves the vertex coordinates according to the initial displacement to obtain the initial deformation mesh.
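A minimal sketch of such a deformation network layer follows, assuming PyTorch; the hidden width, the use of `nn.MultiheadAttention` for the self-attention step, and the replacement of 1×1 convolutions by per-vertex `nn.Linear` layers (equivalent for point-by-point convolution over a set of vertices) are assumptions:

```python
import torch
import torch.nn as nn

class DeformationHead(nn.Module):
    """Sketch of the deformation network layer: CoordConv-style fusion of
    vertex features and coordinates, six pointwise + self-attention stages,
    and a final displacement prediction per vertex."""

    def __init__(self, in_dim, hidden_dim=64, num_stages=6):
        super().__init__()
        # CoordConv step: concatenate each vertex's (x, y) to its features.
        self.coord_fuse = nn.Linear(in_dim + 2, hidden_dim)
        # A 1x1 (point-by-point) convolution over a set of vertices is
        # equivalent to a per-vertex Linear layer.
        self.pointwise = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_stages))
        self.attention = nn.ModuleList(
            nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            for _ in range(num_stages))
        # Prediction convolutional layer: one (dx, dy) offset per vertex.
        self.predict = nn.Linear(hidden_dim, 2)

    def forward(self, vertex_feats, vertex_coords):
        # vertex_feats: (N, C) first features; vertex_coords: (N, 2)
        x = self.coord_fuse(torch.cat([vertex_feats, vertex_coords], dim=-1))
        x = x.unsqueeze(0)                       # (1, N, D), batch of one
        for pointwise, attention in zip(self.pointwise, self.attention):
            x = pointwise(x)                     # initial intermediate features
            context, _ = attention(x, x, x)      # context of vertex coordinates
            x = x + context                      # second features
        displacement = self.predict(x).squeeze(0)   # initial displacement
        return vertex_coords + displacement          # moved vertex coordinates
```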
And S505, the terminal divides the training characteristic image according to the initial deformation grid through the pixel context layer in the neural network model to be trained to obtain a training area image.
S506, the terminal determines initial pixel context information in the training area image through a pixel context layer in the neural network model to be trained, and determines a first area image corresponding to the training image according to the initial pixel context information and the training area image.
For a training area image and the corresponding first region image (the defining formulas are given only as embedded images in the original publication), K_j represents the pixels in the jth training area image. The respective first region images can constitute a new feature map X' ∈ R^{C×H×W}, where R denotes the real number space, C represents the number of channels, H represents the height of the training image, and W represents the width of the training image.
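A hedged sketch of this pixel context step follows, assuming the grid regions are available as boolean masks and that the pixel-level context is computed with `nn.MultiheadAttention` (both assumptions; the text does not fix these details):

```python
import torch
import torch.nn as nn

def pixel_context(features, region_masks, attn):
    """Sketch of the pixel context layer (S505-S506): self-attention among
    the pixels of each training area image, written back into a new map X'.

    features:     (C, H, W) training feature image, divided by the mesh
    region_masks: list of (H, W) boolean masks, one per grid region
    attn:         nn.MultiheadAttention with embed_dim == C, batch_first=True
    """
    C, H, W = features.shape
    flat = features.reshape(C, -1).t()           # (H*W, C) pixel features
    out = flat.clone()
    for mask in region_masks:
        idx = mask.reshape(-1).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        region = flat[idx].unsqueeze(0)          # (1, K_j, C) pixels of region j
        # Restricting attention to within a region is what avoids computing
        # context between every pixel pair of the whole image.
        ctx, _ = attn(region, region, region)
        out[idx] = (region + ctx).squeeze(0)     # first region image pixels
    return out.t().reshape(C, H, W)              # new feature map X'
```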
And S507, the terminal determines a first area characteristic corresponding to the first area image according to the mean value of the pixels of the first area image through an area pooling layer of an area context layer in the neural network model to be trained.
S508, the terminal determines initial area context information between first area images according to the first area features through an area-level self-attention mechanism layer of an area context layer in the neural network model to be trained, and obtains second area features according to the initial area context information and the first area features.
And S509, the terminal maps the second region characteristic to the initial deformation grid through a region inverse pooling layer of a region context layer in the neural network model to be trained to obtain the initial pixel characteristic of the training image.
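Similarly, a hedged sketch of the region pooling, region-level self-attention and region inverse pooling of steps S507–S509, under the same mask and attention assumptions:

```python
import torch
import torch.nn as nn

def region_context(features, region_masks, attn):
    """Sketch of the region context layer (S507-S509): region pooling by
    mean, region-level self-attention, then region inverse pooling.

    features:     (C, H, W) feature map X' from the pixel context layer
    region_masks: list of (H, W) boolean masks, one per grid region
    attn:         nn.MultiheadAttention with embed_dim == C, batch_first=True
    """
    C, H, W = features.shape
    flat = features.reshape(C, -1)                            # (C, H*W)
    # Region pooling: first area feature = mean of each region's pixels.
    pooled = torch.stack(
        [flat[:, m.reshape(-1)].mean(dim=1) for m in region_masks])  # (J, C)
    # Region-level self-attention: context information between region images.
    ctx, _ = attn(pooled.unsqueeze(0), pooled.unsqueeze(0), pooled.unsqueeze(0))
    second = pooled + ctx.squeeze(0)                          # second features
    # Region inverse pooling: broadcast each region feature to its pixels.
    out = torch.zeros_like(flat)
    for feat, m in zip(second, region_masks):
        out[:, m.reshape(-1)] = feat.unsqueeze(1)
    return out.reshape(C, H, W)                               # pixel features
```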
S5010, the terminal determines the target class of the training pixels corresponding to the initial pixel features according to the initial pixel features, the first area images and the training feature images through the segmentation layer in the neural network model to be trained.
S5011, the terminal determines a first target loss value according to the target type, the label of the training pixel and the target weight.
S5012, the terminal determines a second target loss value according to the initial pixel characteristics and the average value of the initial pixel characteristics.
S5013, the terminal determines the sub-areas of the grids in the initial deformation grid and the total area of the image to be segmented, and determines a third target loss value according to the sub-areas and the total area.
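The sub-area of a (possibly deformed) quadrilateral grid cell can be computed from its four vertices with the shoelace formula; a small sketch, where the consecutive vertex ordering is an assumption:

```python
import torch

def quad_area(corners):
    """Area of one quadrilateral grid cell via the shoelace formula.

    corners: (4, 2) tensor of (x, y) vertices in consecutive (e.g.
             counter-clockwise) order around the cell.
    """
    x, y = corners[:, 0], corners[:, 1]
    x_next, y_next = torch.roll(x, -1), torch.roll(y, -1)
    return 0.5 * torch.abs(torch.sum(x * y_next - x_next * y))
```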
S5014, the terminal obtains the training times of the neural network model to be trained.
And S5015, if the number of training times is smaller than the preset threshold, the terminal increases the number of training times by 1, updates the network parameters of the feature extraction layer, the pixel context layer, the area context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updates the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value and the third target loss value, and returns to execute step S502.
S5016, if the training times are equal to the preset threshold, the terminal takes the neural network model to be trained as the trained neural network model.
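The split parameter update of steps S5014–S5015 might be realized as follows. This is a hedged sketch: the `model.losses` helper, the parameter groupings, the SGD optimizers, and the application of the weights α and β to the deformation-layer update are all assumptions, not details fixed by the text:

```python
import torch

def train_step(model, batch, opt_backbone, opt_deform,
               backbone_params, deform_params, alpha=1.0, beta=1.0):
    """One training iteration with the two-way update described above.

    backbone_params: parameters of the feature extraction, pixel context,
                     region context and segmentation layers (assumed handles)
    deform_params:   parameters of the deformation network layer
    """
    l_c, l_var, l_area = model.losses(batch)   # the three target loss values

    # The first target loss value drives the feature extraction, pixel
    # context, region context and segmentation layers...
    g_backbone = torch.autograd.grad(
        l_c, backbone_params, retain_graph=True, allow_unused=True)
    # ...while the second and third drive the deformation network layer.
    g_deform = torch.autograd.grad(
        alpha * l_var + beta * l_area, deform_params, allow_unused=True)

    opt_backbone.zero_grad()
    opt_deform.zero_grad()
    for p, g in zip(backbone_params, g_backbone):
        p.grad = g
    for p, g in zip(deform_params, g_deform):
        p.grad = g
    opt_backbone.step()
    opt_deform.step()
```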
Referring to fig. 7, the method of applying the trained neural network model includes:
S701, the terminal obtains an image to be segmented, and performs feature extraction on the image to be segmented through the feature extraction layer in the trained neural network model to obtain a feature image.
S702, the terminal obtains a preset grid, determines a vertex coordinate in the preset grid through a deformation network layer in the trained neural network model, and performs vertex feature search in the feature image according to the vertex coordinate to obtain an initial feature corresponding to the vertex coordinate.
S703, the terminal determines a target feature corresponding to the vertex coordinate according to the initial feature and the context information of the vertex coordinate through a deformation network layer in the trained neural network model, predicts the target displacement of the vertex coordinate according to the target feature, and moves the vertex coordinate according to the target displacement to obtain the target deformation grid.
And S704, the terminal divides the characteristic image according to the target deformation grid through the pixel context layer in the trained neural network model to obtain an initial area image corresponding to the image to be segmented.
S705, the terminal determines target pixel context information in the initial area image through the trained pixel context layer in the neural network model, and determines a target area image corresponding to the initial area image according to the target pixel context information and the initial area image.
S706, the terminal determines the initial region characteristics corresponding to the target region image according to the mean value of the pixels of the target region image through the region pooling layer of the region context layer in the trained neural network model.
S707, the terminal determines the target area context information between the target area images through the area level self-attention layer of the area context layer in the trained neural network model, and determines the target area characteristics according to the initial area characteristics and the target area context information.
And S708, mapping the target region characteristics to a target deformation grid by the terminal through a region inverse pooling layer of a region context layer in the trained neural network model to obtain the target pixel characteristics of the image to be segmented.
For example, the obtained target pixel features may be written as X ∈ R^{C×H×W}.
And S709, the terminal segments the image to be segmented according to the target pixel feature, the target area image and the feature image through the segmentation layer in the trained neural network model to obtain a segmentation result.
The segmentation layer can be a softmax layer. For example, the segmentation result obtained by the trained neural network model can be as shown in fig. 8, where the dashed box in fig. 8 represents the region of interest.
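At inference time the trained model runs as a single forward pass; a hedged usage sketch, where the `model` handle and the input preprocessing are assumptions:

```python
import torch

# `model` is an assumed handle to the trained neural network model, and
# `image` an assumed (C, H, W) tensor holding the image to be segmented.
model.eval()
with torch.no_grad():
    logits = model(image.unsqueeze(0))        # (1, num_classes, H, W)
    probs = torch.softmax(logits, dim=1)      # segmentation (softmax) layer
    segmentation = probs.argmax(dim=1)        # (1, H, W) per-pixel classes
```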
The beneficial effects and other realizable manners in this embodiment may specifically refer to the above semantic segmentation method embodiment, which is not described herein again.
In order to better implement the semantic segmentation method provided by the embodiment of the present application, the embodiment of the present application further provides a device based on the semantic segmentation method. The terms used below have the same meanings as in the above semantic segmentation method, and specific implementation details can refer to the description in the method embodiment.
For example, as shown in fig. 9, the semantic segmentation means may include:
the image obtaining module 901 is configured to obtain an image to be segmented, and divide the image to be segmented to obtain an initial region image corresponding to the image to be segmented.
A first determining module 902, configured to determine context information of a target pixel in an initial area image, and determine a target area image corresponding to the initial area image according to the context information of the target pixel and the initial area image.
A second determining module 903, configured to determine context information of a target area between the target area images, and determine a target pixel feature of the image to be segmented according to the target area image and the context information of the target area.
And an image segmentation module 904, configured to segment the image to be segmented according to the target pixel feature to obtain a segmentation result.
Optionally, the image obtaining module 901 is specifically configured to perform:
performing feature extraction on an image to be segmented to obtain a feature image;
acquiring a preset grid, and performing displacement prediction on the vertex of the preset grid according to the characteristic image and the preset grid to obtain a target deformation grid;
and dividing the characteristic image according to the target deformation grid to obtain an initial area image corresponding to the image to be segmented.
Optionally, the image obtaining module 901 is specifically configured to perform:
determining vertex coordinates in a preset grid;
according to the vertex coordinates, performing vertex feature search in the feature image to obtain initial features corresponding to the vertex coordinates;
determining a target feature corresponding to the vertex coordinate according to the initial feature and the context information of the vertex coordinate;
and predicting the target displacement of the vertex coordinates according to the target characteristics, and moving the vertex coordinates according to the target displacement to obtain the target deformation mesh.
Optionally, the second determining module 903 is specifically configured to perform:
determining initial area characteristics corresponding to the target area image according to the mean value of the pixels of the target area image;
determining target area context information between target area images according to the initial area characteristics;
determining the target area characteristics according to the initial area characteristics and the target area context information;
and mapping the target region characteristics to a target deformation grid to obtain the target pixel characteristics of the image to be segmented.
Optionally, the semantic segmentation apparatus further includes:
a training module to perform:
acquiring a training sample set, and determining target weights corresponding to the categories of training pixels according to the number of the training pixels in training images of the training sample set;
dividing the training image to obtain a training area image;
determining initial pixel context information in a training area image through a pixel context layer in a neural network model to be trained, and determining a first area image corresponding to the training image according to the initial pixel context information and the training area image;
determining initial region context information between first region images through a region context layer in a neural network model to be trained, and determining initial pixel characteristics of a training image according to the first region images and the initial region context information;
determining a target class of a training pixel corresponding to the initial pixel characteristic through a segmentation layer in a neural network model to be trained;
determining a target loss value according to the target category, the label of the training pixel and the target weight;
acquiring the training times of a neural network model to be trained;
and training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model.
Optionally, the training module is specifically configured to perform:
extracting the features of the training images through a feature extraction layer in the neural network model to be trained to obtain training feature images;
performing displacement prediction on the vertex of a preset grid according to a training characteristic image and the preset grid through a deformation network layer in a neural network model to be trained to obtain an initial deformation grid;
dividing the training characteristic image according to the initial deformation grid through a pixel context layer in the neural network model to be trained to obtain a training area image;
determining a first target loss value according to the target category, the label of the training pixel and the target weight;
determining a second target loss value according to the initial pixel characteristics and the average value of the initial pixel characteristics;
determining a target loss value according to the first target loss value and the second target loss value;
if the target loss value does not meet the preset condition and/or the number of training times of the neural network model to be trained is smaller than the preset threshold, increasing the number of training times by 1, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer according to the second target loss value, and returning to perform feature extraction on the training image through the feature extraction layer to obtain a training feature image;
and if the target loss value meets a preset condition and/or the training times of the neural network model to be trained are equal to a preset threshold value, stopping training to obtain the trained neural network model.
Optionally, the training module is specifically configured to perform:
determining the sub-area of each grid in the initial deformation grid and the total area of the image to be segmented;
determining a third target loss value according to the sub-area and the total area;
determining a target loss value according to the first target loss value, the second target loss value and the third target loss value;
if the target loss value does not meet the preset condition and/or the number of training times of the neural network model to be trained is smaller than the preset threshold, increasing the number of training times by 1, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer according to the second target loss value and the third target loss value, and returning to perform feature extraction on the training image through the feature extraction layer to obtain the training feature image.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily, and implemented as the same or several entities, and the specific implementation manner and the corresponding beneficial effects of the above modules may refer to the foregoing method embodiments, which are not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device may be a server or a terminal, and as shown in fig. 10, a schematic structural diagram of the electronic device according to the embodiment of the present application is shown, specifically:
the electronic device may include components such as a processor 1001 of one or more processing cores, memory 1002 of one or more computer-readable storage media, a power source 1003, and an input unit 1004. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 10 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 1001 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing computer programs and/or modules stored in the memory 1002 and calling data stored in the memory 1002. Optionally, processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1001.
The memory 1002 may be used to store computer programs and modules, and the processor 1001 executes various functional applications and data processing by operating the computer programs and modules stored in the memory 1002. The memory 1002 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 1002 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 access to the memory 1002.
The electronic device further includes a power source 1003 for supplying power to each component, and preferably, the power source 1003 may be logically connected to the processor 1001 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are implemented through the power management system. The power source 1003 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further include an input unit 1004, and the input unit 1004 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 1001 in the electronic device loads the executable file corresponding to the process of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 runs the computer programs stored in the memory 1002, so as to implement various functions, such as:
acquiring an image to be segmented, and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
determining target pixel context information in the initial area image, and determining a target area image corresponding to the initial area image according to the target pixel context information and the initial area image;
determining target area context information between target area images, and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information;
and segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
The above detailed embodiments of the operations and the corresponding beneficial effects can be referred to the above detailed description of the semantic segmentation method, which is not repeated herein.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, or by related hardware controlled by the computer program; the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the semantic segmentation methods provided in the present application. For example, the computer program may perform the steps of:
acquiring an image to be segmented, and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
determining target pixel context information in the initial area image, and determining a target area image corresponding to the initial area image according to the target pixel context information and the initial area image;
determining target area context information between target area images, and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information;
and segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
The specific implementation of the above operations and the corresponding beneficial effects can be referred to the foregoing embodiments, and are not described herein again.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the computer-readable storage medium can execute the steps in any semantic segmentation method provided in the embodiments of the present application, beneficial effects that can be achieved by any semantic segmentation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the semantic segmentation method.
The semantic segmentation method, the semantic segmentation device, the electronic device, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above, and a specific example is applied in the description to explain the principles and implementations of the present application, and the description of the embodiments is only used to help understand the method and the core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method of semantic segmentation, comprising:
acquiring an image to be segmented, and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
determining target pixel context information in the initial area image, and determining a target area image corresponding to the initial area image according to the target pixel context information and the initial area image;
determining target area context information between the target area images, and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information;
and segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
2. The semantic segmentation method according to claim 1, wherein the step of dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented comprises:
performing feature extraction on the image to be segmented to obtain a feature image;
acquiring a preset grid, and performing displacement prediction on the vertex of the preset grid according to the characteristic image and the preset grid to obtain a target deformation grid;
and dividing the characteristic image according to the target deformation grid to obtain an initial area image corresponding to the image to be segmented.
3. The semantic segmentation method according to claim 2, wherein the determining of the context information of the target area between the target area images and the determining of the target pixel feature of the image to be segmented according to the target area images and the context information of the target area comprise:
determining initial area characteristics corresponding to the target area image according to the mean value of the pixels of the target area image;
determining target area context information between the target area images according to the initial area characteristics;
determining the target area characteristics according to the initial area characteristics and the target area context information;
and mapping the target region characteristics to the target deformation grid to obtain the target pixel characteristics of the image to be segmented.
4. The semantic segmentation method according to any one of claims 1-3, wherein the target pixel context information and the target area image are determined by a pixel context layer in a trained neural network model, the target area context information and the target pixel feature are determined by a region context layer in the trained neural network model, and the segmentation result is determined by a segmentation layer in the trained neural network model, the method further comprising:
acquiring a training sample set, and determining a target weight corresponding to the category of a training pixel according to the number of the training pixels in a training image of the training sample set;
dividing the training image to obtain a training area image;
determining initial pixel context information in the training area image through a pixel context layer in a neural network model to be trained, and determining a first area image corresponding to the training image according to the initial pixel context information and the training area image;
determining initial region context information between the first region images through a region context layer in the neural network model to be trained, and determining initial pixel characteristics of the training images according to the first region images and the initial region context information;
determining the target class of the training pixel corresponding to the initial pixel feature through a segmentation layer in the neural network model to be trained;
determining a target loss value according to the target category, the label of the training pixel and the target weight;
acquiring the training times of the neural network model to be trained;
and training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model.
5. The semantic segmentation method according to claim 4, wherein the target deformation mesh is obtained through a deformation network layer in the trained neural network model;
the dividing the training image to obtain a training area image includes:
performing feature extraction on the training image through a feature extraction layer in the neural network model to be trained to obtain a training feature image;
performing displacement prediction on the vertex of the preset mesh according to the training characteristic image and the preset mesh through the deformation network layer in the neural network model to be trained to obtain an initial deformation mesh;
dividing the training characteristic image according to the initial deformation grid through the pixel context layer in the neural network model to be trained to obtain a training area image;
determining a target loss value according to the target class, the label of the training pixel and the target weight, including:
determining a first target loss value according to the target category, the label of the training pixel and the target weight;
determining a second target loss value according to the initial pixel characteristic and the average value of the initial pixel characteristic;
determining a target loss value according to the first target loss value and the second target loss value;
the training the neural network model to be trained based on the target loss value and the training times to obtain the trained neural network model, including:
if the target loss value does not meet a preset condition and/or the number of training times of the neural network model to be trained is smaller than a preset threshold, increasing the number of training times by 1, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value, and returning to execute the feature extraction of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image;
and if the target loss value meets a preset condition and/or the training times of the neural network model to be trained are equal to a preset threshold value, stopping training to obtain the trained neural network model.
6. The semantic segmentation method according to claim 5, wherein determining a target loss value based on the first target loss value and the second target loss value comprises:
determining the sub-area of each grid in the initial deformation grid and the total area of the image to be segmented;
determining a third target loss value according to the sub-area and the total area;
determining a target loss value according to the first target loss value, the second target loss value and the third target loss value;
if the target loss value does not meet a preset condition and/or the number of training times of the neural network model to be trained is smaller than a preset threshold, increasing the number of training times by 1, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value, and returning to execute the feature extraction of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image, includes:
if the target loss value does not meet a preset condition and/or the number of training times of the neural network model to be trained is smaller than a preset threshold, increasing the number of training times by 1, updating the network parameters of the feature extraction layer, the pixel context layer, the region context layer and the segmentation layer in the neural network model to be trained according to the first target loss value, updating the network parameters of the deformation network layer in the neural network model to be trained according to the second target loss value and the third target loss value, and returning to execute the feature extraction of the training image through the feature extraction layer in the neural network model to be trained to obtain a training feature image.
7. A semantic segmentation apparatus, comprising:
the image acquisition module is used for acquiring an image to be segmented and dividing the image to be segmented to obtain an initial region image corresponding to the image to be segmented;
a first determining module, configured to determine context information of a target pixel in the initial area image, and determine a target area image corresponding to the initial area image according to the context information of the target pixel and the initial area image;
the second determining module is used for determining target area context information between the target area images and determining target pixel characteristics of the image to be segmented according to the target area images and the target area context information;
and the image segmentation module is used for segmenting the image to be segmented according to the target pixel characteristics to obtain a segmentation result.
8. An electronic device comprising a processor and a memory, the memory storing a computer program, the processor being configured to execute the computer program in the memory to perform the semantic segmentation method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the semantic segmentation method according to any one of claims 1 to 6.
10. A computer program product, characterized in that it stores a computer program adapted to be loaded by a processor for performing the semantic segmentation method according to any one of claims 1 to 6.
CN202210272072.6A 2022-03-18 2022-03-18 Semantic segmentation method and device, electronic equipment and computer-readable storage medium Pending CN114648762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210272072.6A CN114648762A (en) 2022-03-18 2022-03-18 Semantic segmentation method and device, electronic equipment and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN114648762A true CN114648762A (en) 2022-06-21

Family

ID=81995141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210272072.6A Pending CN114648762A (en) 2022-03-18 2022-03-18 Semantic segmentation method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114648762A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063800A (en) * 2022-08-16 2022-09-16 阿里巴巴(中国)有限公司 Text recognition method and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120008299A (en) * 2010-07-16 2012-01-30 광운대학교 산학협력단 Adaptive filtering apparatus and method for intra prediction based on characteristics of prediction block regions
CN105957066A (en) * 2016-04-22 2016-09-21 北京理工大学 CT image liver segmentation method and system based on automatic context model
CN110473159A (en) * 2019-08-20 2019-11-19 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment, computer readable storage medium
US20200201344A1 (en) * 2018-12-21 2020-06-25 Here Global B.V. Method and apparatus for the detection and labeling of features of an environment through contextual clues
CN113240687A (en) * 2021-05-17 2021-08-10 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VIJAY BADRINARAYANAN等: "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 12, 1 December 2017 (2017-12-01), pages 2481, XP055942927, DOI: 10.1109/TPAMI.2016.2644615 *
余航等: "基于上下文分析的无监督分层迭代算法用于SAR图像分割", 自动化学报, vol. 40, no. 1, 15 January 2014 (2014-01-15), pages 100 - 116 *


Similar Documents

Publication Publication Date Title
CN110555481B (en) Portrait style recognition method, device and computer readable storage medium
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN112419368A (en) Method, device and equipment for tracking track of moving target and storage medium
CN111709497B (en) Information processing method and device and computer readable storage medium
CN111666919B (en) Object identification method and device, computer equipment and storage medium
KR102462934B1 (en) Video analysis system for digital twin technology
CN112052837A (en) Target detection method and device based on artificial intelligence
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
CN111709471B (en) Object detection model training method and object detection method and device
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
EP4404148A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN113076963B (en) Image recognition method and device and computer readable storage medium
CN112052771A (en) Object re-identification method and device
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN114611692A (en) Model training method, electronic device, and storage medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN112883827B (en) Method and device for identifying specified target in image, electronic equipment and storage medium
CN114648762A (en) Semantic segmentation method and device, electronic equipment and computer-readable storage medium
Wang et al. Salient object detection using biogeography-based optimization to combine features
CN111008622B (en) Image object detection method and device and computer readable storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination