WO2021093435A1 - Method, apparatus, device, and storage medium for generating a semantic segmentation network structure (语义分割网络结构的生成方法、装置、设备及存储介质) - Google Patents

Method, apparatus, device, and storage medium for generating a semantic segmentation network structure

Info

Publication number
WO2021093435A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
semantic segmentation
network structure
super
segmentation network
Application number
PCT/CN2020/114372
Other languages
English (en)
French (fr)
Inventor
孙鹏
吴家祥
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2021093435A1 publication Critical patent/WO2021093435A1/zh
Priority to US17/515,180, published as US20220051056A1

Classifications

    • G06V40/161 Human faces: detection; localisation; normalisation
    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Pattern recognition: validation; performance evaluation; active pattern learning techniques
    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/082 Neural network learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06T7/11 Image analysis: region-based segmentation
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/774 Image or video recognition: generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/776 Image or video recognition: validation; performance evaluation
    • G06V10/806 Image or video recognition: fusion of extracted features
    • G06V10/82 Image or video recognition using neural networks
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes
    • G06V20/54 Surveillance or monitoring of traffic, e.g. cars on the road, trains or boats
    • G06V20/56 Context or environment of the image exterior to a vehicle, by using sensors mounted on the vehicle
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06T2207/20081 Image analysis indexing scheme: training; learning
    • G06T2207/20084 Image analysis indexing scheme: artificial neural networks [ANN]
    • G06V2201/07 Image or video recognition indexing scheme: target detection

Definitions

  • This application relates to artificial intelligence technology, and in particular to a method, device, electronic device, and computer-readable storage medium for generating a semantic segmentation network structure.
  • Artificial intelligence is a comprehensive technology of computer science: by studying the design principles and implementation methods of various intelligent machines, it gives machines the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, such as natural language processing and machine learning/deep learning. As the technology develops, artificial intelligence will be applied in more and more fields and will deliver increasingly important value.
  • Semantic segmentation is one of the important applications in the field of artificial intelligence. It is widely used in autonomous driving, real-time video editing, face recognition systems, intelligent hardware, and the like; that is, semantic segmentation is a basic component of these complex systems.
  • However, current semantic segmentation network structures are relatively simple and fixed, and with a fixed semantic segmentation network structure it is impossible to recognize the content of an image and the corresponding positions in real time.
  • In view of this, the embodiments of the present application provide a method, device, electronic device, and computer-readable storage medium for generating a semantic segmentation network structure, which can dynamically adjust the semantic segmentation network structure and thereby improve the performance of real-time segmentation.
  • An embodiment of the present application provides a method for generating a semantic segmentation network structure, the method being executed by an electronic device, where the semantic segmentation network structure includes super units and an aggregation unit, and the method includes:
  • optimizing the semantic segmentation network structure based on image samples, and removing redundant units in the super unit to which the target unit belongs, to obtain an improved semantic segmentation network structure, where the target unit is the unit with the largest architecture parameter among the units;
  • performing feature fusion on the outputs of the super units from which the redundant units have been removed, to obtain a fused feature map.
  • An embodiment of the present application provides a method for semantic segmentation of an image, which is executed by an electronic device and applies the trained semantic segmentation network structure;
  • the method includes:
  • the object and the position corresponding to the object are marked by a preset marking method.
  • An embodiment of the present application provides a device for generating a semantic segmentation network structure, the device including:
  • an adding module, configured to generate corresponding architecture parameters for each unit constituting a super unit in the semantic segmentation network structure;
  • a removing module, configured to optimize the semantic segmentation network structure based on image samples and remove redundant units in the super unit to which the target unit belongs, to obtain an improved semantic segmentation network structure, where the target unit is the unit with the largest architecture parameter among the units;
  • a fusion module, configured to perform feature fusion, through the aggregation unit in the improved semantic segmentation network structure, on the outputs of the super units from which the redundant units have been removed, to obtain a fused feature map;
  • a training module, configured to perform recognition processing on the fused feature map, determine the positions corresponding to the objects existing in the image samples, and train the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and the annotations corresponding to the image samples, to obtain the trained semantic segmentation network structure.
  • An embodiment of the present application provides an image semantic segmentation device, the device includes:
  • a determining module, configured to determine the image to be semantically segmented;
  • a processing module, configured to perform recognition processing on the image to be semantically segmented through the trained semantic segmentation network structure, determine the objects existing in the image and the positions corresponding to the objects, and mark the objects and their corresponding positions by a preset marking method.
  • An embodiment of the present application provides an electronic device for generating a semantic segmentation network structure, including:
  • a memory, configured to store executable instructions;
  • a processor, configured to implement the method for generating a semantic segmentation network structure provided in the embodiments of the present application when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides an electronic device for semantic segmentation of an image, including:
  • a memory, configured to store executable instructions;
  • a processor, configured to implement the image semantic segmentation method provided in the embodiments of the present application when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method for generating a semantic segmentation network structure provided by the embodiments of the present application, or the image semantic segmentation method provided by the embodiments of the present application.
  • The redundant computation units in the semantic segmentation network structure are removed, which saves computation in subsequent semantic segmentation, realizes dynamic adjustment of the semantic segmentation network structure, and reduces the depth of the semantic segmentation network structure;
  • feature fusion is performed on the outputs of the super units from which the redundant units have been removed, thereby adaptively fusing super-unit outputs of different resolutions and improving the performance of real-time segmentation.
  • FIG. 1 is a schematic diagram of an application scenario of a semantic segmentation network structure generation system 10 provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an electronic device 500 for generating a semantic segmentation network structure provided by an embodiment of the present application
  • FIG. 3 to FIG. 6 are schematic flowcharts of a method for generating a semantic segmentation network structure provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an application scenario of the image semantic segmentation system 20 provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device 600 for semantic segmentation of images provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of an image semantic segmentation method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a super unit structure provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a unit structure provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a semantic segmentation network structure provided by an embodiment of the present application.
  • Fig. 13 is a schematic structural diagram of an aggregation unit provided by an embodiment of the present application.
  • The terms "first" and "second" are used only to distinguish similar objects and do not denote a specific order of objects. It can be understood that, where permitted, the specific order or sequence of "first" and "second" may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
  • Image recognition: the technology of using computers to process, analyze, and understand images in order to identify targets and objects in various patterns; it is a practical application of deep learning algorithms. Image recognition technology is generally divided into face recognition and item recognition: face recognition is mainly used in security inspection, identity verification, and mobile payment, while item recognition is mainly used in the circulation of goods, especially in retail settings such as unmanned shelves and smart retail cabinets.
  • Target detection: also called target extraction, it is a type of image segmentation based on the geometric and statistical characteristics of the target that combines segmentation and recognition of the target into one. Its accuracy and real-time performance are important capabilities of the overall system; especially in complex scenes where multiple targets must be processed in real time, automatic target extraction and recognition are particularly important.
  • Dynamic real-time tracking and positioning of targets has wide application value in intelligent transportation systems, intelligent monitoring systems, military target detection, and the positioning of surgical instruments in medical navigation surgery.
  • Unit: composed of at least one node in the neural network.
  • In the embodiments of the present application, a unit can be composed of two nodes (a first intermediate node and a second intermediate node). For example, the output results of the (k-1)-th unit and the k-th unit are input to the first intermediate node in the (k+1)-th unit; after the first intermediate node processes them, its output is input to the second intermediate node in the (k+1)-th unit; after the second intermediate node processes it, its output is input to the (k+2)-th unit.
  • Super unit: composed of units of the same stage (resolution). For example, if the resolution of the (k-1)-th and k-th units is 128*128 and the resolution of the (k+1)-th and (k+2)-th units is 64*64, then the (k-1)-th and k-th units constitute one super unit, and the (k+1)-th and (k+2)-th units constitute another super unit. A minimal sketch of these two structures follows.
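  • The following is a minimal, hedged sketch of the unit / super-unit structure just described, assuming PyTorch; the class names and the choice of 3x3 convolutions for the intermediate nodes are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class Unit(nn.Module):
    """One unit: two intermediate nodes fed by the two preceding units' outputs."""
    def __init__(self, channels: int):
        super().__init__()
        # first intermediate node takes the outputs of units k-1 and k (concatenated)
        self.node1 = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # second intermediate node refines the first node's output
        self.node2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, prev: torch.Tensor, cur: torch.Tensor) -> torch.Tensor:
        return self.node2(self.node1(torch.cat([prev, cur], dim=1)))

class SuperUnit(nn.Module):
    """A super unit: a chain of units that all share one resolution."""
    def __init__(self, channels: int, num_units: int):
        super().__init__()
        self.units = nn.ModuleList([Unit(channels) for _ in range(num_units)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prev, cur = x, x  # both "previous" outputs start as the input feature map
        for unit in self.units:
            prev, cur = cur, unit(prev, cur)
        return cur
```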
  • Semantic segmentation: fine-grained reasoning achieved by dense prediction and label inference for each pixel in an image, so that each pixel is labeled with its category; that is, the content of an image and its locations are identified by finding the category of every pixel in the image.
  • The semantic segmentation network structure described in the embodiments of this application can be applied to various recognition fields, such as image recognition neural networks, target detection neural networks, face detection neural networks, and automatic driving systems; the semantic segmentation network structure is not limited to a particular recognition field.
  • The embodiments of the present application provide a method, device, electronic device, and computer-readable storage medium for generating a semantic segmentation network structure, which can dynamically adjust the semantic segmentation network structure, thereby improving the performance of real-time segmentation, reducing computational complexity, and saving computational cost; the trained semantic segmentation network structure can then be applied to subsequent semantic segmentation operations.
  • the following describes exemplary applications of the electronic device for generating the semantic segmentation network structure provided by the embodiment of the application.
  • the electronic device for generating the semantic segmentation network structure provided by the embodiment of the application may be a server, for example, a server deployed in the cloud.
  • A series of processing is performed based on the initial semantic segmentation network structure and image samples to obtain the corresponding trained semantic segmentation network structure, which is provided to users for subsequent semantic segmentation operations. The electronic device may also be any of various types of user terminal, such as a notebook computer, tablet computer, desktop computer, or mobile device (e.g., a mobile phone or personal digital assistant), for example a handheld terminal: according to the initial semantic segmentation network structure and image samples input by the user on the handheld terminal, the corresponding trained semantic segmentation network structure is obtained and provided to the user for subsequent semantic segmentation operations.
  • FIG. 1 is a schematic diagram of an application scenario of the semantic segmentation network structure generation system 10 provided by an embodiment of the present application.
  • the terminal 200 is connected to the server 100 through the network 300.
  • The network 300 may be a wide area network or a local area network, or a combination of the two.
  • In some embodiments, the terminal 200 locally executes the semantic segmentation network structure generation method provided in the embodiments of this application, processing the initial semantic segmentation network structure and image samples input by the user to obtain the trained semantic segmentation network structure. For example, a semantic segmentation network structure generation assistant is installed on the terminal 200; the user inputs the initial semantic segmentation network structure and image samples, the terminal 200 obtains the trained semantic segmentation network structure from this input, and the trained semantic segmentation network structure is displayed on the display interface 210 of the terminal 200, so that the user can perform image recognition, target detection, and other applications with it.
  • the terminal 200 may also send the initial semantic segmentation network structure and image samples input by the user on the terminal 200 to the server 100 via the network 300, and call the semantic segmentation network structure generation function provided by the server 100.
  • The server 100 obtains the trained semantic segmentation network structure through the semantic segmentation network structure generation method provided in the embodiments of the present application. For example, the semantic segmentation network structure generation assistant is installed on the terminal 200, and the user inputs the initial semantic segmentation network structure and image samples in the assistant; the terminal sends the initial semantic segmentation network structure and image samples to the server 100 through the network 300; the server 100 receives them, performs a series of processing to obtain the trained semantic segmentation network structure, and returns it to the semantic segmentation network structure generation assistant, which displays it on the display interface 210 of the terminal 200; alternatively, the server 100 directly provides the trained semantic segmentation network structure, so that users can perform image recognition, target detection, and other applications based on it.
  • Taking image recognition as an example, the server or terminal can, based on the initial semantic segmentation network structure and image samples, optimize the initial semantic segmentation network structure using the image samples and remove redundant units to obtain the improved semantic segmentation network structure; through the improved semantic segmentation network structure, the objects existing in the image samples and their corresponding positions are determined, and based on these, the improved semantic segmentation network structure is trained, so that images can subsequently be semantically segmented and their categories determined according to the trained structure. For example, an image is semantically segmented according to the trained semantic segmentation network structure to obtain the label corresponding to the image (car, bus, etc.).
  • The redundant computation units in the semantic segmentation network structure are removed, saving computation in subsequent semantic segmentation, and feature fusion is performed through the aggregation unit on the outputs of the super units from which the redundant units have been removed, thereby adaptively fusing super-unit outputs of different resolutions and improving the performance of real-time image segmentation.
  • In order to obtain a semantic segmentation network structure for target detection, the server or terminal can, based on the initial semantic segmentation network structure and target object samples, optimize the initial structure using the target object samples and remove redundant units to obtain the improved semantic segmentation network structure; through the improved structure, the objects existing in the target object samples and their corresponding positions are determined, and based on these the improved structure is trained, so that target objects can be semantically segmented according to the trained semantic segmentation network structure and their categories determined. For example, a target object is semantically segmented according to the trained structure to obtain its corresponding label (trees, pedestrians, vehicles, etc.), so as to detect pedestrians.
  • The redundant computation units in the semantic segmentation network structure are removed, saving computation in subsequent semantic segmentation, and feature fusion is performed through the aggregation unit on the outputs of the super units from which the redundant units have been removed, thereby adaptively fusing super-unit outputs of different resolutions and improving the performance of real-time segmentation of target objects.
  • In order to obtain a semantic segmentation network structure for face recognition, the server or terminal can, based on the initial semantic segmentation network structure and face samples, optimize the initial structure using the face samples and remove redundant units; through the improved semantic segmentation network structure, the objects existing in the face samples and their corresponding positions are determined, and based on these the improved structure is trained, so that faces can subsequently be semantically segmented according to the trained semantic segmentation network structure and their categories determined, thereby realizing face recognition. For example, a face is semantically segmented according to the trained structure to obtain the label corresponding to the face (Xiao Ming, Xiao Hong, Xiao Qiang, etc.).
  • In order to realize automatic driving, the server or terminal may, based on the initial semantic segmentation network structure and road condition driving samples, optimize the initial structure using the road condition driving samples and remove redundant units; through the aggregation unit in the improved semantic segmentation network structure, the objects existing in the road condition driving samples and their corresponding positions are determined, and based on these the improved structure is trained, so that road conditions can subsequently be semantically segmented according to the trained semantic segmentation network structure and the driving category of the road conditions determined, thereby realizing automatic driving according to the road conditions. For example, the road conditions are semantically segmented according to the trained structure to obtain the corresponding label (turn left, turn right, go straight, etc.).
  • The redundant computation units in the semantic segmentation network structure are removed, saving computation in subsequent semantic segmentation, and feature fusion is performed through the aggregation unit on the outputs of the super units from which the redundant units have been removed, thereby adaptively fusing super-unit outputs of different resolutions and improving the performance of real-time segmentation of road conditions.
  • In order to obtain a semantic segmentation network structure for video editing, the server or terminal can, based on the initial semantic segmentation network structure and video editing samples, optimize the initial structure using the video editing samples and remove redundant units to obtain the improved semantic segmentation network structure; through the improved structure, the objects existing in the video editing samples and their corresponding positions are determined, and based on these the improved structure is trained, so that videos can subsequently be semantically segmented according to the trained semantic segmentation network structure and the editing category of the video determined, thereby realizing automatic real-time editing of the video. For example, a video is semantically segmented according to the trained structure to obtain the labels corresponding to the video (cropped, reduced, enlarged, etc.).
  • The redundant computation units in the semantic segmentation network structure are removed, saving computation in subsequent semantic segmentation, and feature fusion is performed through the aggregation unit on the outputs of the super units from which the redundant units have been removed, thereby adaptively fusing super-unit outputs of different resolutions and improving the performance of real-time video segmentation.
  • The electronic device used to generate the semantic segmentation network structure can be any of various terminals, such as a mobile phone or computer, or the server 100 shown in FIG. 1.
  • FIG. 2 is a schematic structural diagram of an electronic device 500 for generating a semantic segmentation network structure provided by an embodiment of the present application.
  • the electronic device 500 for generating a semantic segmentation network structure shown in FIG. 2 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530.
  • the various components in the electronic device 500 for generating the semantic segmentation network structure are coupled together through the bus system 540.
  • the bus system 540 is used to implement connection and communication between these components.
  • In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus; however, for the sake of clear description, the various buses are all labeled as the bus system 540 in FIG. 2.
  • The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
  • the user interface 530 includes one or more output devices 531 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens.
  • the user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, a mouse, a microphone, a touch screen display, a camera, and other input buttons and controls.
  • the memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory.
  • the non-volatile memory may be a read only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 550 described in the embodiment of the present application is intended to include any suitable type of memory.
  • the memory 550 optionally includes one or more storage devices that are physically remote from the processor 510.
  • the memory 550 can store data to support various operations. Examples of these data include programs, modules, and data structures, or a subset or superset thereof, as illustrated below.
  • the operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • The network communication module 552 is used to reach other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, wireless fidelity (Wi-Fi), Universal Serial Bus (USB), and the like;
  • The display module 553 is used to enable the presentation of information via one or more output devices 531 (for example, a display screen, speakers, etc.) associated with the user interface 530 (for example, a user interface for operating peripheral devices and displaying content and information);
  • the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
  • The apparatus for generating a semantic segmentation network structure provided by an embodiment of the present application may be implemented by a combination of software and hardware. As an example, it may be implemented by a processor in the form of a hardware decoding processor, which may adopt one or more application-specific integrated circuits (ASIC), DSPs, programmable logic devices (PLD), complex programmable logic devices (CPLD), field-programmable gate arrays (FPGA), or other electronic components.
  • The semantic segmentation network structure generation device provided in the embodiments of the present application can be implemented in software. FIG. 2 shows the semantic segmentation network structure generation device 555 stored in the memory 550, which can be software in the form of a program, a plug-in, or the like, and includes a series of modules: an adding module 5551, a removing module 5552, a fusion module 5553, a training module 5554, and a merging module 5555; these modules are used to implement the method for generating the semantic segmentation network structure provided in the embodiments of the present application.
  • the method for generating the semantic segmentation network structure provided by the embodiments of the present application can be implemented by various types of electronic devices for generating the semantic segmentation network structure, such as smart terminals and servers.
  • FIG. 3 is a schematic flowchart of a method for generating a semantic segmentation network structure provided by an embodiment of the present application, and is described in conjunction with the steps shown in FIG. 3.
  • The super unit is composed of units of the same stage (resolution). For example, if the resolution of the (k-1)-th unit and the k-th unit is 128*128, then the (k-1)-th unit and the k-th unit form a super unit.
  • the aggregation unit is used for feature fusion to adaptively fuse features at different scales.
  • step 101 corresponding architecture parameters are generated for each unit constituting the super unit in the semantic segmentation network structure.
  • In some embodiments, the user enters the initial semantic segmentation network structure and image samples in the client (running on the terminal); the terminal automatically forms a generation request for the semantic segmentation network structure (carrying the initial semantic segmentation network structure) and sends it to the server; the server receives the generation request and extracts the semantic segmentation network structure. Then, in order to later remove redundant units from the super units, corresponding architecture parameters can first be added for each unit constituting a super unit in the semantic segmentation network structure.
  • FIG. 4 is an optional flowchart provided by an embodiment of the present application. In some embodiments, FIG. 4 shows that before step 101 in FIG. 3, step 106 is further included.
  • step 106 the units of the same resolution in the semantic segmentation network structure are merged into super units; the structure of the aggregation unit is determined according to the number of super units.
  • The super unit is composed of units of the same resolution in the semantic segmentation network structure. After the units are merged into super units, the number of super units is determined, and the structure of the aggregation unit in the semantic segmentation network structure is determined according to the number of super units, so that the aggregation unit can subsequently perform down-sampling operations.
  • In some embodiments, determining the structure of the aggregation unit according to the number of super units includes: determining the number N of super units; and determining the number of down-sampling units corresponding to the i-th super unit in the aggregation unit as N-i, where N is a positive integer greater than or equal to 2, i is a positive integer, and i is less than or equal to N.
  • The numbers of down-sampling units corresponding to the N super units in the aggregation unit are determined in turn. For example, when the number of super units is 3, the number of down-sampling units corresponding to the first super unit in the aggregation unit is 2, the number corresponding to the second super unit is 1, and the number corresponding to the third super unit is 0; when the number of super units is 4, the numbers of down-sampling units corresponding to the first, second, third, and fourth super units are 3, 2, 1, and 0, respectively.
  • In general, the number of super units is 3 or 4; the sketch below illustrates the counting rule.
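  • A small sketch of this counting rule, inferred from the worked example above:

```python
# With N super units, the i-th super unit (1-indexed) is given N - i
# down-sampling units inside the aggregation unit, so that every super
# unit's output reaches the same (lowest) resolution.
def downsample_counts(num_super_units: int) -> list:
    return [num_super_units - i for i in range(1, num_super_units + 1)]

print(downsample_counts(3))  # [2, 1, 0]
print(downsample_counts(4))  # [3, 2, 1, 0]
```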
  • In some embodiments, adding corresponding architecture parameters for each unit constituting a super unit in the semantic segmentation network structure includes: determining the number M of units constituting the super unit in the semantic segmentation network structure; and generating, for each unit constituting the super unit, a corresponding architecture parameter with a value of 1/M.
  • The initial architecture parameters are determined according to the number of units in the super unit. Therefore, after the number M of units constituting the super unit is determined, a corresponding architecture parameter with a value of 1/M can be generated for each unit constituting the super unit. For example, if the number of units constituting the super unit is determined to be 10, then the corresponding initial architecture parameter of each unit is 0.1, as the snippet below shows.
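  • A one-line initialization sketch, assuming PyTorch; the tensor name alpha is an illustrative assumption:

```python
import torch

M = 10  # number of units in the super unit
# one architecture parameter per unit, each initialized to 1/M
alpha = torch.full((M,), 1.0 / M, requires_grad=True)
print(alpha)  # tensor of ten 0.1000 entries, requires_grad=True
```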
  • step 102 the semantic segmentation network structure is optimized based on the image samples, and redundant units in the super-unit to which the target unit belongs are removed to obtain an improved semantic segmentation network structure.
  • the target unit is the unit with the largest architecture parameter among the units.
  • In order to obtain the improved semantic segmentation network structure, the semantic segmentation network structure needs to be optimized based on the image samples, and the redundant units in the super unit to which the target unit corresponding to the maximum architecture parameter belongs need to be removed, so as to dynamically adjust the semantic segmentation network structure and reduce its depth.
  • In some embodiments, optimizing the semantic segmentation network structure based on the image samples includes: performing joint training, based on the image samples, on the semantic segmentation network structure's own parameters, the operating parameters of each unit, and the architecture parameters; determining the largest architecture parameter obtained by the training; and determining the unit corresponding to the largest architecture parameter as the target unit.
  • In order to determine the maximum value of the architecture parameters, rough training of the semantic segmentation network structure based on the image samples is required: the rough training process jointly trains, based on the image samples, the network structure's own parameters, the operating parameters of each unit, and the architecture parameters; after the largest architecture parameter obtained by the training is determined, the unit corresponding to it is determined as the target unit. The operating parameters of a unit can be operations such as pooling, convolution, and identity mapping. A sketch of one joint training step follows.
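  • A hedged sketch of one joint ("rough") training step, assuming PyTorch; model, alphas, and the single shared optimizer are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def rough_training_step(model, alphas, optimizer, images, labels):
    optimizer.zero_grad()
    logits = model(images, alphas)          # forward pass uses the architecture parameters
    loss = F.cross_entropy(logits, labels)  # per-pixel segmentation loss
    loss.backward()                         # gradients flow to the weights AND to alphas
    optimizer.step()                        # joint update of all parameter groups
    return loss.item()

# One optimizer over both groups realizes the joint training described above:
# optimizer = torch.optim.SGD(list(model.parameters()) + [alphas], lr=0.01)
```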
  • In some embodiments, determining the largest architecture parameter obtained by the training and determining the corresponding unit as the target unit includes: determining the unit whose trained architecture parameter is 1 as the target unit. After the joint training, the unit whose architecture parameter is 1 is determined as the target unit, so that the redundant units in the super unit can subsequently be removed. For example, if a super unit contains 10 units, the initial architecture parameter of each unit is 0.1; if the architecture parameter of the 4th unit becomes 1 after training, the 4th unit is the target unit.
  • In some embodiments, removing the redundant units in the super unit to which the target unit corresponding to the maximum architecture parameter belongs, to obtain the improved semantic segmentation network structure, includes: determining the order j of the target unit in the super unit to which it belongs, and removing the units ranked after j in that super unit; and constructing the improved semantic segmentation network structure from the super units with redundant units removed and the aggregation unit.
  • After the target unit is determined, its order in the super unit to which it belongs can be determined and the units ranked after it removed as redundant, so that the improved semantic segmentation network structure, built from the pruned super units and the aggregation unit, contains no redundant units; this dynamically adjusts the semantic segmentation network structure and reduces its depth. For example, if the first super unit contains 10 units and the architecture parameter of the 4th unit becomes 1, the redundant units after the 4th unit in the first super unit are removed, i.e., units 5 through 10 are removed, and the first super unit then contains only units 1 to 4. A pruning sketch follows.
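  • An illustrative pruning sketch (the function and variable names are assumptions):

```python
import torch

def prune_super_unit(units: list, alpha: torch.Tensor) -> list:
    j = int(torch.argmax(alpha))  # 0-based index of the target unit
    return units[: j + 1]         # units ranked after the target are redundant

units = [f"unit{k}" for k in range(1, 11)]    # a super unit with 10 units
alpha = torch.zeros(10)
alpha[3] = 1.0                                # the 4th unit's parameter became 1
print(prune_super_unit(units, alpha))         # ['unit1', 'unit2', 'unit3', 'unit4']
```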
  • In some embodiments, the output of a super unit during rough training is the weighted sum of the outputs of the units in the super unit, weighted by the architecture parameters of the units. Let f(x_k) denote the output feature of the k-th unit in the super unit, α_k the architecture parameter of the k-th unit, and n the number of units; then the output of the super unit is α_1·f(x_1) + α_2·f(x_2) + ... + α_n·f(x_n), i.e., the sum over k of α_k·f(x_k).
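  • The weighted sum written as a short sketch (a plain sum, as the formula above states; the unit outputs are assumed to share one shape):

```python
import torch

def super_unit_output(unit_outputs: list, alpha: torch.Tensor) -> torch.Tensor:
    # sum over k of alpha_k * f(x_k)
    return sum(a * f for a, f in zip(alpha, unit_outputs))
```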
  • step 103 through the aggregation unit in the improved semantic segmentation network structure, feature fusion is performed on the output of the super unit from which the redundant unit is removed, to obtain a fused feature map.
  • Through the aggregation unit in the improved semantic segmentation network structure, feature fusion can be performed on the outputs of the super units from which the redundant units have been removed, to obtain the fused feature map; the fused feature map is then used for fine training of the improved semantic segmentation network structure, so as to obtain the trained semantic segmentation network structure for subsequent semantic segmentation of images.
  • FIG. 5 is an optional flowchart provided by an embodiment of the present application.
  • FIG. 5 shows that step 103 in FIG. 3 can be implemented through steps 1031 to 1033 shown in FIG. 5.
  • step 1031 through the down-sampling unit of the super unit in the improved semantic segmentation network structure, the input feature map is down-sampled to obtain the feature map of the corresponding super unit.
  • After rough training is used to determine the maximum architecture parameter and the redundant units in the super unit to which the corresponding target unit belongs are removed, yielding the improved semantic segmentation network structure, the down-sampling unit of each super unit in the improved structure can first down-sample the input feature map to obtain the feature map of the corresponding super unit.
  • The super unit includes a down-sampling unit and normal units. The down-sampling unit has a stride of 2, realizing the down-sampling function, while a normal unit has a stride of 1 and does not down-sample.
  • For example, the input image to the improved semantic segmentation network structure first passes through a layer of down-sampling convolutional neural network and then sequentially passes through three consecutive super units, where the first unit of each super unit is a down-sampling unit and the remaining units are normal units. The input feature map is down-sampled by each super unit's down-sampling unit to obtain the feature map corresponding to that super unit, which is input to the next super unit or the aggregation unit, as the sketch below illustrates.
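  • A sketch of a stride-2 down-sampling unit versus a stride-1 normal unit, assuming PyTorch; the layer choices and channel counts are illustrative:

```python
import torch.nn as nn

def make_unit(in_ch: int, out_ch: int, downsample: bool) -> nn.Module:
    stride = 2 if downsample else 1  # stride 2 halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

# A super unit whose first unit down-samples and whose rest are normal units:
super_unit = nn.Sequential(
    make_unit(32, 64, downsample=True),    # down-sampling unit, stride 2
    make_unit(64, 64, downsample=False),   # normal unit, stride 1
    make_unit(64, 64, downsample=False))   # normal unit, stride 1
```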
  • In some embodiments, down-sampling the input feature map through the down-sampling unit of a super unit in the improved semantic segmentation network structure to obtain the feature map of the corresponding super unit includes: determining the i-th super unit, with redundant units removed, in the improved semantic segmentation network structure; and down-sampling the input feature map through the i-th super unit to obtain the feature map corresponding to the i-th super unit.
  • That is, to down-sample the input feature map, the i-th super unit with redundant units removed is first determined, and the input feature map is then down-sampled through it to obtain the feature map corresponding to the i-th super unit, which is input to the next super unit or the aggregation unit. For example, when the first super unit with redundant units removed is determined, the input feature map is down-sampled through it to obtain the feature map corresponding to the first super unit, which is input to the second super unit and the aggregation unit; when the second super unit with redundant units removed is determined, the feature map corresponding to the first super unit is down-sampled through it to obtain the feature map corresponding to the second super unit, which is input to the third super unit and the aggregation unit; when the third super unit with redundant units removed is determined, the feature map corresponding to the second super unit is down-sampled through it to obtain the feature map corresponding to the third super unit, which is input to the aggregation unit.
  • step 1032 the down-sampling unit in the aggregation unit sequentially performs down-sampling processing on the feature maps output by the super unit from which the redundant unit is removed, to obtain multiple feature maps of the same resolution corresponding to the super unit.
  • After the feature maps of the super units are obtained, the down-sampling units in the aggregation unit can down-sample, in turn, the feature maps output by the super units from which the redundant units have been removed, to obtain feature maps of the same resolution corresponding to the super units, so that these same-resolution feature maps can subsequently be fused.
  • In some embodiments, sequentially down-sampling the feature maps output by the super units from which redundant units have been removed, through the down-sampling units in the aggregation unit, to obtain multiple feature maps of the same resolution includes: performing, through the N-i down-sampling units in the aggregation unit, N-i down-sampling operations on the output of the i-th super unit to obtain the down-sampled feature map corresponding to the i-th super unit, where the resolutions of the down-sampled feature maps of the N super units are the same.
  • After the structure of the aggregation unit is determined according to the number of super units, the down-sampling units in the aggregation unit can be used to down-sample the super-unit outputs to obtain the corresponding down-sampled feature maps; that is, the output of the i-th super unit undergoes N-i down-sampling operations through the N-i corresponding down-sampling units in the aggregation unit. For example, with three super units, the feature map output by the first super unit is down-sampled twice through two down-sampling units in the aggregation unit to obtain the down-sampled feature map corresponding to the first super unit; the feature map output by the second super unit is down-sampled once through one down-sampling unit in the aggregation unit to obtain the down-sampled feature map corresponding to the second super unit; the feature map output by the third super unit is not down-sampled, and operations other than down-sampling are performed on it through a normal unit in the aggregation unit to obtain the normal feature map corresponding to the third super unit. The down-sampled feature maps corresponding to the first and second super units and the normal feature map corresponding to the third super unit can then be input to a normal unit in the aggregation unit for further non-down-sampling operations before the subsequent fusion processing.
  • step 1033 fusion processing is performed on multiple feature maps of the same resolution to obtain a fused feature map.
  • the feature maps of the same resolution can be fused to obtain the fused feature maps for subsequent semantic segmentation processing.
  • The fusion processing can be a concatenation operation; that is, the down-sampled feature maps of the corresponding super units are concatenated in turn to obtain the fused feature map, as sketched below.
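  • An aggregation-unit sketch, assuming PyTorch: each super unit's output is brought to a common resolution with N - i stride-2 down-sampling steps and the results are fused by channel-wise concatenation; channel counts and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AggregationUnit(nn.Module):
    def __init__(self, channels: list):
        super().__init__()
        n = len(channels)                  # number of super units, e.g. 3
        self.paths = nn.ModuleList()
        for i, ch in enumerate(channels, start=1):
            steps = [nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1),
                                   nn.ReLU(inplace=True))
                     for _ in range(n - i)]           # N - i down-sampling steps
            self.paths.append(nn.Sequential(*steps))  # empty (identity) for the last

    def forward(self, features: list) -> torch.Tensor:
        same_res = [path(f) for path, f in zip(self.paths, features)]
        return torch.cat(same_res, dim=1)  # fusion by concatenation

# Usage with three super-unit outputs at successively halved resolutions:
f1 = torch.randn(1, 64, 64, 64)
f2 = torch.randn(1, 128, 32, 32)
f3 = torch.randn(1, 256, 16, 16)
fused = AggregationUnit([64, 128, 256])([f1, f2, f3])
print(fused.shape)  # torch.Size([1, 448, 16, 16])
```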
  • step 104 a recognition process is performed on the fused feature map, and the position corresponding to the object existing in the image sample is determined.
  • After the fused feature map is obtained, recognition processing can be performed on it to determine the objects existing in the image samples and their corresponding positions, so that the improved semantic segmentation network structure can subsequently be trained based on the objects existing in the image samples and their corresponding positions.
  • FIG. 6 is an optional flowchart provided by an embodiment of the present application.
  • FIG. 6 shows that step 104 in FIG. 3 can be implemented through steps 1041 to 1042 shown in FIG. 6.
  • step 1041 feature mapping is performed on the fused feature map to obtain a mapped feature map of the corresponding image sample.
  • Since the fused feature map is a low-resolution feature map, feature mapping needs to be performed on it: the low-resolution feature map is mapped to a pixel-level feature map, and a feature-dense feature map can be generated through up-sampling. Therefore, feature mapping can be performed on the fused feature map to obtain the mapped feature map of the corresponding image sample for subsequent recognition processing, as sketched below.
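  • An up-sampling sketch, assuming PyTorch: the fused low-resolution features are mapped to per-pixel class scores; the 1x1 classifier, bilinear mode, and class count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def to_pixel_level(fused: torch.Tensor, num_classes: int,
                   out_size: tuple) -> torch.Tensor:
    classifier = torch.nn.Conv2d(fused.shape[1], num_classes, kernel_size=1)
    logits = classifier(fused)                   # low-resolution class scores
    return F.interpolate(logits, size=out_size,  # up-sample to pixel level
                         mode="bilinear", align_corners=False)

pixel_logits = to_pixel_level(torch.randn(1, 448, 16, 16), 19, (128, 128))
print(pixel_logits.shape)  # torch.Size([1, 19, 128, 128])
```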
  • In step 1042, recognition processing is performed on the mapped feature map corresponding to the image sample, and the positions corresponding to the objects existing in the image sample are determined.
  • After the mapped feature map is obtained, it is recognized according to the semantic segmentation method to determine the objects existing in the image sample and their corresponding positions, so that the improved semantic segmentation network structure can subsequently be trained based on them.
  • In step 105, the improved semantic segmentation network structure is trained based on the positions corresponding to the objects existing in the image sample and the annotations corresponding to the image sample, to obtain the trained semantic segmentation network structure.
  • The annotations corresponding to the image samples can be obtained; they are the objects existing in the image samples, and their corresponding positions, manually annotated by the user in advance.
  • Based on the objects existing in the image samples and their corresponding positions, together with the annotations corresponding to the image samples, the improved semantic segmentation network structure is iteratively trained to generate the trained semantic segmentation network structure, so that other images can subsequently be semantically segmented through it.
  • In some embodiments, training the improved semantic segmentation network structure based on the objects existing in the image samples and their corresponding positions, and the annotations corresponding to the image samples, includes: constructing a loss function of the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and the annotations corresponding to the image samples; and updating the improved semantic segmentation network structure's own parameters until the loss function converges.
  • After the server constructs the value of the loss function of the improved semantic segmentation network structure based on the positions corresponding to the objects in the image samples and the annotations corresponding to the image samples, it can determine whether the value of the loss function exceeds a preset threshold.
  • When it does, the error signal of the improved semantic segmentation network structure is determined based on the loss function, the error information is back-propagated through the improved semantic segmentation network structure, and the model parameters of each layer are updated during propagation.
  • To explain back-propagation: the training sample data is input to the input layer of the neural network model, passes through the hidden layers, and finally reaches the output layer, where the result is output; this is the forward propagation process of the neural network model.
  • Since there is an error between the output result and the actual result, the error between the output result and the actual value is calculated and propagated back from the output layer toward the input layer through the hidden layers.
  • During back-propagation, the values of the model parameters are adjusted according to the error, and the above process is iterated until convergence; the semantic segmentation network structure belongs to this class of neural network models. A minimal training loop is sketched below.
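The following sketch assumes PyTorch; the pixel-wise cross-entropy loss, SGD optimizer, and the simple convergence test against a preset threshold are illustrative choices, not the document's prescribed training recipe:

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, threshold=1e-3, max_epochs=100, lr=1e-2):
    """Update the network's own parameters until the loss converges."""
    criterion = nn.CrossEntropyLoss()                 # error vs. the annotations
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(max_epochs):
        total = 0.0
        for images, labels in loader:                 # labels: (B, H, W) class ids
            logits = model(images)                    # forward propagation
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                           # back-propagate the error signal
            optimizer.step()                          # adjust each layer's parameters
            total += loss.item()
        if total / len(loader) < threshold:           # loss considered converged
            break
    return model
```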
  • The adding module 5551 is configured to generate corresponding architecture parameters for the units composing the super-units in the semantic segmentation network structure. The removing module 5552 is configured to optimize the semantic segmentation network structure based on image samples, and to remove the redundant units in the super-unit to which the target unit corresponding to the maximum architecture parameter belongs, to obtain an improved semantic segmentation network structure. The fusion module 5553 is configured to perform feature fusion, through the aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed, to obtain a fused feature map.
  • The training module 5554 is configured to perform recognition processing on the fused feature map and determine the objects existing in the image samples and their corresponding positions, and to train the improved semantic segmentation network structure based on those positions and the annotations corresponding to the image samples, to obtain the trained semantic segmentation network structure.
  • In some embodiments, the generating apparatus 555 of the semantic segmentation network structure further includes: a merging module 5555 configured to merge units of the same resolution in the semantic segmentation network structure into the super-units, and to determine the structure of the aggregation unit according to the number of super-units.
  • The merging module 5555 is further configured to determine the number N of super-units, and to determine the number of down-sampling units in the aggregation unit corresponding to the i-th super-unit as N−i, where N is a positive integer greater than or equal to 2, i is a positive integer, and i is less than or equal to N.
  • The adding module 5551 is further configured to determine the number M of units composing the super-unit in the semantic segmentation network structure, and to generate, for each unit composing the super-unit, a corresponding architecture parameter with a value of 1/M (a one-line sketch follows).
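For example, the uniform 1/M initialization can be written as follows (M = 10 purely as a sample value):

```python
import torch

M = 10                                                # units composing the super-unit
beta = torch.full((M,), 1.0 / M, requires_grad=True)  # each unit starts at 1/M = 0.1
```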
  • The removing module 5552 is further configured to jointly train, based on the image samples, the semantic segmentation network structure's own parameters, the operation parameters of the units, and the architecture parameters; to determine the maximum architecture parameter obtained by training; and to determine the unit corresponding to the maximum architecture parameter as the target unit.
  • The removing module 5552 is further configured to determine the unit whose architecture parameter obtained by training is 1 as the target unit.
  • The removing module 5552 is further configured to determine the order j of the target unit in the super-unit to which it belongs, to remove the redundant units after order j in that super-unit, and to construct the improved semantic segmentation network structure from the super-units with redundant units removed and the aggregation unit (see the pruning sketch below).
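A sketch of this pruning step, assuming PyTorch and treating the super-unit as an ordered list of unit modules (the convolutional stand-in units are illustrative):

```python
import torch
import torch.nn as nn

def prune_supercell(units, beta):
    """Keep units up to and including the target unit (the one with the
    largest architecture parameter, order j); units after order j are removed."""
    j = int(torch.argmax(beta))
    return nn.Sequential(*list(units)[: j + 1])

units = nn.ModuleList(nn.Conv2d(64, 64, 3, padding=1) for _ in range(6))
beta = torch.tensor([0.05, 0.10, 0.10, 0.60, 0.10, 0.05])
print(len(prune_supercell(units, beta)))  # 4 -> the redundant units are dropped
```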
  • The fusion module 5553 is further configured to: perform down-sampling processing on the input feature map through the down-sampling unit of a super-unit in the improved semantic segmentation network structure, to obtain the feature map corresponding to that super-unit; sequentially perform down-sampling processing, through the down-sampling units in the aggregation unit, on the feature maps output by the super-units from which the redundant units have been removed, to obtain multiple feature maps of the same resolution corresponding to the super-units; and fuse the multiple feature maps of the same resolution to obtain the fused feature map.
  • The fusion module 5553 is further configured to: determine the i-th super-unit from which the redundant units have been removed in the improved semantic segmentation network structure; perform down-sampling processing on the input feature map through the i-th super-unit, to obtain the feature map corresponding to the i-th super-unit; and perform N−i down-sampling operations on the i-th super-unit through the N−i down-sampling units in the aggregation unit, to obtain the down-sampled feature map corresponding to the i-th super-unit, where the down-sampled feature maps of the N super-units have the same resolution.
  • The training module 5554 is further configured to: perform feature mapping on the fused feature map to obtain the mapped feature map corresponding to the image sample; perform recognition processing on that mapped feature map to determine the objects existing in the image sample and their corresponding positions; construct the loss function of the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image sample and the annotations corresponding to the image sample; and update the improved semantic segmentation network structure's own parameters until the loss function converges.
  • FIG. 7 is a schematic diagram of an application scenario of the image semantic segmentation system 20 provided by an embodiment of this application. The terminal 200 is connected to the server 100 through a network 300, which may be a wide area network, a local area network, or a combination of the two.
  • In some embodiments, the terminal 200 locally executes the image semantic segmentation method provided by the embodiments of this application to process the image to be semantically segmented according to the user input and obtain the objects existing in that image and their corresponding positions. For example, a semantic segmentation assistant is installed on the terminal 200; the user inputs the image to be semantically segmented into the assistant, and the terminal 200 obtains the objects existing in the image and their corresponding positions from the input image and displays them on the display interface 210 of the terminal 200.
  • The terminal 200 may also send the image to be semantically segmented, input by the user on the terminal 200, to the server 100 through the network 300, and call the image semantic segmentation function provided by the server 100.
  • The server 100 obtains, through the image semantic segmentation method provided by the embodiments of this application, the objects existing in the image to be semantically segmented and their corresponding positions. For example, a semantic segmentation assistant is installed on the terminal 200; the user inputs the image to be semantically segmented into the assistant, and the terminal sends it to the server 100 through the network 300.
  • After receiving the image to be semantically segmented, the server 100 performs recognition processing on it to obtain the objects existing in the image and their corresponding positions, and returns them to the semantic segmentation assistant, which displays them on the display interface 210 of the terminal 200; alternatively, the server 100 directly outputs the objects existing in the image to be semantically segmented and their corresponding positions.
  • Referring to FIG. 8, which is a schematic structural diagram of an electronic device 600 for image semantic segmentation provided by an embodiment of this application, the electronic device 600 shown in FIG. 8 includes: at least one processor 610, a memory 650, at least one network interface 620, and a user interface 630.
  • The functions of the processor 610, the memory 650, the at least one network interface 620, and the user interface 630 are respectively similar to those of the processor 510, the memory 550, the at least one network interface 520, and the user interface 530; the functions of the output device 631 and the input device 632 are similar to those of the output device 531 and the input device 532; and the functions of the operating system 651, the network communication module 652, the display module 653, and the input processing module 654 are respectively similar to those of the operating system 551, the network communication module 552, the display module 553, and the input processing module 554, and are not repeated here.
  • In other embodiments, the image semantic segmentation apparatus provided by the embodiments of this application can be implemented in software. FIG. 8 shows the image semantic segmentation apparatus 655 stored in the memory 650, which can be software in the form of programs, plug-ins, and the like, and includes a series of modules: a determining module 6551 and a processing module 6552, which are used to implement the image semantic segmentation method provided by the embodiments of this application.
  • As can be understood from the above, the image semantic segmentation method provided by the embodiments of this application can be implemented by various types of electronic devices for image semantic segmentation, such as smart terminals and servers.
  • FIG. 9 is a schematic flowchart of the image semantic segmentation method provided by an embodiment of this application, described with reference to the steps shown in FIG. 9.
  • In step 201, the image to be semantically segmented is determined.
  • For example, after the user inputs the image to be semantically segmented on the terminal, the terminal can send it to the server via the network; after receiving it, the server can determine it as the image to be semantically segmented and proceed with the segmentation.
  • In step 202, the image to be semantically segmented is recognized through the trained semantic segmentation network structure; the objects existing in the image and their corresponding positions are determined, and the objects and their corresponding positions are labeled by a preset labeling method.
  • Specifically, the image to be segmented is down-sampled through the super-units in the trained semantic segmentation network structure, the outputs of the super-units are feature-fused through the aggregation unit to obtain the fused feature map, and recognition processing is then performed on the fused feature map to determine the objects existing in the image to be semantically segmented and their corresponding positions; the objects and positions are marked by a preset labeling method so that the user can view the semantically segmented image.
  • The preset labeling method can label different objects with different colors, label the objects existing in the image to be semantically segmented with bounding boxes, or draw dashed boxes along the edges of the objects.
  • The preset labeling method of the embodiments of this application is not limited to the above methods (a minimal color-labeling sketch follows).
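As one illustrative sketch of color-based labeling (NumPy assumed; the palette values are hypothetical):

```python
import numpy as np

PALETTE = {0: (0, 0, 0), 1: (128, 64, 128), 2: (220, 20, 60)}  # hypothetical colors

def colorize(mask):
    """Label each object class in the predicted mask with its own color."""
    out = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for cls, color in PALETTE.items():
        out[mask == cls] = color
    return out

mask = np.random.randint(0, 3, size=(4, 4))  # per-pixel class ids from the network
print(colorize(mask).shape)                  # (4, 4, 3) RGB annotation image
```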
  • In some embodiments, performing feature fusion on the outputs of the super-units through the aggregation unit to obtain the fused feature map includes: performing down-sampling processing on the input feature map through the down-sampling unit of a super-unit, to obtain the feature map corresponding to that super-unit; sequentially performing down-sampling processing, through the down-sampling units in the aggregation unit, on the feature maps output by the super-units, to obtain multiple feature maps of the same resolution corresponding to the super-units; and fusing the multiple feature maps of the same resolution to obtain the fused feature map.
  • In some embodiments, performing down-sampling processing on the input feature map through the down-sampling unit of the super-unit, to obtain the feature map corresponding to the super-unit, includes: determining the i-th super-unit, and performing down-sampling processing on the input feature map through the i-th super-unit to obtain the feature map corresponding to the i-th super-unit.
  • Sequentially down-sampling the feature maps output by the super-units to obtain feature maps of the same resolution corresponding to the super-units includes: performing N−i down-sampling operations on the i-th super-unit through the N−i down-sampling units in the aggregation unit, to obtain the down-sampled feature map corresponding to the i-th super-unit, where the down-sampled feature maps corresponding to the N super-units have the same resolution.
  • In some embodiments, performing recognition processing on the fused feature map to determine the objects existing in the image to be semantically segmented and their corresponding positions includes: performing feature mapping on the fused feature map to obtain the mapped feature map corresponding to the image to be semantically segmented; and performing recognition processing on that mapped feature map to determine the objects existing in the image to be semantically segmented and their corresponding positions.
  • The determining module 6551 is configured to determine the image to be semantically segmented.
  • The processing module 6552 is configured to recognize the image to be semantically segmented through the trained semantic segmentation network structure, determine the objects existing in the image and their corresponding positions, and label the objects and their corresponding positions by a preset labeling method.
  • An embodiment of this application also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the method for generating a semantic segmentation network structure provided by the embodiments of this application, for example, the method shown in FIGS. 3-6, or the image semantic segmentation method provided by the embodiments of this application, for example, the method shown in FIG. 9.
  • In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or various devices including one or any combination of the foregoing memories.
  • In some embodiments, the executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as an independent program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • As an example, the executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines, or code parts).
  • As an example, the executable instructions can be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • To address the problems arising from hand-designed network structures and neural architecture search methods, an embodiment of this application proposes a method for generating a semantic segmentation network structure, which builds multiple super-units from the original units according to different down-sampling stages and introduces super-unit architecture parameters, adaptively adjusting the number of units at each stage based on those parameters; at the same time, it builds aggregation units for aggregating contextual features in image segmentation, to better fuse features across different scales.
  • This method can generate a real-time, high frames-per-second (FPS, Frames Per Second) and high-performance semantic segmentation network for semantic segmentation, for use in real-time fields such as autonomous driving and mobile phones.
  • In the semantic segmentation problem, the down-sampling strategy (down-sampling at suitable positions) is critical, and this process is modeled as unit-level pruning. As shown in FIG. 10, which is a schematic diagram of the super-unit structure provided by an embodiment of this application, the original unit structure is divided into multiple super-units according to the different down-sampling stages (resolutions); unit structures of the same resolution belong to the same kind of super-unit, and architecture parameters are introduced between units.
  • In addition, for semantic segmentation, fusing features of different scales is also very important. For high-resolution spatial information (the output of earlier super-units) and low-resolution semantic information (the output of later super-units), establishing the aggregation unit makes it possible to effectively and adaptively fuse features and improve the performance of the real-time segmentation network.
  • The implementation of this embodiment is divided into three stages: 1) unit-level pruning; 2) aggregation unit search; and 3) network pre-training and retraining.
  • In stage 1, units of the same stage (resolution) are merged into a super-unit, unit-level architecture parameters β are introduced, and the outputs of the units in the super-unit are fused, so that the output of the whole super-cell is expressed as a combination of the unit outputs, computed as in formula (1):
  • Output_{super_cell} = Σ_{k=1}^{n} β_k · f(x_k)    (1)
  • where f(x_k) represents the output features of the k-th unit in the super-cell, β_k represents the architecture parameter of the k-th unit, n represents the number of units, and Output_{super_cell} represents the output of the super-cell (see the sketch below).
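Read as code, formula (1) says a super-cell's output is the β-weighted sum of its chained units' outputs; a minimal PyTorch sketch follows, in which the convolutional stand-in units are illustrative:

```python
import torch
import torch.nn as nn

class SuperCell(nn.Module):
    """Output_{super_cell} = sum_k beta_k * f(x_k), with units applied in sequence."""
    def __init__(self, n_units, channels):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(n_units))
        self.beta = nn.Parameter(torch.full((n_units,), 1.0 / n_units))

    def forward(self, x):
        out = 0.0
        for k, unit in enumerate(self.units):
            x = unit(x)                    # f(x_k): output feature of the k-th unit
            out = out + self.beta[k] * x   # weighted by the architecture parameter
        return out

cell = SuperCell(n_units=4, channels=32)
print(cell(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```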
  • The architecture parameters β_k are sampled via Gumbel Softmax, which is optimized toward a one-hot encoding during training. By introducing the unit-level architecture parameters β_k, the network's own parameters and the original architecture parameters λ inside the units (selecting any operation from the candidate operation set) are jointly trained together with β_k, and these three kinds of parameters are updated in the same round of back-propagation (a sampling sketch follows).
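A sketch of the sampling step using PyTorch's built-in gumbel_softmax; the logits, temperature, and four-unit size are illustrative, and hard=True yields the one-hot form with straight-through gradients:

```python
import torch
import torch.nn.functional as F

beta_logits = torch.zeros(4, requires_grad=True)           # unit-level logits, 4 units
soft = F.gumbel_softmax(beta_logits, tau=1.0, hard=False)  # relaxed, differentiable
hard = F.gumbel_softmax(beta_logits, tau=1.0, hard=True)   # one-hot in the forward pass
print(soft, hard, sep="\n")
```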
  • After the final optimization is complete, as shown in FIG. 10, if the maximum architecture parameter is the one shown by the dashed line β_3, then the units behind it (units 2-4) will be discarded, thereby dynamically adjusting the depth of the network.
  • As shown in FIG. 11, which is a schematic diagram of the unit structure provided by an embodiment of this application, a unit is composed of at least one node in the neural network; in this embodiment, a unit may be composed of two nodes (intermediate node 1 and intermediate node 2).
  • For example, the output results of the (k-1)-th unit and the (k-2)-th unit are input to intermediate node 1 in the k-th unit; after its operation, intermediate node 1 inputs its output result to intermediate node 2; after its operation, the output result of intermediate node 2 is input to the (k+1)-th unit and the (k+2)-th unit.
  • The solid lines represent any operation in the candidate operation set, and the dashed lines represent outputs.
  • FIG. 12 is a schematic diagram of the semantic segmentation network structure provided by an embodiment of this application. The input is an image, which first passes through one layer of convolutional neural network (CNN) with a down-sampling factor of 2 and then through three consecutive super-cells; the first cell in each super-cell is a reduction cell, and the remaining cells are normal cells.
  • The entire semantic segmentation network structure down-samples the image by a factor of 16, and the outputs of the last three super-cells are merged and input to the aggregation unit.
  • The candidate operation set of the down-sampling unit (Reduction Cell) and the normal unit (Normal Cell) can be composed of an average pooling layer, a max pooling layer, a 1x1 convolutional layer, identity mapping, a 3x3 convolutional layer, 3x3 dilated convolution, 5x5 dilated convolution, 3x3 group convolution, and so on; that is, the down-sampling unit and the normal unit can each be composed of any one operation in the candidate operation set, as in the sketch below.
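A sketch of such a candidate operation set as a PyTorch dictionary; the 3x3 pooling windows, padding values, and group count are illustrative choices the text leaves open:

```python
import torch.nn as nn

def candidate_ops(c, stride):
    """Candidate operations for reduction cells (stride=2) and normal cells (stride=1)."""
    return {
        'avg_pool_3x3':   nn.AvgPool2d(3, stride=stride, padding=1),
        'max_pool_3x3':   nn.MaxPool2d(3, stride=stride, padding=1),
        'conv_1x1':       nn.Conv2d(c, c, 1, stride=stride),
        'identity':       nn.Identity() if stride == 1 else nn.Conv2d(c, c, 1, stride=stride),
        'conv_3x3':       nn.Conv2d(c, c, 3, stride=stride, padding=1),
        'dil_conv_3x3':   nn.Conv2d(c, c, 3, stride=stride, padding=2, dilation=2),
        'dil_conv_5x5':   nn.Conv2d(c, c, 5, stride=stride, padding=4, dilation=2),
        'group_conv_3x3': nn.Conv2d(c, c, 3, stride=stride, padding=1, groups=4),
    }

reduction_ops = candidate_ops(c=64, stride=2)  # any one of these can form a unit
```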
  • In stage 2, this embodiment uses aggregation units to fuse feature maps of different resolutions, fusing the low-level spatial features (the output of earlier super-units, such as super-unit 1) with the deep semantic information (the output of later super-units, such as super-unit 3).
  • The outputs of the three super-units need to be down-sampled 2, 1, and 0 times respectively to reach the same feature map size. As shown in FIG. 13, which is a schematic structural diagram of the aggregation unit provided by an embodiment of this application, the aggregation unit consists of 5 cells in total: two reduction cells corresponding to super-cell 1, one reduction cell corresponding to super-cell 2, and one normal cell corresponding to super-cell 3.
  • The outputs of reduction cell 2, reduction cell 3, and normal cell 1 are input to normal cell 2; that is, after the output features of the three super-cells are processed to the same size, feature splicing is performed to effectively and adaptively fuse the features and improve the performance of the real-time segmentation network.
  • The candidate operation set for the aggregation unit can be composed of an average pooling layer, a max pooling layer, a 1x1 convolutional layer, identity mapping, a 3x3 convolutional layer, 3x3 dilated convolution, 3x3 group convolution, a channel attention mechanism layer, a spatial attention mechanism layer, and so on; that is, a cell in the aggregation unit can be composed of any one operation in the candidate operation set.
  • In stage 3, based on the structures found in the first two stages, a complete neural network structure (the improved semantic segmentation network structure) can be obtained and pre-trained on the ImageNet dataset, which improves the generalization ability of the network structure and better initializes its parameters; the network structure is then retrained on the segmentation dataset to obtain a more efficient semantic segmentation network structure (the trained semantic segmentation network structure).
  • After the trained semantic segmentation network structure is determined through pre-training and retraining, the image to be semantically segmented, as input by the user, can be recognized through the trained structure to determine the objects existing in the image and their corresponding positions, which are then marked by the preset labeling method to obtain the semantically segmented image.
  • In this embodiment, on one hand, the original units can be divided into super-units according to different down-sampling stages, differentiable super-unit-level architecture parameters can be introduced, and the number of units at each stage (super-unit) can be adaptively adjusted through unit-level pruning; on the other hand, for multi-scale feature fusion, the aggregation unit adaptively fuses features of different scales, so that a more efficient semantic segmentation network structure can be generated.
  • In summary, by removing the redundant units in the super-unit to which the target unit corresponding to the maximum architecture parameter belongs, and by performing feature fusion on the outputs of the super-units from which the redundant units have been removed, the embodiments of this application have the following beneficial effects:
  • the redundant computing units in the semantic segmentation network structure are removed, which saves computation in subsequent semantic segmentation, dynamically adjusts the semantic segmentation network structure, and reduces its depth;
  • removing the redundant units also determines the positions of the down-sampling units of the super-units within the semantic segmentation network structure, so that down-sampling is performed at suitable positions; and the aggregation unit performs feature fusion on the outputs of the super-units from which the redundant units have been removed, adaptively fusing the outputs of super-units of different resolutions to improve real-time segmentation performance, which suits various semantic segmentation application scenarios.
  • In the embodiments of this application, the electronic device optimizes the semantic segmentation network structure based on image samples to remove redundant units; performs feature fusion, through the aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed; recognizes the fused feature map to determine the positions corresponding to the objects existing in the image samples; and trains the improved semantic segmentation network structure based on those positions, to generate the trained semantic segmentation network structure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for generating a semantic segmentation network structure, an electronic device, and a computer-readable storage medium. The method includes: generating corresponding architecture parameters for the units composing super-units in a semantic segmentation network structure (101); optimizing the semantic segmentation network structure based on image samples, and removing redundant units in the super-unit to which a target unit belongs, to obtain an improved semantic segmentation network structure (102); performing feature fusion, through an aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed, to obtain a fused feature map (103); performing recognition processing on the fused feature map to determine positions corresponding to objects existing in the image samples (104); and training the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and annotations corresponding to the image samples, to obtain a trained semantic segmentation network structure (105).

Claims (15)

  1. A method for generating a semantic segmentation network structure, the method being executed by an electronic device, the semantic segmentation network structure comprising super-units and an aggregation unit,
    the method comprising:
    generating corresponding architecture parameters for the units composing the super-units in the semantic segmentation network structure;
    optimizing the semantic segmentation network structure based on image samples, and removing redundant units in the super-unit to which a target unit belongs, to obtain an improved semantic segmentation network structure, wherein the target unit is the unit having the largest architecture parameter among the units;
    performing feature fusion, through the aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed, to obtain a fused feature map;
    performing recognition processing on the fused feature map to determine positions corresponding to objects existing in the image samples; and
    training the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and annotations corresponding to the image samples, to obtain a trained semantic segmentation network structure.
  2. The method according to claim 1, wherein before generating the corresponding architecture parameters for the units composing the super-units in the semantic segmentation network structure, the method further comprises:
    merging units of the same resolution in the semantic segmentation network structure into the super-units; and
    determining the structure of the aggregation unit according to the number of the super-units.
  3. The method according to claim 2, wherein determining the structure of the aggregation unit according to the number of the super-units comprises:
    determining the number N of the super-units; and
    determining the number of down-sampling units in the aggregation unit corresponding to the i-th super-unit as N-i;
    wherein N is a positive integer greater than or equal to 2, i is a positive integer, and i is less than or equal to N.
  4. The method according to claim 3, wherein performing feature fusion, through the aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed, to obtain the fused feature map, comprises:
    performing down-sampling processing on an input feature map through a down-sampling unit of a super-unit in the improved semantic segmentation network structure, to obtain a feature map corresponding to the super-unit;
    sequentially performing down-sampling processing, through the down-sampling units in the aggregation unit, on the feature maps output by the super-units from which the redundant units have been removed, to obtain multiple feature maps of the same resolution corresponding to the super-units; and
    fusing the multiple feature maps of the same resolution to obtain the fused feature map.
  5. The method according to claim 4, wherein performing down-sampling processing on the input feature map through the down-sampling unit of the super-unit in the improved semantic segmentation network structure, to obtain the feature map corresponding to the super-unit, comprises:
    determining the i-th super-unit from which the redundant units have been removed in the improved semantic segmentation network structure; and
    performing down-sampling processing on the input feature map through the i-th super-unit, to obtain a feature map corresponding to the i-th super-unit;
    and wherein sequentially performing down-sampling processing, through the down-sampling units in the aggregation unit, on the feature maps output by the super-units from which the redundant units have been removed, to obtain the multiple feature maps of the same resolution corresponding to the super-units, comprises:
    performing N-i down-sampling operations on the i-th super-unit through the N-i down-sampling units in the aggregation unit, to obtain a down-sampled feature map corresponding to the i-th super-unit;
    wherein the down-sampled feature maps of the N super-units have the same resolution.
  6. The method according to claim 1, wherein generating the corresponding architecture parameters for the units composing the super-units in the semantic segmentation network structure comprises:
    determining the number M of units composing the super-unit in the semantic segmentation network structure; and
    generating, for each unit composing the super-unit in the semantic segmentation network structure, a corresponding architecture parameter with a value of 1/M.
  7. The method according to claim 1, wherein optimizing the semantic segmentation network structure based on the image samples comprises:
    jointly training, based on the image samples, the semantic segmentation network structure's own parameters, the operation parameters of the units, and the architecture parameters, determining the maximum architecture parameter obtained by training, and
    determining the unit corresponding to the maximum architecture parameter as the target unit.
  8. The method according to claim 7, wherein determining the maximum architecture parameter obtained by training, and determining the unit corresponding to the maximum architecture parameter as the target unit, comprises:
    determining the unit whose architecture parameter obtained by training is 1 as the target unit.
  9. The method according to claim 1, wherein removing the redundant units in the super-unit to which the target unit belongs, to obtain the improved semantic segmentation network structure, comprises:
    determining the order j of the target unit in the super-unit to which it belongs, and removing the redundant units after order j in the super-unit; and
    constructing the improved semantic segmentation network structure from the super-units after removal of the redundant units and the aggregation unit.
  10. The method according to claim 1, wherein performing recognition processing on the fused feature map to determine the positions corresponding to the objects existing in the image samples comprises:
    performing feature mapping on the fused feature map to obtain a mapped feature map corresponding to the image samples; and
    performing recognition processing on the mapped feature map corresponding to the image samples, to determine the positions corresponding to the objects existing in the image samples;
    and wherein training the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and the annotations corresponding to the image samples comprises:
    constructing a loss function of the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and the annotations corresponding to the image samples; and
    updating the improved semantic segmentation network structure's own parameters until the loss function converges.
  11. An image semantic segmentation method, executed by an electronic device and applied to the trained semantic segmentation network structure according to any one of claims 1 to 10,
    the method comprising:
    determining an image to be semantically segmented;
    performing recognition processing on the image to be semantically segmented through the trained semantic segmentation network structure, determining objects existing in the image to be semantically segmented and positions corresponding to the objects, and
    labeling the objects and the positions corresponding to the objects by a preset labeling method.
  12. An apparatus for generating a semantic segmentation network structure, the apparatus comprising:
    an adding module configured to generate corresponding architecture parameters for the units composing super-units in a semantic segmentation network structure;
    a removing module configured to optimize the semantic segmentation network structure based on image samples, and to remove redundant units in the super-unit to which a target unit belongs, to obtain an improved semantic segmentation network structure, wherein the target unit is the unit having the largest architecture parameter among the units;
    a fusion module configured to perform feature fusion, through an aggregation unit in the improved semantic segmentation network structure, on the outputs of the super-units from which the redundant units have been removed, to obtain a fused feature map; and
    a training module configured to perform recognition processing on the fused feature map to determine positions corresponding to objects existing in the image samples, and
    to train the improved semantic segmentation network structure based on the positions corresponding to the objects existing in the image samples and annotations corresponding to the image samples, to obtain a trained semantic segmentation network structure.
  13. An image semantic segmentation apparatus, the apparatus comprising:
    a determining module configured to determine an image to be semantically segmented; and
    a processing module configured to perform recognition processing on the image to be semantically segmented through a trained semantic segmentation network structure, to determine objects existing in the image to be semantically segmented and positions corresponding to the objects, and
    to label the objects and the positions corresponding to the objects by a preset labeling method.
  14. An electronic device, comprising:
    a memory configured to store executable instructions; and
    a processor configured to implement, when executing the executable instructions stored in the memory, the method for generating a semantic segmentation network structure according to any one of claims 1 to 10.
  15. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method for generating a semantic segmentation network structure according to any one of claims 1 to 10, or the image semantic segmentation method according to claim 11.
PCT/CN2020/114372 2019-11-12 2020-09-10 Semantic segmentation network structure generation method, apparatus, device and storage medium WO2021093435A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/515,180 US20220051056A1 (en) 2019-11-12 2021-10-29 Semantic segmentation network structure generation method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911102046.3A 2019-11-12 2019-11-12 Semantic segmentation network structure generation method, apparatus, device and storage medium
CN201911102046.3 2019-11-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/515,180 Continuation US20220051056A1 (en) 2019-11-12 2021-10-29 Semantic segmentation network structure generation method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021093435A1 true WO2021093435A1 (zh) 2021-05-20

Family

ID=69574819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114372 WO2021093435A1 (zh) 2019-11-12 2020-09-10 Semantic segmentation network structure generation method, apparatus, device and storage medium

Country Status (3)

Country Link
US (1) US20220051056A1 (zh)
CN (1) CN110837811B (zh)
WO (1) WO2021093435A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837811B (zh) * 2019-11-12 2021-01-05 腾讯科技(深圳)有限公司 Semantic segmentation network structure generation method, apparatus, device and storage medium
CN111462126B (zh) * 2020-04-08 2022-10-11 武汉大学 Edge-enhancement-based semantic image segmentation method and system
CN111881768A (zh) * 2020-07-03 2020-11-03 苏州开心盒子软件有限公司 Document layout analysis method
CN111914654B (zh) * 2020-07-03 2024-05-28 苏州开心盒子软件有限公司 Text layout analysis method, apparatus, device and medium
CN112329254A (zh) * 2020-11-13 2021-02-05 的卢技术有限公司 Autonomous driving method for aligning simulated-environment images with real-environment images
CN112488297B (zh) * 2020-12-03 2023-10-13 深圳信息职业技术学院 Neural network pruning method, model generation method and apparatus
CN112906707B (zh) * 2021-05-10 2021-07-09 武汉科技大学 Semantic segmentation method and apparatus for surface defect images, and computer device
CN113343999B (zh) * 2021-06-15 2022-04-08 萱闱(北京)生物科技有限公司 Target boundary recording method and apparatus based on object detection, and computing device
CN114419449B (zh) * 2022-03-28 2022-06-24 成都信息工程大学 Remote sensing image semantic segmentation method with self-attention multi-scale feature fusion
CN115019037A (zh) * 2022-05-12 2022-09-06 北京百度网讯科技有限公司 Object segmentation method, training method and apparatus for the corresponding model, and storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008177B (zh) * 2014-06-09 2017-06-13 华中师范大学 Rule base structure optimization and generation method and system for image semantic annotation
CN107180430A (zh) * 2017-05-16 2017-09-19 华中科技大学 Deep learning network construction method and system suitable for semantic segmentation
JP6833620B2 (ja) * 2017-05-30 2021-02-24 株式会社東芝 Image analysis apparatus, neural network apparatus, learning apparatus, image analysis method, and program
US10095977B1 (en) * 2017-10-04 2018-10-09 StradVision, Inc. Learning method and learning device for improving image segmentation and testing method and testing device using the same
CN108062756B (zh) * 2018-01-29 2020-04-14 重庆理工大学 Image semantic segmentation method based on deep fully convolutional networks and conditional random fields
CN108876793A (zh) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation method, apparatus, system and storage medium
CN108876792B (zh) * 2018-04-13 2020-11-10 北京迈格威科技有限公司 Semantic segmentation method, apparatus, system and storage medium
CN108596184B (zh) * 2018-04-25 2021-01-12 清华大学深圳研究生院 Training method for an image semantic segmentation model, readable storage medium and electronic device
CN109191392B (zh) * 2018-08-09 2021-06-04 复旦大学 Semantic-segmentation-driven image super-resolution reconstruction method
CN109461157B (zh) * 2018-10-19 2021-07-09 苏州大学 Image semantic segmentation method based on multi-level feature fusion and Gaussian conditional random fields
CN109753995B (zh) * 2018-12-14 2021-01-01 中国科学院深圳先进技术研究院 Optimization method for a PointNet++-based 3D point cloud object classification and semantic segmentation network
CN109872374A (zh) * 2019-02-19 2019-06-11 江苏通佑视觉科技有限公司 Optimization method and apparatus for image semantic segmentation, storage medium and terminal
CN110110692A (zh) * 2019-05-17 2019-08-09 南京大学 Real-time image semantic segmentation method based on a lightweight fully convolutional neural network
CN110223298A (zh) * 2019-05-27 2019-09-10 东南大学 Improved semantic segmentation algorithm based on local point cloud structure

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984953A (zh) * 2014-04-23 2014-08-13 浙江工商大学 Semantic segmentation method for street-view images based on multi-feature fusion and Boosting decision forests
WO2017210690A1 (en) * 2016-06-03 2017-12-07 Lu Le Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans
CN107766794A (zh) * 2017-09-22 2018-03-06 天津大学 Image semantic segmentation method with learnable feature fusion coefficients
CN107679502A (zh) * 2017-10-12 2018-02-09 南京行者易智能交通科技有限公司 People counting method based on deep-learning image semantic segmentation
CN109886282A (zh) * 2019-02-26 2019-06-14 腾讯科技(深圳)有限公司 Object detection method and apparatus, computer-readable storage medium and computer device
CN110837811A (zh) * 2019-11-12 2020-02-25 腾讯科技(深圳)有限公司 Semantic segmentation network structure generation method, apparatus, device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642576A (zh) * 2021-08-24 2021-11-12 凌云光技术股份有限公司 Method and apparatus for generating training image sets for object detection and semantic segmentation tasks
CN113642576B (zh) * 2021-08-24 2024-05-24 凌云光技术股份有限公司 Method and apparatus for generating training image sets for object detection and semantic segmentation tasks
CN114693934A (zh) * 2022-04-13 2022-07-01 北京百度网讯科技有限公司 Training method for a semantic segmentation model, video semantic segmentation method and apparatus
CN114693934B (zh) * 2022-04-13 2023-09-01 北京百度网讯科技有限公司 Training method for a semantic segmentation model, video semantic segmentation method and apparatus
CN114677514A (zh) * 2022-04-19 2022-06-28 苑永起 Underwater image semantic segmentation model based on deep learning
CN115331245A (zh) * 2022-10-12 2022-11-11 中南民族大学 Table structure recognition method based on image instance segmentation
CN115331245B (zh) * 2022-10-12 2023-02-03 中南民族大学 Table structure recognition method based on image instance segmentation
CN116645696A (zh) * 2023-05-31 2023-08-25 长春理工大学重庆研究院 Contour-information-guided feature detection method for multimodal pedestrian detection
CN116645696B (zh) * 2023-05-31 2024-02-02 长春理工大学重庆研究院 Contour-information-guided feature detection method for multimodal pedestrian detection

Also Published As

Publication number Publication date
US20220051056A1 (en) 2022-02-17
CN110837811A (zh) 2020-02-25
CN110837811B (zh) 2021-01-05

Similar Documents

Publication Publication Date Title
WO2021093435A1 (zh) Semantic segmentation network structure generation method, apparatus, device and storage medium
US11775574B2 (en) Method and apparatus for visual question answering, computer device and medium
WO2021238281A1 (zh) Neural network training method, image classification system and related device
WO2022105125A1 (zh) Image segmentation method and apparatus, computer device and storage medium
US11983903B2 (en) Processing images using self-attention based neural networks
JP2023541532A (ja) Training method and apparatus for a text detection model, text detection method and apparatus, electronic device, storage medium and computer program
CN111079532A (zh) Video content description method based on a text autoencoder
JP7286013B2 (ja) Video content recognition method, apparatus, program and computer device
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
CN113344206A (zh) Knowledge distillation method, apparatus and device fusing channel and relational feature learning
US20220327816A1 (en) System for training machine learning model which recognizes characters of text images
US20230009547A1 (en) Method and apparatus for detecting object based on video, electronic device and storage medium
CN113869138A (zh) Multi-scale object detection method, apparatus and computer-readable storage medium
WO2023207778A1 (zh) Data restoration method and apparatus, computer and readable storage medium
CN112949477A (zh) Information recognition method and apparatus based on a graph convolutional neural network, and storage medium
CN111461181B (zh) Fine-grained vehicle classification method and apparatus
US20220270341A1 (en) Method and device of inputting annotation of object boundary information
US20230316536A1 (en) Systems and methods for object tracking
WO2022222854A1 (zh) Data processing method and related device
CN117690098B (zh) Multi-label recognition method for open driving scenes based on dynamic graph convolution
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle
CN114332894A (zh) Text detection method and apparatus for images
CN112364933A (zh) Image classification method and apparatus, electronic device and storage medium
US20230027813A1 (en) Object detecting method, electronic device and storage medium
CN113628107A (zh) Face image super-resolution method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20887478

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20887478

Country of ref document: EP

Kind code of ref document: A1