CN111785085B - Visual perception and perception network training method, device, equipment and storage medium - Google Patents

Visual perception and perception network training method, device, equipment and storage medium

Info

Publication number
CN111785085B
CN111785085B
Authority
CN
China
Prior art keywords
image
perception
network
training
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010530027.7A
Other languages
Chinese (zh)
Other versions
CN111785085A (en)
Inventor
周彬
刘宗岱
赵沁平
吴洪宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010530027.7A
Publication of CN111785085A
Priority to US17/199,338
Application granted
Publication of CN111785085B
Legal status: Active

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/16 Anti-collision systems
    • G08G1/168 Driving aids for parking, e.g. acoustic or visual feedback on parking space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/09 Taking automatic action to avoid collision, e.g. braking and steering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G06F18/21343 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis using decorrelation or non-stationarity, e.g. minimising lagged cross-correlations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/653 Three-dimensional objects by matching three-dimensional models, e.g. conformal mapping of Riemann surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/16 Anti-collision systems
    • G08G1/167 Driving aids for lane monitoring, lane changing, e.g. blind spot detection
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2420/00 Indexing codes relating to the type of sensors based on the principle of their operation
    • B60W2420/40 Photo or light sensitive means, e.g. infrared sensors
    • B60W2420/403 Image sensing, e.g. optical camera
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30248 Vehicle exterior or interior
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06T2207/30261 Obstacle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2004 Aligning objects, relative positioning of parts

Abstract

The application provides a visual perception method and a perception network training method, together with corresponding apparatus, device and storage medium. The visual perception method identifies an image to be perceived with a perception network to determine a perception target and the pose state of the perception target, and determines a control instruction from a preset control algorithm and the pose state. Pose perception of the moving parts of the perception target is thereby realized, the perception granularity is refined, and the analysis and understanding of the local motion of objects is deepened. The perception network training method obtains image data and model data, generates an edited image from a two-dimensional image and a three-dimensional model with a preset editing algorithm, and finally trains the perception network to be trained with the edited image and its annotations to determine the perception network, achieving the technical effect of generating training images that are more realistic and have a smaller domain difference.

Description

Visual perception and perception network training method, device, equipment and storage medium
Technical Field
The present application relates to the field of target detection, and in particular, to a method, an apparatus, a device, and a storage medium for visual perception and perception network training.
Background
Object detection is a fundamental problem in the field of computer vision; in uncontrolled natural scenes, rapidly and accurately locating and identifying a particular object is an important functional basis for many artificial intelligence applications. Visual perception is one application of object detection.
Existing visual perception technology generally performs detection on an image through a visual perception system based on a deep neural network to obtain instance segmentation; that is, only the perception target as a whole can be obtained, and pose estimation is performed on the entire perception target. In the field this is described as understanding the object only shallowly, by virtue of its bounding box and outline.
However, in practical application scenarios, perceiving the target only as a whole is insufficient for an artificial intelligence system to adopt adequate countermeasures. For example, in an automatic driving scene, when a vehicle stops at the roadside and a door opens, a person is likely to get out of the vehicle; but because only the whole vehicle is perceived and the vehicle is parked at the roadside, the automatic driving vehicle cannot respond to avoid a collision with the person getting out. That is, the prior art cannot analyze and understand the local motion of an object because the perception granularity is too coarse.
Disclosure of Invention
The application provides a visual perception and perception network training method, apparatus, device and storage medium, aiming to solve the prior-art problem that an overly coarse perception granularity prevents the analysis and understanding of the local motion of an object.
In a first aspect, the present application provides a method of visual perception, comprising:
acquiring an image to be perceived, wherein the image to be perceived comprises at least one target object;
identifying the image to be perceived by using a perception network to determine a perception target and a pose state of the perception target, wherein the perception target is a target object of which the pose state meets preset attributes;
and determining a control instruction according to a preset control algorithm and the pose state so that the object to be controlled determines a processing strategy for the perception target according to the control instruction.
Optionally, the identifying, by using a sensing network, the image to be sensed to determine a sensing target and a pose state of the sensing target includes:
performing feature extraction on the image to be perceived to determine features of the image to be perceived;
classifying the features with a classifier to determine the perception target;
determining the pose state of the perception target using a regression subnetwork.
In one possible design, the determining the pose state of the perceptual target using a regression subnetwork includes:
determining a moving part matched with the perception target in a preset database;
determining the state probability of the moving component according to the moving component and the standard state corresponding to the moving component by utilizing a regression subnetwork;
determining the pose state of the perception target according to the state probability, wherein the pose state comprises a state vector.
Optionally, the perception target includes: a vehicle.
In one possible design, determining a control instruction according to a preset control algorithm and the pose state includes:
and determining the control instruction according to a preset automatic driving control algorithm and the pose state so as to enable the vehicle to be controlled to decelerate or avoid the perception target according to the control instruction.
Optionally, after the recognizing, by using a sensing network, the image to be sensed to determine a sensing target and a pose state of the sensing target, the method further includes:
and marking the perception target and the pose state on the image to be perceived, and displaying the marked perception image.
In one possible design, the moving part comprises: at least one of a left front door, a left rear door, a right front door, a right rear door, a trunk, and a hood.
The second aspect of the present application provides a method for training a perceptual network, including:
acquiring image data containing a perception target and model data, wherein the image data comprises: a two-dimensional image and an annotation, the model data comprising: a three-dimensional model;
generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm;
and training the perception network to be trained according to the edited image and the label so as to determine the perception network.
In one possible design, the generating, by using a preset editing algorithm, an edited image from the two-dimensional image and the three-dimensional model includes:
determining a moving part corresponding to the perception target;
extracting a first visible region of the moving part from the two-dimensional image;
and generating the edited image according to the first visible region and the three-dimensional model.
Optionally, the generating the edited image according to the first visible region and the three-dimensional model includes:
determining pose information of the moving part according to the moving part, wherein the pose information is a matrix formed by moving states of the moving part in 6-degree-of-freedom space;
generating a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using a projection model;
determining a second visible area according to the three-dimensional point cloud and the pose information, wherein the second visible area is a visible area of the moving part at the position after the moving;
and generating the edited image according to the second visible region and the three-dimensional model by using a filling algorithm.
Further optionally, the generating the edited image according to the second visible region and the three-dimensional model by using a filling algorithm includes:
aligning the second visible region with the three-dimensional model to determine an invisible region;
determining a filling image of the invisible area according to the three-dimensional model by utilizing a rendering technology;
and overlapping the filling image and the second visible area, and replacing a moving part in the image by using the overlapped image to generate the editing image.
In one possible design, after the determining the second visible region, the method further includes:
and smoothing the second visible region by using a smoothing algorithm.
In a possible design, the training a perceptual network to be trained according to the edited image and the label to determine a perceptual network includes:
respectively extracting the characteristics of the edited image by using a main backbone network and an auxiliary backbone network to determine main characteristics and auxiliary characteristics;
merging the main feature and the auxiliary feature to obtain a correlation feature;
determining a state vector of the moving part according to the correlation features and a regression sub-network;
and training the perception network to be trained according to the state vector and the label to determine the perception network.
Optionally, the training the perception network to be trained according to the state vector and the label to determine the perception network includes:
calculating a cross entropy loss function according to the state vector and the label;
and training the perception network to be trained by utilizing the cross entropy loss function so as to determine the perception network.
In one possible design, before the combining of the main feature and the auxiliary feature to obtain the associated feature, the method further includes:
the main backbone network is configured with a first weight and the auxiliary backbone network is configured with a second weight;
pre-training the main backbone network and the auxiliary backbone network, and determining the first weight and the second weight.
Optionally, the pre-training includes:
acquiring an actual test image and a general detection image;
carrying out perception training on the main backbone network by utilizing the actual test image;
and performing perception training on the auxiliary backbone network by using the universal detection image.
Optionally, the main backbone network and the auxiliary backbone network are the same target detection network.
In a third aspect, the present application provides a visual perception device comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an image to be perceived, and the image to be perceived comprises at least one target object;
the processing module is used for identifying the image to be perceived by utilizing a perception network so as to determine a perception target and a pose state of the perception target, wherein the perception target is a target object of which the pose state meets preset attributes;
and the control module is used for determining a control instruction according to a preset control algorithm and the pose state so that the object to be controlled determines a processing strategy for the perception target according to the control instruction.
Optionally, the processing module is configured to identify the image to be perceived by using a perception network to determine a perception target and a pose state of the perception target, and includes:
the processing module is used for extracting the features of the image to be perceived so as to determine the features of the image to be perceived;
the processing module is further configured to classify the features by using a classifier to determine the perception target;
the processing module is further configured to determine the pose state of the perception target by using a regression subnetwork.
In one possible design, the processing module is further configured to determine the pose state of the perceptual target using a regression subnetwork, including:
the processing module is further used for determining a moving part matched with the perception target in a preset database;
the processing module is further configured to determine, by using a regression subnetwork, a state probability of the moving component according to the moving component and a standard state corresponding to the moving component;
the processing module is further configured to determine the pose state of the perceptual target according to the state probability, where the pose state includes a state vector.
In one possible design, the control module is configured to determine a control instruction according to a preset control algorithm and the pose state, and includes:
the control module is used for determining the control instruction according to a preset automatic driving control algorithm and the pose state so that the vehicle to be controlled avoids the perception target according to the control instruction.
Optionally, after the processing module is configured to identify the image to be perceived by using a perception network to determine a perception target and a pose state of the perception target, the processing module further includes:
the processing module is further configured to mark the perception target and the pose state on the image to be perceived, and display the marked perception image.
In a fourth aspect, the present application provides a training apparatus for a perceptual network, comprising:
an obtaining module, configured to obtain image data including a perception target and model data, where the image data includes: a two-dimensional image and an annotation, the model data comprising: a three-dimensional model;
the image editing module is used for generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm;
and the training module is used for training the perception network to be trained according to the edited image and the label so as to determine the perception network.
In one possible design, the image editing module is configured to generate an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm, and includes:
the image editing module is used for determining a moving part corresponding to the perception target;
the image editing module is further configured to extract a first visible region of the moving component from the two-dimensional image;
the image editing module is further configured to generate the edited image according to the first visible region and the three-dimensional model.
In one possible design, the image editing module is further configured to generate the edited image according to the first visible region and the three-dimensional model, and includes:
the image editing module is further configured to determine pose information of the moving component according to the moving component, where the pose information is a matrix formed by motion states of the moving component in 6-degree-of-freedom space;
the image editing module is further used for generating a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using a projection model;
the image editing module is further configured to determine a second visible area according to the three-dimensional point cloud and the pose information, where the second visible area is a visible area of the moving component at the position after the moving;
the image editing module is further configured to generate the edited image according to the second visible region and the three-dimensional model by using a filling algorithm.
In one possible design, the image editing module is further configured to generate the edited image according to the second visible region and the three-dimensional model by using a filling algorithm, and includes:
the image editing module is further configured to align the second visible region with the three-dimensional model, and determine an invisible region;
the image editing module is further used for determining a filling image of the invisible area according to the three-dimensional model by utilizing a rendering technology;
the image editing module is further configured to superimpose the filling image and the second visible region, and replace a moving part in the image with the superimposed image, so as to generate the edited image.
Optionally, after determining the second visible region, the image editing module is further configured to:
the image editing module is further configured to perform smoothing processing on the second visible region by using a smoothing processing algorithm.
Optionally, the training module is configured to train a perception network to be trained according to the edited image and the label to determine a perception network, and includes:
the training module is used for respectively extracting the characteristics of the edited image by utilizing a main backbone network and an auxiliary backbone network so as to determine main characteristics and auxiliary characteristics;
the training module is further configured to combine the main features and the auxiliary features to obtain associated features, where the main backbone network is configured with a first weight, and the auxiliary backbone network is configured with a second weight;
the training module is further used for determining a state vector of the moving part according to the correlation characteristics and the regression sub-network;
the training module is further configured to train the perception network to be trained according to the state vector and the label to determine the perception network.
In a possible design, the training module is further configured to train the to-be-trained perceptual network according to the state vector and the label to determine the perceptual network, and includes:
the training module is further used for calculating a cross entropy loss function according to the state vector and the label;
the training module is further configured to train the perceptual network to be trained by using the cross entropy loss function to determine the perceptual network.
Optionally, before combining the main feature and the auxiliary feature to obtain the associated feature, the training module is further configured as follows:
the training module is further configured to pre-train the main backbone network and the auxiliary backbone network, and determine the first weight and the second weight.
Optionally, the training module is further configured to pre-train the main backbone network and the auxiliary backbone network, and includes:
the acquisition module is also used for acquiring an actual test image and a general detection image;
the training module is also used for carrying out perception training on the main backbone network by utilizing the actual test image;
the training module is further configured to perform sensing training on the auxiliary backbone network by using the general detection image.
In a fifth aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
and the processor is used for calling and executing the program instructions in the memory to execute any one of the possible visual perception methods provided by the first aspect.
In a sixth aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
and the processor is used for calling and executing the program instructions in the memory to execute any one of the possible perceptual network training methods provided by the second aspect.
In a seventh aspect, the present application provides a storage medium, wherein a computer program is stored in the storage medium, and the computer program is used to execute any one of the possible visual perception methods provided in the first aspect.
In an eighth aspect, the present application provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute any one of the possible perceptual network training methods provided in the second aspect.
The application provides a visual perception method and a perception network training method, together with corresponding apparatus, device and storage medium. The visual perception method identifies an image to be perceived with a perception network to determine a perception target and the pose state of the perception target, and determines a control instruction from a preset control algorithm and the pose state. Pose perception of the moving parts of the perception target is thereby realized, the perception granularity is refined, and the analysis and understanding of the local motion of objects is deepened. The perception network training method obtains image data and model data, generates an edited image from a two-dimensional image and a three-dimensional model with a preset editing algorithm, and finally trains the perception network to be trained with the edited image and its annotations to determine the perception network, achieving the technical effect of generating training images that are more realistic and have a smaller domain difference.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a schematic flow chart of a visual perception method provided in the present application;
FIG. 2 is a schematic flow chart of another visual perception method provided herein;
fig. 3a to 3f are schematic views of application scenarios of a visual perception method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a method for training a perceptual network according to the present application;
FIGS. 5a-5h are diagrams of edited image samples provided by embodiments of the present application;
fig. 6 is a schematic flowchart of another method for training a perceptual network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an edited image generation process provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of a further method for training a perceptual network according to an embodiment of the present application;
fig. 9 is a data flow structure diagram of perceptual network training provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a visual perception device provided by the present application;
fig. 11 is a schematic structural diagram of a perceptual network training apparatus provided in the present application;
fig. 12 is a schematic structural diagram of a visual perception electronic device provided in the present application;
Fig. 13 is a schematic structural diagram of an electronic device for perceptual network training provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments, including but not limited to combinations of the disclosed embodiments, that can be derived by a person skilled in the art without inventive effort fall within the scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Existing visual perception technology generally performs detection on an image through a visual perception system based on a deep neural network to obtain instance segmentation of the whole object; that is, only the perception target as a whole can be obtained, and pose estimation is performed on the entire perception target. In the field this is described as understanding the object only shallowly, by virtue of its bounding box and outline.
However, in practical application scenarios, perceiving the target only as a whole does not provide enough information for an artificial intelligence system to take the right countermeasures. For example, in an automatic driving scene, when a vehicle stops at the roadside and a door opens, a person is likely to get out of the vehicle; but because only the whole vehicle is perceived and the vehicle is parked at the roadside, the automatic driving vehicle cannot respond to avoid a collision with the person getting out. That is, the prior art cannot analyze and understand the local motion of an object because the perception granularity is too coarse.
To solve the above problems, the present application is based on the idea of using a perception model with a finer perception granularity to perceive and identify a target. This raises a new problem: how to divide the granularity effectively. If the granularity is too coarse, the perception target cannot be fully analyzed and understood; if it is too fine, the perception process becomes overly complex, takes too long, and costs too much. How to select an appropriate granularity was therefore the first problem faced by the inventors. Through long-term and extensive creative work in practice, the inventors found that dividing the movable portions of the perception object down to the component level is an appropriate granularity. For example, the hands and feet of a person are an appropriate granularity relative to the whole person, while dividing down to individual fingers is too fine; likewise, dividing a car into doors or a trunk lid is an appropriate granularity, while dividing down to door hinges or trunk support rods is too fine.
With the granularity problem solved, the next question is how to obtain a perception network suitable for perceiving at the moving-part level.
The present application provides a method, an apparatus, an electronic device and a storage medium for visual perception and perception network training, which are described in the following embodiments.
To facilitate understanding of the invention as a whole, the following disclosure first introduces how a perception network capable of perceiving at the moving-part level is used to perceive and identify a target, solving the problem that an overly coarse perception granularity prevents correct analysis and understanding of the perception target and fails to provide enough decision information for a control system. It then describes how the perception network of the application is obtained through targeted training, and creatively provides a method for accelerating the generation of training images and for obtaining training images with a smaller domain difference that are closer to the actual situation.
Fig. 1 is a schematic flow chart of a visual perception method provided in the present application, and as shown in fig. 1, the method specifically includes the steps of:
and S101, acquiring an image to be perceived.
In this step, an image containing the target to be sensed, for example, an image captured by a camera, or an image captured from a surveillance video, is received/acquired from an external database or an external input interface.
S102, identifying the image to be perceived by using a perception network so as to determine a perception target and a pose state of the perception target.
In this step, a perception network based on a neural network algorithm extracts features from the image to be perceived and then classifies and identifies the obtained features. Through feature extraction on the image to be perceived, the perception network first confirms the perception target as a whole and then recognizes the states of the moving parts within that target. For example, the perception network first perceives the whole human body in the image, then divides the features of the whole human body into moving parts, such as the hands, and then recognizes the motion state of a hand. The state of the hand may be defined as three pose states, raised, level, and lowered; by recognizing the hand, the perception network can provide a definite state label, i.e., the pose state of the perception target. It should be further noted that this pose state is a trigger that has been defined for the perception target in the perception network and corresponds to a preset attribute. For example, when a person's hand is raised, the preset attribute defined in the perception network is that the perception object is providing an interception instruction.
S103, determining a control instruction according to a preset control algorithm and the pose state, so that the object to be controlled determines a processing strategy for the perception target according to the control instruction.
The object to be controlled is, for example, an intelligent driving automobile. When the perception network recognizes that the pose state of a roadside pedestrian is a raised hand, the perception object is providing an interception instruction according to the preset attribute of the raised hand; at this time, the preset control algorithm generates a braking control instruction so that the automobile brakes and stops automatically.
It should be noted that the embodiment is not limited to the above-mentioned example of the automobile brake, and a person skilled in the art may apply the method to a specific scenario according to the meaning of the step, and the embodiment does not limit the application scenario.
The embodiment provides a visual perception method, which is characterized in that a perception network is utilized to identify an acquired image to be perceived so as to determine a perception target and a pose state of the perception target, and finally a control instruction is determined according to a preset control algorithm and the pose state, so that an object to be controlled determines a processing strategy for the perception target according to the control instruction. The pose perception of the moving part of the perception target is realized, the perception granularity is refined, and the technical effects of analyzing and understanding the local movement of the object are deepened.
Fig. 2 is a schematic flow chart of another visual perception method provided in the present application, as shown in fig. 2, the method includes the specific steps of:
S201, obtaining an image to be perceived.
It should be noted that, in order to facilitate those skilled in the art to understand a specific implementation manner of the visual perception method of the present application, an application scenario of the present embodiment is that an intelligent driving automobile including the visual perception method of the present embodiment performs perception and identification on other vehicles encountered on a road during driving. A person skilled in the art can select a specific application scenario by analogy with the implementation of the visual perception method of the embodiment, and is not limited to the application scenario described in the embodiment.
In this step, the real road condition map at the current moment, namely the image to be perceived, is shot by the front camera of the intelligent driving automobile.
Fig. 3a to 3f are schematic application scenarios of a visual perception method according to an embodiment of the present application. As shown in fig. 3a, 3c and 3e, the intelligent driving vehicle runs on the urban road, and the front-facing camera collects road condition images in real time.
S202, extracting the features of the image to be perceived to determine the features of the image to be perceived.
In this step, the neural-network-based perception network performs feature extraction on the acquired road condition image; specifically, it convolves the image through multiple convolution layers to obtain a feature map or feature vector of the image.
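As an illustration only, and not the network architecture claimed by the application, the following PyTorch sketch shows the kind of multi-layer convolutional feature extraction described in this step; the layer sizes, the `SimpleBackbone` name, and the input resolution are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleBackbone(nn.Module):
    """Minimal convolutional backbone: stacks convolution layers to turn an
    RGB road-condition image into a feature map (hypothetical layer sizes)."""
    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, H, W) -> feature map: (N, 256, H/8, W/8)
        return self.layers(image)

features = SimpleBackbone()(torch.randn(1, 3, 384, 640))  # e.g. one captured frame
```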
S203, classifying the features by using a classifier to determine a perception target.
In this step, the image features are classified by a classifier, and the classification criterion is whether the image features satisfy the vehicle feature attribute, so as to determine the vehicle in the image to be perceived, i.e. the road condition image. As shown in fig. 3b, if there are multiple sensing targets in the image to be sensed, the classifier should classify and distinguish the objects included in the image one by one to find all the sensing targets satisfying the vehicle characteristics.
S204, determining the moving parts matched with the perception targets in the preset database.
In this step, the moving-part divisions corresponding to different perception targets are stored in a preset database. A granularity fine enough for analyzing and understanding the perception target is divided in advance; because the division granularity of different perception targets can differ, the moving parts corresponding to each perception target need to be preset and stored in the database, or in a storage unit playing the same role as the database, or directly bound with the perception target as a single data entity. In this embodiment, for the perception target vehicle, its moving parts may be divided to include: at least one of a left front door, a left rear door, a right front door, a right rear door, a trunk lid, and a hood.
S205, determining the state probability of the moving component according to the moving component and the standard state corresponding to the moving component by using the regression sub-network.
In this step, each moving part has its corresponding standard state; for example, the standard states of a vehicle door can be set to open or closed, with the state flag set to 1 in the open state and to 0 in the closed state. The regression sub-network performs state detection on all moving parts corresponding to the perception target one by one; specifically, the features are further convolved by a convolution algorithm and then normalized to obtain a state probability in the range [0, 1].
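A minimal sketch of a regression sub-network of the kind described here, assuming the per-target features have already been pooled to a fixed-size vector; the sigmoid provides the normalization of each moving-part state probability into [0, 1]. All names and sizes are illustrative, not the patented architecture.

```python
import torch
import torch.nn as nn

class PartStateRegressor(nn.Module):
    """Maps pooled per-vehicle features to one open/closed probability per
    moving part (e.g. 6 parts: four doors, trunk lid, hood)."""
    def __init__(self, feat_dim: int = 256, num_parts: int = 6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_parts),
        )

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        # Sigmoid normalizes each logit into a state probability in [0, 1].
        return torch.sigmoid(self.head(pooled_features))

probs = PartStateRegressor()(torch.randn(1, 256))  # -> six state probabilities
```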
S206, determining the pose state of the perception target according to the state probability.
In this step, the state probability may be compared with a preset state determination threshold; for example, if the probability of the door being open is greater than or equal to the state determination threshold, the door is determined to be open, and the door state in the pose state of the vehicle is set to 1. It can be understood that when the perception target has only one moving component, the pose state is a binary state quantity, i.e., its value is 0 or 1.
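The threshold comparison can be expressed as follows; the 0.5 threshold and the part ordering are assumptions made for illustration.

```python
from typing import Dict, Sequence

def pose_state_from_probs(state_probs: Sequence[float], threshold: float = 0.5) -> Dict[str, int]:
    """Turn per-part state probabilities into a binary pose-state vector:
    1 = part open (probability >= threshold), 0 = part closed."""
    part_names = ["left_front_door", "left_rear_door", "right_front_door",
                  "right_rear_door", "trunk_lid", "hood"]
    return {name: int(p >= threshold) for name, p in zip(part_names, state_probs)}

# Example: only the right front door exceeds the threshold.
print(pose_state_from_probs([0.1, 0.2, 0.93, 0.05, 0.3, 0.1]))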
Optionally, as shown in fig. 3b, 3d, and 3f, after determining the pose state of the sensing target, the sensing target and the pose state corresponding to the sensing target may be marked on the image to be sensed, and the specific implementation manner may be that the outer contour of the sensing target is marked with a curve frame, the moving component conforming to a specific state is also marked with a curve frame of a different color, and the moving state description of the moving component at that time is attached.
S207, determining a control instruction according to a preset control algorithm and the pose state, so that the object to be controlled determines a processing strategy for the perception target according to the control instruction.
In this step, as shown in fig. 3b, the regression sub-network determines that the right door 311 of the vehicle 31 (the perception target, marked by the rectangular frame in the figure) is in the open state. According to the preset processing strategy, when a door is open a person may suddenly get out, so the intelligent driving vehicle should decelerate or avoid; accordingly, the preset control algorithm sends a deceleration or avoidance control instruction to the intelligent driving vehicle, and the intelligent driving vehicle decelerates or avoids the perception target according to the instruction.
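A sketch of the kind of rule such a preset control algorithm might apply to the recognized pose state; the command names and the rule itself are illustrative assumptions, not a prescribed control policy.

```python
def control_command(pose_state: dict, parked_at_roadside: bool) -> str:
    """Map a perceived vehicle's pose state to a control instruction for the
    ego vehicle: an open door on a parked car suggests someone may step out."""
    door_open = any(pose_state.get(part, 0) for part in
                    ("left_front_door", "left_rear_door",
                     "right_front_door", "right_rear_door"))
    if door_open and parked_at_roadside:
        return "DECELERATE_AND_AVOID"
    return "KEEP_LANE"

print(control_command({"right_front_door": 1}, parked_at_roadside=True))
```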
It should be noted that the perception network described in this embodiment may be a neural network that performs a single visual perception task, or may be a combination of neural networks that can perform multiple tasks simultaneously.
Specifically, the perception network of the present embodiment includes: a main backbone network, an auxiliary backbone network and a multitasking sub-network;
the main backbone network and the auxiliary backbone network are used for extracting the characteristics of the image to be perceived;
the multitasking sub-network is used to perform a variety of different tasks (a brief illustrative sketch follows this list), including but not limited to:
recognizing a perception target;
performing example segmentation on the perception target in the image to be perceived, such as performing segmentation display on the outer contour range of the perception vehicle in fig. 3 d;
determining and marking a state vector of a perception target;
segmenting the moving part of the perception target, such as segmenting and displaying the outline of the trunk lid in fig. 3 f;
and labeling the perception object with a class bounding box, such as the boxes in fig. 3b, fig. 3d, and fig. 3f, which is the class bounding box.
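As a brief illustration of how shared backbone features can feed several such task heads, the following PyTorch sketch is given under stated assumptions: the head layout, dimensions and names are hypothetical, and the instance- and part-segmentation heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Illustrative multi-task sub-network: shared features feed separate heads
    for target classification, bounding-box regression and per-part states."""
    def __init__(self, feat_dim: int = 256, num_classes: int = 4, num_parts: int = 6):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)   # perception target class
        self.box_head = nn.Linear(feat_dim, 4)              # class bounding box
        self.state_head = nn.Linear(feat_dim, num_parts)    # moving-part state vector

    def forward(self, feats: torch.Tensor):
        return (self.cls_head(feats),
                self.box_head(feats),
                torch.sigmoid(self.state_head(feats)))

logits, boxes, states = MultiTaskHeads()(torch.randn(2, 256))
```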
The embodiment provides a visual perception method, which is characterized in that a perception network is utilized to identify an acquired image to be perceived so as to determine a perception target and a pose state of the perception target, and finally a control instruction is determined according to a preset control algorithm and the pose state, so that an object to be controlled determines a processing strategy for the perception target according to the control instruction. The pose perception of the moving part of the perception target is realized, the perception granularity is refined, and the technical effects of analyzing and understanding the local movement of the object are deepened.
The following describes how the perception network training method of the application trains the perception network in a targeted manner, so that the perception granularity of the perception network reaches the moving-part level, the perception target can be accurately analyzed, and the perception process is kept from becoming too complex and time-consuming.
Fig. 4 is a schematic flowchart of a method for training a perceptual network provided in the present application, and as shown in fig. 4, the method specifically includes the steps of:
s401, obtaining image data containing a perception target and model data, wherein the image data comprises: two-dimensional images and annotations, the model data comprising: a three-dimensional model.
In this step, typical perception objects of neural network target detection algorithms are selected, for example vehicles, people, and animals. Images are selected for the different perception targets, either individually or in preset combinations, such as an image of a single vehicle, an image of several vehicles, or an image mixing people and vehicles. The images are then manually annotated; the annotated content includes the type of the perception target, the moving components corresponding to the perception target and their states, and the like. The images and the annotations constitute the image data. Model data is then established for the corresponding perception target; the model data includes a three-dimensional model and may include other data, which this embodiment does not limit. The model data is used to display portions that cannot be seen from the viewpoint of the two-dimensional image: for example, in an image of a running vehicle only the outer surface of a closed door is visible, while the inward-facing side of the door is not shown, and this portion needs to be supplemented by the model data. In this step, image data and model data meeting the above requirements are acquired.
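One possible way to organize such a training sample is sketched below; the field names, file paths and value layout are assumptions for illustration, not the annotation format used by the application.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TrainingSample:
    """A training sample pairing a real two-dimensional image with its
    annotations and the three-dimensional model of the perception target."""
    image_path: str                # the real two-dimensional image
    target_type: str               # e.g. "vehicle", "person"
    part_states: Dict[str, int]    # e.g. {"right_front_door": 1}
    target_pose_6dof: List[float]  # six-degree-of-freedom pose of the target
    model_path: str                # mesh/CAD file supplying unseen surfaces

sample = TrainingSample(
    image_path="images/000001.jpg",
    target_type="vehicle",
    part_states={"right_front_door": 0, "trunk_lid": 0},
    target_pose_6dof=[0.0, 0.0, 12.5, 0.0, 0.1, 0.0],
    model_path="models/sedan.obj",
)
```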
S402, generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm.
In this step, the moving part of the perception target is first cut out from the two-dimensional image, and its position in a new state is determined according to its motion trajectory. For example, a hand is a moving part, and the two-dimensional image only shows the hand lowered; the preset editing algorithm can cut out the hand region in the two-dimensional image and rotate it by 90 degrees around the shoulder to reach the state in which the hand is raised level. Because the hand in the original two-dimensional image was not raised, part of the image is inevitably missing and the result is not realistic enough. To solve this problem, a three-dimensional model is introduced in this step, and the preset editing algorithm supplements the missing part of the image with the three-dimensional model, so that a state image of the hand raised level, i.e., an edited image, can be formed.
It should be noted that in the prior art, training images are obtained by re-acquiring images in different states, which greatly increases the resources and time consumed in preparing the training data; to obtain a sufficient amount of data, this prior-art practice is impractical. Another prior-art approach is to generate a large number of state pictures entirely with CG (Computer Graphics) technology, but pictures generated this way usually have a huge domain difference from real scenes in appearance, and it is difficult to train a high-performance deep network model on them. Compared with these two approaches, the edited-image generation method provided by the embodiment of the application starts from a two-dimensional image of a real scene and combines the motion of the three-dimensional model with the rendering of the environment and the perception target to compose pictures corresponding to different pose states, thereby ensuring that the training images are close enough to the real scene and solving the problem that pictures in enough states cannot be acquired manually.
Fig. 5a-5h are diagrams of edited image samples according to embodiments of the present application. As shown in fig. 5a to 5h, the original two-dimensional images of all vehicles are in the normal state, that is, the doors, the hood, and the trunk are all closed, and edited pictures of the doors, the hood, and the trunk in the open state are generated through the preset editing algorithm. Because only part of the three-dimensional model is needed, rather than rendering entire images as in CG generation, which requires a large amount of computation, the generation speed of the edited images is improved.
S403, training the perception network to be trained according to the edited image and the label to determine the perception network.
In this step, the edited images generated in the previous step are used to train the perception network to be trained with a neural network training method, so that a perception network with moving-part-level perception granularity can be obtained. Training methods include, but are not limited to: gradient descent, Newton's method, the conjugate gradient method, quasi-Newton methods, and the Levenberg-Marquardt algorithm.
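A minimal training-loop sketch under stated assumptions: `model` stands in for the perception network (here assumed to output per-part state probabilities in [0, 1]) and `loader` for an edited-image dataset of (image, label-vector) pairs; binary cross-entropy is used as the cross-entropy loss on the binary part states, and SGD as one of the gradient-descent methods mentioned above.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    criterion = nn.BCELoss()  # binary cross-entropy on moving-part states
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # predictions vs. annotations
            loss.backward()                          # backpropagation
            optimizer.step()                         # gradient-descent update
    return model
```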
The embodiment of the application provides a method for training a perception network: image data and model data of a perception target are obtained; on the basis of the two-dimensional image, a preset editing algorithm uses the model data to simulate the pose states of the moving parts of various perception targets and composes them into edited images; finally, the perception network to be trained is trained with the edited images to obtain a perception network with moving-part-level perception granularity. This achieves the technical effects of generating training images that are faster to produce, more realistic, and have a smaller domain difference.
Fig. 6 is a flowchart illustrating another method for training a perceptual network according to an embodiment of the present application. The embodiment describes a specific implementation manner of the perceptual network training method according to a specific application scenario of generating an open state of a right front door or an open state of a trunk lid of a vehicle. As shown in fig. 6, the method includes the following specific steps:
s601, obtaining image data and model data containing a perception target.
Fig. 7 is a schematic diagram of an edited image generating process according to an embodiment of the present application. As shown in fig. 7, first, a two-dimensional image including a perception target is obtained; in this embodiment, the perception target is a vehicle, and the two-dimensional image of the vehicle is annotated. The annotation content includes: the type of the perception target, namely a vehicle; the moving part corresponding to the perception target, namely a right front door or a trunk lid; the state of that part, namely fully open; the coordinates of the rotation axis of the right front door; the six-degree-of-freedom pose of the vehicle; and the like (an illustrative annotation record is sketched below). Then, a three-dimensional model of the vehicle is acquired as model data.
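The following is an illustrative sketch of one annotation record corresponding to the fields listed above. The field names and all values are hypothetical examples introduced for explanation only and do not reflect the patent's actual data format.

# Hypothetical annotation record for one vehicle image (illustrative only).
annotation = {
    "target_type": "vehicle",
    "moving_part": "right_front_door",       # or "trunk_lid"
    "part_state": "fully_open",
    "rotation_axis": [1.2, 0.4, 0.7],        # example coordinates of the door hinge axis
    "vehicle_pose_6dof": {
        "rotation": [0.0, 0.1, 0.0],         # e.g. Euler angles (example values)
        "translation": [2.0, 0.0, 10.0],     # e.g. metres in the camera frame
    },
}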
And S602, determining a motion part corresponding to the perception target.
In this step, the moving part of the vehicle may be at least one of a left front door, a left rear door, a right front door, a right rear door, a trunk, and a hood; in the present embodiment it is the right front door. Since different moving parts have different pose states and movement trajectories, the moving part needs to be confirmed.
And S603, extracting a first visible area of the moving part from the two-dimensional image.
After the moving part, namely the right front door, is determined, an image region covering the right front door is extracted from the two-dimensional image as the first visible region.
And S604, determining the pose information of the moving part according to the moving part.
If the moving part is determined to be the right front door, its six-degree-of-freedom pose information includes: the rotation angle required for full opening, the direction of rotation, the final position, and the like.
And S605, generating a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using the projection model.
In this step, using the six-degree-of-freedom pose information of the moving part and a pinhole camera projection model, the part region in the two-dimensional image can be reconstructed into a three-dimensional point cloud P with the following formula (1):
P = R_g^{-1} ( D(ũ) K^{-1} ũ - t_g )          (1)

where D is the depth map, R_g and t_g represent the pose information of the moving part, namely the global pose, K is the camera intrinsic matrix, and ũ is the homogeneous vector of the image pixel coordinates.
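To make formula (1) concrete, the following is a minimal NumPy sketch of the back-projection step under the notation above; the function name, the mask argument, and the array shapes are assumptions introduced for illustration, not the patent's code.

import numpy as np

def backproject_region(depth, K, R_g, t_g, mask):
    """Back-project the masked part region into a point cloud in the part's local frame (formula (1))."""
    v, u = np.nonzero(mask)                                        # pixel coordinates of the first visible region
    uv1 = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)  # homogeneous pixel coordinates, shape (3, N)
    d = depth[v, u]                                                # per-pixel depth D
    cam_pts = d * (np.linalg.inv(K) @ uv1)                         # rays scaled by depth, camera frame
    local_pts = np.linalg.inv(R_g) @ (cam_pts - t_g.reshape(3, 1)) # remove the global pose (R_g, t_g)
    return local_pts.T                                             # (N, 3) three-dimensional point cloud P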
And S606, determining a second visible area according to the three-dimensional point cloud and the pose information.
Assuming that the rotation matrix of the moving part about its rotation axis is R_a, the new projected pixel coordinates u' are calculated with the following formula (2):

u' ≃ K ( R_g R_a P + t_g )          (2)

where R_g and t_g represent the pose information of the moving part, namely the global pose, and K is the camera intrinsic matrix; the projection is in homogeneous coordinates, so the result is divided by its last component to obtain the pixel coordinates.
The set of pixel coordinates u' represents the second visible region. This corresponds to the processing shown in fig. 7: (c) part reconstruction, (d) three-dimensional part motion, and (e) two-dimensional projection.
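A matching sketch of formula (2), continuing the assumptions above: rotate the reconstructed point cloud about the part's rotation axis (R_a) and re-project it through the pinhole model to obtain the new pixel coordinates u' of the second visible region.

import numpy as np

def reproject_moved_part(local_pts, R_a, K, R_g, t_g):
    """Project the rotated local point cloud back to pixel coordinates u' (formula (2))."""
    moved = R_a @ local_pts.T                   # three-dimensional motion of the part
    cam_pts = R_g @ moved + t_g.reshape(3, 1)   # restore the global pose
    proj = K @ cam_pts                          # pinhole projection, homogeneous coordinates
    return (proj[:2] / proj[2]).T               # divide by the last component to get (N, 2) pixel coordinates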
Optionally, after the second visible region is determined, it may contain holes and its pixel distribution may be uneven. In this case, hole completion may be performed by linear neighbor interpolation, and the second visible region may then be smoothed with a bilateral filtering algorithm, as shown in (f) image post-processing optimization in fig. 7. In this embodiment the linear neighbor interpolation and the bilateral filtering constitute the smoothing algorithm; it can be understood that the smoothing algorithm is not specifically limited in this application, and any algorithm capable of implementing the hole completion and smoothing falls within the scope stated in this application. Those skilled in the art can select an appropriate implementation according to the specific situation.
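The following sketch shows one possible reading of this post-processing step, using SciPy's linear interpolation for hole completion and OpenCV's bilateral filter for smoothing; the function name, mask convention, and filter parameters are illustrative assumptions.

import numpy as np
import cv2
from scipy.interpolate import griddata

def fill_and_smooth(region, valid_mask):
    """region: HxW float image of the second visible area; valid_mask: HxW bool of known pixels."""
    h, w = region.shape
    yy, xx = np.mgrid[0:h, 0:w]
    known_points = np.stack([yy[valid_mask], xx[valid_mask]], axis=-1)
    filled = griddata(known_points, region[valid_mask], (yy, xx), method="linear")  # hole completion
    filled = np.nan_to_num(filled, nan=0.0)                 # pixels outside the convex hull stay empty
    return cv2.bilateralFilter(filled.astype(np.float32), 5, 25, 5)  # edge-preserving smoothing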
And S607, aligning the second visible area with the three-dimensional model, and determining the invisible area.
By moving the three-dimensional model region of the moving part according to the pose information and aligning it with the second visible region, the invisible region can be obtained.
And S608, determining a filling image of the invisible area according to the three-dimensional model by using a rendering technology.
According to the outline range and shape of the invisible region, a filling image of the invisible region can be obtained by rendering the environment and the perception target from the three-dimensional model data. The generation of the filling image corresponds to (g) environment map rendering and (h) three-dimensional part rendering in fig. 7.
And S609, overlapping the filling image and the second visible area, and replacing a moving part in the two-dimensional image by using the overlapped image to generate the editing image.
The filling image and the second visible region are overlaid, as in the two edited images of the output result in fig. 7, and the overlaid image replaces the moving part, namely the trunk lid or the right front door, in the two-dimensional image, finally yielding an edited image with moving-part pose annotations for training.
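A compositing sketch of this replacement step, under the same illustrative assumptions as above: the overlaid part image (the filled invisible region plus the second visible region) is pasted back into the original two-dimensional image at the new part location.

import numpy as np

def composite_edited_image(original, part_image, part_mask):
    """original, part_image: HxWx3 arrays; part_mask: HxW bool covering the moved part."""
    edited = original.copy()
    edited[part_mask] = part_image[part_mask]   # replace the moving part pixels in the two-dimensional image
    return edited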
S610, training the perception network to be trained according to the edited image and the label to determine the perception network.
The detailed description of this step refers to S403 in the embodiment shown in fig. 4, and is not repeated here.
In the perceptual network training method provided by this embodiment, based on an image editing technology guided by a three-dimensional moving part, a part-level CAD three-dimensional model aligned with the object in the two-dimensional image is used to guide the two-dimensional part region through reasonable movement and change, so that the two-dimensional object in the image exhibits different states, and the corresponding annotation information is generated automatically. Compared with manually collecting state images and labels of the object, a large number of images covering all states of the object are generated automatically from existing data, and the corresponding annotation information is generated automatically as well. On the other hand, compared with rendering by CG technology, the method described in this embodiment generates more realistic images, greatly alleviates the domain difference problem, and is faster.
Fig. 8 is a flowchart illustrating a further method for training a perceptual network according to an embodiment of the present application. As shown in fig. 8, the method includes the following specific steps:
S801, image data and model data containing a perception target are obtained.
And S802, generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm.
For specific descriptions of S801 to S802 in this embodiment, refer to S601 to S609 in the embodiment shown in fig. 6, which are not described herein again.
And S803, respectively performing feature extraction on the edited image by using the main backbone network and the auxiliary backbone network to determine main features and auxiliary features.
In this step, the edited image is subjected to feature extraction using the main backbone network, and the edited image is also subjected to feature extraction using the auxiliary backbone network. It can be understood that the present embodiment does not limit the sequence of extracting features from two backbone networks.
It should be further noted that the main backbone network and the auxiliary backbone network may be two identical neural network algorithms or two different neural network algorithms.
The features of the edited image extracted by the main backbone network are the main features, and the features extracted by the auxiliary backbone network are the auxiliary features. The features are extracted by convolution at each convolution layer of the neural network, which is not described herein again.
And S804, combining the main characteristic and the auxiliary characteristic to obtain the associated characteristic.
Fig. 9 is a data flow structure diagram of perceptual network training provided in an embodiment of the present application. As shown in fig. 9, the main feature and the assistant feature are combined to form the associated feature. Specifically, the main feature matrix and the assistant feature matrix may be combined into a correlation feature matrix.
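The two-branch feature extraction and merging can be illustrated with the following PyTorch sketch; the backbone depth, channel counts, class names, and the use of channel-wise concatenation as the merge operation are assumptions and do not represent the patent's exact architecture.

import torch
import torch.nn as nn

def small_backbone(out_channels=64):
    # A stand-in convolutional backbone; any feature extractor could be used here.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_channels, 3, stride=2, padding=1), nn.ReLU(),
    )

class TwoBranchFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.main_backbone = small_backbone()   # pre-trained on the specific perception target
        self.aux_backbone = small_backbone()    # pre-trained on a general detection task

    def forward(self, edited_image):
        main_feat = self.main_backbone(edited_image)    # main features
        aux_feat = self.aux_backbone(edited_image)      # auxiliary features
        return torch.cat([main_feat, aux_feat], dim=1)  # associated features (channel-wise merge)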
Optionally, before the step of S804, the method may further include:
Specifically, the main backbone network is pre-trained with real images to be perceived that contain a certain specific perception target or scene, to obtain a first weight corresponding to the main backbone network, where the first weight is the neural network parameters of the main backbone network.
The auxiliary backbone network is pre-trained with a general detection task, where a general detection task refers to a preset set of training methods applicable to most neural networks; pre-training the auxiliary backbone network yields a second weight, where the second weight is the neural network parameters of the auxiliary backbone network.
The two are combined, so that the generalization performance of the perception network can be improved, and the perception network can perceive more different types of perception targets.
It should be noted that, after the pre-training is finished, both the first weight and the second weight are frozen, that is, in the subsequent training process, the values of both the first weight and the second weight are not changed.
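Freezing the first and second weights can be expressed as follows, continuing the PyTorch sketch above; disabling gradients is the standard way to keep parameters unchanged during subsequent training.

def freeze_backbones(extractor):
    for param in extractor.main_backbone.parameters():
        param.requires_grad = False   # the first weight is frozen
    for param in extractor.aux_backbone.parameters():
        param.requires_grad = False   # the second weight is frozen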
And S805, determining the state vector of the moving part according to the associated features and the regression sub-network.
The associated features are input into the regression sub-network, which generates a state vector of the moving parts of the perception target. The state vector encodes the state of each moving part with a corresponding value; for example, with 0 for closed and 1 for open, the state vector [0, 1] for the right front door and the trunk lid of the vehicle means that the right front door is closed while the trunk lid is open (an illustrative regression head is sketched below).
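An illustrative regression head matching the example above: it maps the associated features to one open/closed probability per moving part, e.g. [right front door, trunk lid]. The layer sizes and class name are assumptions.

import torch
import torch.nn as nn

class StateRegressionHead(nn.Module):
    def __init__(self, in_channels=128, num_parts=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse the spatial dimensions
        self.fc = nn.Linear(in_channels, num_parts)

    def forward(self, associated_feat):
        x = self.pool(associated_feat).flatten(1)
        return torch.sigmoid(self.fc(x))             # e.g. [0.05, 0.97] ~ door closed, trunk lid open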
And S806, calculating a cross entropy loss function according to the state vector and the label.
The state vector obtained in the previous step is compared with the manually annotated state, and the difference between them, namely the cross entropy loss, is computed with the cross entropy loss function.
S807, training the perception network to be trained by utilizing the cross entropy loss function to determine the perception network.
The cross entropy loss is back-propagated, and after a number of iterations the training of the perception network is completed and the perception network is determined (a minimal training loop is sketched below).
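A minimal training loop tying the sketches above together, again as an assumption rather than the patent's exact procedure: a per-part binary cross entropy is computed between the predicted state vector and the annotated states, back-propagated, and only the non-frozen parameters are updated.

import torch
import torch.nn as nn

def train_state_head(extractor, head, data_loader, epochs=10, lr=1e-3):
    loss_fn = nn.BCELoss()                        # cross entropy over binary part states
    params = [p for p in list(extractor.parameters()) + list(head.parameters())
              if p.requires_grad]                 # frozen backbone weights are skipped
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for edited_image, state_label in data_loader:
            optimizer.zero_grad()
            state_vec = head(extractor(edited_image))
            loss = loss_fn(state_vec, state_label.float())
            loss.backward()                       # back-propagate the loss
            optimizer.step()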
It should be further noted that, as shown in fig. 9, the perception network described in this application is a multi-task neural network that can simultaneously perform detection, instance segmentation, state description, and moving part segmentation on the perception target. Detection of the perception target means that the perception target can be identified and enclosed by a box, namely a category bounding box. Instance segmentation means that a detected perception target can be outlined in the image to be perceived with a contour. Moving part segmentation means that the moving part can be outlined in the image to be perceived with a contour. State description means that the state of the moving part or the perception target can be displayed with preset text. As shown in fig. 3b, fig. 3d, and fig. 3f, after the perception network perceives the perception target, it labels the corresponding multi-task results on the picture to be perceived.
According to the training method for the perception network provided by this embodiment, after the edited images are generated, a two-branch backbone scheme is introduced: the main backbone network improves the perception accuracy of the perception network for a specific perception target, the auxiliary backbone network preserves the generalization performance of the perception network, that is, keeps a high perception accuracy for more types of perception targets, and the regression sub-network further refines the perception granularity, accurately identifying the pose states of the perception target's moving parts and deepening the perception network's ability to analyze and understand the perception target.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments can be implemented by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps including the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 10 is a schematic structural diagram of a visual perception device provided in the present application. The visual perception means may be implemented by software, hardware or a combination of both.
As shown in fig. 10, the visual perception device 1000 provided by the present embodiment includes:
an obtaining module 1001, configured to obtain an image to be perceived, where the image to be perceived includes at least one target object;
the processing module 1002 is configured to identify the image to be perceived by using a perception network, so as to determine a perception target and a pose state of the perception target, where the perception target is a target object whose pose state meets a preset attribute;
the control module 1003 is configured to determine a control instruction according to a preset control algorithm and the pose state, so that the object to be controlled determines a processing strategy for the sensing target according to the control instruction.
Optionally, the processing module 1002 is configured to identify the image to be perceived by using a perception network to determine a perception target and a pose state of the perception target, and includes:
the processing module 1002 is configured to perform feature extraction on the image to be perceived to determine features of the image to be perceived;
the processing module 1002 is further configured to classify the features by using a classifier to determine the perception target;
the processing module 1002 is further configured to determine the pose state of the perception target by using a regression sub-network.
In one possible design, the processing module 1002 is further configured to determine the pose state of the perceptual target using a regression subnetwork, including:
the processing module 1002 is further configured to determine a moving component in a preset database, where the moving component is matched with the sensing target;
the processing module 1002 is further configured to determine, by using a regression subnetwork, a state probability of the moving component according to the moving component and a standard state corresponding to the moving component;
the processing module 1002 is further configured to determine the pose state of the sensing target according to the state probability, where the pose state includes a state vector.
In one possible design, the control module 1003 is configured to determine a control instruction according to a preset control algorithm and the pose state, and includes:
the control module 1003 is configured to determine the control instruction according to a preset automatic driving control algorithm and the pose state, so that the vehicle to be controlled avoids the perception target according to the control instruction.
Optionally, after the processing module 1002 is configured to identify the image to be perceived by using a perception network to determine a perception target and a pose state of the perception target, the method further includes:
the processing module 1002 is further configured to mark the perception target and the pose state on the image to be perceived, and display the marked perception image.
It should be noted that the visual perception device provided in the embodiment shown in fig. 10 can execute a visual perception method provided in any one of the above method embodiments, and the specific implementation principle, technical features, technical term explanation and technical effects thereof are similar and will not be described herein again.
Fig. 11 is a schematic structural diagram of a perceptual network training apparatus provided in the present application. The perceptual network training apparatus may be implemented by software, hardware or a combination of both.
As shown in fig. 11, the perceptual network training apparatus 1100 according to this embodiment includes:
an obtaining module 1101, configured to obtain image data including a perception target and model data, where the image data includes: a two-dimensional image and an annotation, the model data comprising: a three-dimensional model;
an image editing module 1102, configured to generate an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm;
a training module 1103, configured to train the perceptual network to be trained according to the edited image and the label, so as to determine the perceptual network.
In one possible design, the image editing module 1102 is configured to generate an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm, and includes:
the image editing module 1102 is configured to determine a moving component corresponding to the perception target;
the image editing module 1102 is further configured to extract a first visible region of the moving component from the two-dimensional image;
the image editing module 1102 is further configured to generate the edited image according to the first visible region and the three-dimensional model.
In one possible design, the image editing module 1102 is further configured to generate the edited image according to the first visible region and the three-dimensional model, and includes:
the image editing module 1102 is further configured to determine pose information of the moving component according to the moving component, where the pose information is a matrix formed by motion states of the moving component in a space 6 degree of freedom;
the image editing module 1102 is further configured to generate a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using a projection model;
the image editing module 1102 is further configured to determine a second visible area according to the three-dimensional point cloud and the pose information, where the second visible area is a visible area of the moving component at the position after the moving;
the image editing module 1102 is further configured to generate the edited image according to the second visible region and the three-dimensional model by using a filling algorithm.
In one possible design, the image editing module 1102 is further configured to generate the edited image according to the second visible region and the three-dimensional model by using a filling algorithm, and includes:
the image editing module 1102 is further configured to align the second visible region with the three-dimensional model, and determine an invisible region;
the image editing module 1102 is further configured to determine a filling image of the invisible area according to the three-dimensional model by using a rendering technology;
the image editing module 1102 is further configured to overlay the filler image with the second visible area, and replace a moving component in the image with the overlaid image to generate the edited image.
Optionally, the image editing module 1102 is configured to, after determining the second visible region, further include:
the image editing module 1102 is further configured to perform smoothing processing on the second visible region by using a smoothing processing algorithm.
Optionally, the training module 1103 is configured to train the perceptual network to be trained according to the edited image and the label to determine the perceptual network, and includes:
the training module 1103 is configured to perform feature extraction on the edited image by using a main backbone network and an auxiliary backbone network, respectively, to determine a main feature and an auxiliary feature;
the training module 1103 is further configured to combine the main features and the auxiliary features to obtain associated features, where the main backbone network is configured with a first weight, and the auxiliary backbone network is configured with a second weight;
the training module 1103 is further configured to determine a state vector of the moving component according to the associated features and a regression sub-network;
the training module 1103 is further configured to train the to-be-trained perceptual network according to the state vector and the label, so as to determine the perceptual network.
In one possible design, the training module 1103 is further configured to train the to-be-trained perceptual network according to the state vector and the label to determine the perceptual network, and includes:
the training module 1103 is further configured to calculate a cross entropy loss function according to the state vector and the label;
the training module 1103 is further configured to train the perceptual network to be trained by using the cross entropy loss function to determine the perceptual network.
Optionally, before the main features and the auxiliary features are combined to obtain the associated features, the training module 1103 is further configured to:
the training module 1103 is further configured to pre-train the main backbone network and the auxiliary backbone network, and determine the first weight and the second weight.
Optionally, the training module 1103 is further configured to pre-train the main backbone network and the auxiliary backbone network, including:
the acquiring module 1101 is further configured to acquire an actual test image and a general detection image;
the training module 1103 is further configured to perform perception training on the main backbone network by using the actual test image;
the training module 1103 is further configured to perform perceptual training on the auxiliary backbone network by using the general detection image.
It should be noted that the apparatus for training a perceptual network provided in the embodiment shown in fig. 11 may perform the method for training a perceptual network provided in any one of the above method embodiments, and the specific implementation principle, technical features, term interpretation, and technical effects thereof are similar and will not be described herein again.
Fig. 12 is a schematic structural diagram of a visual perception electronic device provided in the present application. As shown in fig. 12, the visual perception electronic device 1200 may include: at least one processor 1201 and a memory 1202. Fig. 12 illustrates the electronic device with one processor as an example.
The memory 1202 stores programs. In particular, the program may include program code including computer operating instructions.
Memory 1202 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 1201 is configured to execute computer-executable instructions stored in the memory 1202 to implement the visual perception methods described in the above method embodiments.
The processor 1201 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
Alternatively, the memory 1202 may be separate or integrated with the processor 1201. When the memory 1202 is a device separate from the processor 1201, the visual perception electronic device 1200 may further include:
a bus 1203 for connecting the processor 1201 and the memory 1202. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and the like, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 1202 and the processor 1201 are implemented integrally on a single chip, the memory 1202 and the processor 1201 may communicate through an internal interface.
Fig. 13 is a schematic structural diagram of an electronic device for perceptual network training provided in the present application. As shown in fig. 13, the perceptual network training electronic device 1300 may include: at least one processor 1301 and a memory 1302. Fig. 13 illustrates the electronic device with one processor as an example.
The memory 1302 stores programs. In particular, the program may include program code including computer operating instructions.
Memory 1302 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Processor 1301 is configured to execute the computer-executable instructions stored in memory 1302 to implement the method for training perceptual networks described in the above method embodiments.
The processor 1301 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
Alternatively, the memory 1302 may be separate or integrated with the processor 1301. When the memory 1302 is a device separate from the processor 1301, the perceptual network training electronic device 1300 may further include:
a bus 1303 for connecting the processor 1301 and the memory 1302. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and the like, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 1302 and the processor 1301 are integrated into a single chip, the memory 1302 and the processor 1301 may communicate through an internal interface.
The present application also provides a computer-readable storage medium, which may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and in particular, the computer-readable storage medium stores program instructions for the visual perception method in the above embodiments.
The present application also provides a computer-readable storage medium, which may include: various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk. In particular, the computer-readable storage medium stores program instructions, and the program instructions are used in the method for training the perceptual network in the above embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for training a perceptual network, comprising:
acquiring image data containing a perception target and model data, wherein the image data comprises: a two-dimensional image and an annotation, the model data comprising: a three-dimensional model;
generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm;
training a perception network to be trained according to the edited image and the label to determine the perception network;
generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm, wherein the method comprises the following steps:
determining a moving part corresponding to the perception target;
extracting a first visible region of the moving part from the two-dimensional image;
generating the edited image according to the first visible region and the three-dimensional model;
generating the edited image according to the first visible region and the three-dimensional model, including:
determining pose information of the moving part according to the moving part, wherein the pose information is a matrix formed by moving states of the moving part in 6-degree-of-freedom space;
generating a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using a projection model;
determining a second visible area according to the three-dimensional point cloud and the pose information, wherein the second visible area is a visible area of the moving part at the position after the moving;
generating the edited image according to the second visible region and the three-dimensional model by using a filling algorithm;
generating the edited image according to the second visible region and the three-dimensional model by using a filling algorithm, including:
aligning the second visible region with the three-dimensional model to determine an invisible region;
determining a filling image of the invisible area according to the three-dimensional model by utilizing a rendering technology;
and overlapping the filling image and the second visible area, and replacing a moving part in the two-dimensional image by using the overlapped image to generate the editing image.
2. The method for training the perceptual network according to claim 1, further comprising, after the determining the second visible region:
and smoothing the second visible region by using a smoothing algorithm.
3. The method for training the perception network according to any one of claims 1-2, wherein the training the perception network to be trained according to the edited image and the label to determine the perception network comprises:
respectively extracting the characteristics of the edited image by using a main backbone network and an auxiliary backbone network to determine main characteristics and auxiliary characteristics;
merging the main feature and the auxiliary feature to obtain a correlation feature;
determining a state vector of the moving part according to the correlation features and a regression sub-network;
and training the perception network to be trained according to the state vector and the label to determine the perception network.
4. The method for training the perceptive network according to claim 3, wherein the training the perceptive network to be trained according to the state vector and the label to determine the perceptive network comprises:
calculating a cross entropy loss function according to the state vector and the label;
and training the perception network to be trained by utilizing the cross entropy loss function so as to determine the perception network.
5. The method for training the perceptual network according to claim 3, wherein before the combining the main feature and the auxiliary feature to obtain the associated feature, further comprising:
the main backbone network is configured with a first weight and the auxiliary backbone network is configured with a second weight;
pre-training the main backbone network and the auxiliary backbone network, and determining the first weight and the second weight.
6. The method for training the perceptual network according to claim 5, wherein the pre-training comprises:
acquiring an actual test image and a general detection image;
carrying out perception training on the backbone network by utilizing the actual test image;
and performing perception training on the auxiliary backbone network by using the universal detection image.
7. The method for training the perceptual network of claim 3, wherein the main backbone network and the auxiliary backbone network are the same target detection network.
8. A perceptual network training apparatus, comprising:
an obtaining module, configured to obtain image data including a perception target and model data, where the image data includes: a two-dimensional image and an annotation, the model data comprising: a three-dimensional model;
the image editing module is used for generating an edited image according to the two-dimensional image and the three-dimensional model by using a preset editing algorithm;
the training module is used for training the perception network to be trained according to the edited image and the label so as to determine the perception network;
in one possible design, the image editing module is specifically configured to:
determining a moving part corresponding to the perception target;
extracting a first visible region of the moving part from the two-dimensional image;
determining pose information of the moving part according to the moving part, wherein the pose information is a matrix formed by moving states of the moving part in 6-degree-of-freedom space;
generating a three-dimensional point cloud of the first visible area according to the first visible area and the pose information by using a projection model;
determining a second visible area according to the three-dimensional point cloud and the pose information, wherein the second visible area is a visible area of the moving part at the position after the moving;
aligning the second visible region with the three-dimensional model to determine an invisible region;
determining a filling image of the invisible area according to the three-dimensional model by utilizing a rendering technology;
and overlapping the filling image and the second visible area, and replacing a moving part in the image by using the overlapped image to generate the editing image.
9. An electronic device, comprising:
a processor; and the number of the first and second groups,
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of training of a perceptual network of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of training a perceptual network according to any one of claims 1 to 7.
CN202010530027.7A 2020-06-11 2020-06-11 Visual perception and perception network training method, device, equipment and storage medium Active CN111785085B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010530027.7A CN111785085B (en) 2020-06-11 2020-06-11 Visual perception and perception network training method, device, equipment and storage medium
US17/199,338 US11875546B2 (en) 2020-06-11 2021-03-11 Visual perception method and apparatus, perception network training method and apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010530027.7A CN111785085B (en) 2020-06-11 2020-06-11 Visual perception and perception network training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111785085A CN111785085A (en) 2020-10-16
CN111785085B true CN111785085B (en) 2021-08-27

Family

ID=72756194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010530027.7A Active CN111785085B (en) 2020-06-11 2020-06-11 Visual perception and perception network training method, device, equipment and storage medium

Country Status (2)

Country Link
US (1) US11875546B2 (en)
CN (1) CN111785085B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205070B (en) * 2021-05-27 2024-02-20 三一专用汽车有限责任公司 Visual perception algorithm optimization method and system
CN114519381A (en) * 2021-12-31 2022-05-20 上海仙途智能科技有限公司 Sensing method and device based on multitask learning network, storage medium and terminal
CN116150520B (en) * 2022-12-30 2023-11-14 联通智网科技股份有限公司 Data processing method, device, equipment and storage medium
CN116861262B (en) * 2023-09-04 2024-01-19 苏州浪潮智能科技有限公司 Perception model training method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2383415B (en) * 2000-09-08 2005-02-23 Automotive Tech Int Vehicle wireless sensing and communication system
US8620026B2 (en) * 2011-04-13 2013-12-31 International Business Machines Corporation Video-based detection of multiple object types under varying poses
US8498448B2 (en) * 2011-07-15 2013-07-30 International Business Machines Corporation Multi-view object detection using appearance model transfer from similar scenes
US10146318B2 (en) * 2014-06-13 2018-12-04 Thomas Malzbender Techniques for using gesture recognition to effectuate character selection
WO2016104800A1 (en) * 2014-12-25 2016-06-30 アイシン・エィ・ダブリュ株式会社 Control device for vehicle drive transmission device
US10789717B2 (en) * 2017-11-24 2020-09-29 Electronics And Telecommunications Research Institute Apparatus and method of learning pose of moving object
CN109949402A (en) * 2017-12-20 2019-06-28 艾迪普(北京)文化科技股份有限公司 Threedimensional model generation method based on image recognition
US10977520B2 (en) * 2018-12-18 2021-04-13 Slyce Acquisition Inc. Training data collection for computer vision
CN109886100A (en) * 2019-01-14 2019-06-14 苏州工业园区职业技术学院 A kind of pedestrian detecting system based on Area generation network
CN110175595B (en) * 2019-05-31 2021-03-02 北京金山云网络技术有限公司 Human body attribute recognition method, recognition model training method and device
CN110366107A (en) * 2019-07-09 2019-10-22 三星电子(中国)研发中心 Vehicle communication method and the device for using this method
CN111178253B (en) * 2019-12-27 2024-02-27 佑驾创新(北京)技术有限公司 Visual perception method and device for automatic driving, computer equipment and storage medium
US10911775B1 (en) * 2020-03-11 2021-02-02 Fuji Xerox Co., Ltd. System and method for vision-based joint action and pose motion forecasting

Also Published As

Publication number Publication date
CN111785085A (en) 2020-10-16
US11875546B2 (en) 2024-01-16
US20210387646A1 (en) 2021-12-16

Similar Documents

Publication Publication Date Title
CN111785085B (en) Visual perception and perception network training method, device, equipment and storage medium
CN105391970B (en) The method and system of at least one image captured by the scene camera of vehicle is provided
Soltani et al. Skeleton estimation of excavator by detecting its parts
Fritsch et al. Monocular road terrain detection by combining visual and spatial information
CN106169082A (en) Training grader is with the method and system of the object in detection target environment image
CN105512683A (en) Target positioning method and device based on convolution neural network
Nakajima et al. Semantic object selection and detection for diminished reality based on slam with viewpoint class
Zitnick et al. The role of image understanding in contour detection
CN109658442A (en) Multi-object tracking method, device, equipment and computer readable storage medium
Guo et al. Matching vehicles under large pose transformations using approximate 3d models and piecewise mrf model
US11748998B1 (en) Three-dimensional object estimation using two-dimensional annotations
CN111126393A (en) Vehicle appearance refitting judgment method and device, computer equipment and storage medium
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
KR20210060535A (en) Analysis of dynamic spatial scenarios
Guo et al. Gesture recognition of traffic police based on static and dynamic descriptor fusion
Bruno et al. Analysis and fusion of 2d and 3d images applied for detection and recognition of traffic signs using a new method of features extraction in conjunction with deep learning
JP2014013432A (en) Featured value extraction device and featured value extraction method
CN109658523A (en) The method for realizing each function operation instruction of vehicle using the application of AR augmented reality
JP7078564B2 (en) Image processing equipment and programs
US20220245860A1 (en) Annotation of two-dimensional images
KR102301635B1 (en) Method of inferring bounding box using artificial intelligence model and computer apparatus of inferring bounding box
JP2014052977A (en) Association device and computer program
Lingtao et al. Object viewpoint classification based 3D bounding box estimation for autonomous vehicles
CN115393379A (en) Data annotation method and related product
Padilha et al. Motion-aware ghosted views for single layer occlusions in augmented reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant