WO2021097845A1 - Image generation method for a simulated scene, electronic device and storage medium (一种仿真场景的图像生成方法、电子设备和存储介质) - Google Patents

Image generation method for a simulated scene, electronic device and storage medium (一种仿真场景的图像生成方法、电子设备和存储介质) Download PDF

Info

Publication number
WO2021097845A1
Authority
WO
WIPO (PCT)
Prior art keywords
instance
network
segmentation information
information
scene
Prior art date
Application number
PCT/CN2019/120408
Other languages
English (en)
French (fr)
Inventor
于海泳
Original Assignee
驭势(上海)汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 驭势(上海)汽车科技有限公司 filed Critical 驭势(上海)汽车科技有限公司
Priority to CN201980002612.5A (CN110998663B)
Priority to PCT/CN2019/120408
Publication of WO2021097845A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Definitions

  • the embodiments of the present disclosure relate to the technical field, and in particular to a method for generating an image of a simulated scene, an electronic device, and a storage medium.
  • Simulation is an important part of technology exploration and technology verification and testing in the current research and development process of artificial intelligence technologies such as intelligent driving and robots.
  • simulation scenarios can generate massive amounts of training data for training computer vision algorithms (target detection and recognition, segmentation, tracking, etc.) and decision-making algorithms (imitation learning, reinforcement learning, etc.), and they provide almost unlimited scenarios for later algorithm verification and testing.
  • to train and verify computer vision algorithms in simulation, simulation scenarios need to be built.
  • the current process of building a simulation scenario is as follows: first, a large amount of manpower and material resources is spent on on-site surveying and mapping, and then a model is manually built in the simulation engine based on the surveying and mapping data, with details such as color, texture, and lighting refined by hand. It can be seen that the construction process of the simulation scene is cumbersome, time-consuming, labor-intensive, and inefficient; moreover, the built simulation scene has poor scalability, and simulation-engine rendering places high demands on device hardware and software.
  • At least one embodiment of the present disclosure provides an image generation method, electronic device, and storage medium of a simulated scene.
  • an embodiment of the present disclosure proposes an image generation method of a simulation scene, the method includes:
  • acquiring semantic segmentation information and instance segmentation information of a scene white model; receiving instance text information of the scene white model, where the instance text information is editable information and is used to describe the attributes of an instance;
  • generating an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network.
  • the embodiments of the present disclosure also provide an electronic device, including: a processor and a memory; the processor is configured to execute the steps of the method described in the first aspect by calling a program or instruction stored in the memory.
  • the embodiments of the present disclosure also propose a non-transitory computer-readable storage medium for storing a program or instruction, where the program or instruction causes a computer to execute the steps of the method described in the first aspect.
  • only the scene white model needs to be established; then, based on the semantic segmentation information and instance segmentation information of the scene white model, an image of the simulated scene can be generated without refining attributes such as color, texture, and lighting during scene creation, which improves generation efficiency. Moreover, the instance text information is editable, and different instance text information describes the attributes of different instances and corresponds to different instances, which diversifies the simulation scene.
  • FIG. 1 is a schematic diagram of a simulation scenario provided by an embodiment of the present disclosure
  • Figure 2 is a block diagram of an electronic device provided by an embodiment of the present disclosure
  • FIG. 3 shows a simulation scene image generation system provided by an embodiment of the present disclosure
  • FIG. 4 is a flowchart of an image generation method of a simulation scene provided by an embodiment of the present disclosure
  • FIG. 5 is an architecture diagram of a self-encoding network provided by an embodiment of the present disclosure.
  • FIG. 6 is an architecture diagram of a generative adversarial network provided by an embodiment of the present disclosure.
  • FIG. 7 is an architecture diagram of a discriminator network provided by an embodiment of the present disclosure.
  • the current process of building a simulation scene is as follows: first, a large amount of manpower and material resources is spent on on-site surveying and mapping, and then a model is manually built in the simulation engine based on the surveying and mapping data, with details such as color, texture, and lighting refined by hand. It can be seen that the construction process of the simulation scene is cumbersome, time-consuming, labor-intensive, and inefficient; moreover, the built simulation scene has poor scalability, and simulation-engine rendering places high demands on device hardware and software.
  • the embodiments of the present disclosure provide an image generation solution for a simulation scene: only a scene white model needs to be established, and then, based on the semantic segmentation information and instance segmentation information of the scene white model, an image of the simulation scene can be generated without refining attributes such as color, texture, and lighting during scene creation, which improves generation efficiency. Moreover, the instance text information is editable, and different instance text information describes the attributes of different instances and corresponds to different instances, which diversifies the simulation scene.
  • the image generation solution of the simulated scene provided by the embodiments of the present disclosure can be applied to electronic devices.
  • the simulation scene is, for example, an intelligent driving simulation scene
  • the simulation scene is, for example, a simulation scene generated by a simulation engine.
  • the simulation engine may include, but is not limited to: Unreal Engine, Unity, etc.
  • Fig. 1 is a schematic diagram of a simulation scenario provided by an embodiment of the present disclosure.
  • the simulation scenario may include, but is not limited to: static objects such as green belts, sidewalks, motor vehicle lanes, street lights, trees, and other facilities found in the real environment;
  • and dynamic objects such as at least one virtual vehicle 101, an intelligent driving vehicle 102, and pedestrians.
  • the virtual vehicle 101 may include a wayfinding system and other systems for driving.
  • the virtual vehicle 101 may include: a wayfinding system, a perception system, a decision-making system, a control system, and other systems for driving.
  • the path-finding system is used to construct a road network topology, and to find a path based on the constructed road network topology.
  • the path finding system is used to obtain a high-precision map, and based on the high-precision map, construct a road network topology.
  • the high-precision map is a geographic map used in the field of intelligent driving
  • the high-precision map is a map describing a simulation scene.
  • Compared with conventional maps, high-precision maps differ in that: 1) they include a large amount of driving-assistance information, such as an accurate three-dimensional representation of the road network, including intersection layouts and the positions of road signs; 2) they also include a large amount of semantic information, such as the meaning of the different colors of traffic lights, the speed limit of a road, and the position where a left-turn lane begins; 3) they can reach centimeter-level accuracy, ensuring the safe driving of intelligent driving vehicles. Therefore, the path generated by the wayfinding system can provide the decision-making system with a richer basis for planning and decision-making, such as the number, width, and orientation of lanes at the current location and the positions of various traffic attachments.
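  • As a rough illustration of the wayfinding described above, the following Python sketch builds a toy road-network topology as a weighted graph and runs a shortest-path search over it; the class and function names (RoadNetwork, find_path) and the node identifiers are hypothetical, since the patent does not specify concrete data structures.

```python
import heapq
from collections import defaultdict

# Hypothetical road-network graph: nodes are lane-segment endpoints taken from a
# high-precision map, edges carry segment length (metres) as the traversal cost.
class RoadNetwork:
    def __init__(self):
        self.adj = defaultdict(list)          # node_id -> [(neighbour_id, cost), ...]

    def add_segment(self, a, b, length):
        self.adj[a].append((b, length))       # one-way lane segment a -> b

    def find_path(self, start, goal):
        """Dijkstra search over the road-network topology."""
        queue, seen = [(0.0, start, [start])], set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == goal:
                return cost, path
            if node in seen:
                continue
            seen.add(node)
            for nxt, w in self.adj[node]:
                if nxt not in seen:
                    heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
        return float("inf"), []

# Usage: two candidate routes between intersections "A" and "C".
net = RoadNetwork()
net.add_segment("A", "B", 120.0)
net.add_segment("B", "C", 80.0)
net.add_segment("A", "C", 260.0)
print(net.find_path("A", "C"))                # -> (200.0, ['A', 'B', 'C'])
```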
  • the perception system is used for collision detection (Collision Detection). In some embodiments, the perception system is used to perceive obstacles in a simulated scene.
  • the decision-making system is used to decide the driving behavior of the virtual vehicle 101 based on the wayfinding path generated by the wayfinding system, the obstacles sensed by the perception system, and the kinematics information of the virtual vehicle 101 through a preset behavior tree (Behavior Tree).
  • the kinematics information includes, but is not limited to, speed, acceleration, and other motion-related information, for example.
  • the control system is used to control the driving behavior of the virtual vehicle 101 based on the decision made by the decision system, and feed back the kinematics information of the virtual vehicle 101 to the decision system.
  • each system in the virtual vehicle 101 is only a logical function division, and there may be other division methods in actual implementation.
  • the function of the wayfinding system can be integrated into the perception system, the decision-making system, or the control system; any two or more systems can also be implemented as one system; and any one system can also be divided into multiple subsystems.
  • each system or subsystem can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professional technicians can use different methods for each specific application to achieve the described functions.
  • the intelligent driving vehicle 102 at least includes: a sensor group and an intelligent driving system.
  • the sensor group is used to collect the data of the external environment of the vehicle and detect the position data of the vehicle.
  • the sensor group is also used to collect dynamic data of the vehicle.
  • the intelligent driving system is used to obtain the data of the sensor group, perform environment perception and vehicle positioning based on the data of the sensor group, perform path planning and decision-making based on the environment perception information and vehicle positioning information, and generate vehicle control instructions based on the planned path, so as to control the vehicle to follow the planned route.
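  • The following minimal sketch illustrates one tick of the perceive/localize/plan/control loop described above, assuming stub functions for each stage; the function names and the dictionary-based sensor frame are illustrative assumptions rather than interfaces defined by the patent.

```python
# Hypothetical single tick of the intelligent-driving pipeline; the stubs only
# show how the stages feed each other, not real perception or planning logic.
def perceive(sensor_frame):
    return {"obstacles": sensor_frame.get("detections", [])}

def localize(sensor_frame):
    return sensor_frame.get("gnss", (0.0, 0.0))

def plan(env, pose, route):
    # keep the pre-computed route unless an obstacle blocks the next waypoint
    return route[1:] if env["obstacles"] else route

def control(path):
    return {"steer": 0.0, "throttle": 0.2 if path else 0.0}

def drive_tick(sensor_frame, route):
    env = perceive(sensor_frame)
    pose = localize(sensor_frame)
    path = plan(env, pose, route)
    return control(path), path

cmd, remaining = drive_tick({"gnss": (10.0, 5.0), "detections": []}, ["wp1", "wp2"])
print(cmd, remaining)
```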
  • since the virtual vehicle 101 and the intelligent driving vehicle 102 are both generated in the simulation scenario and are not real vehicles, the virtual vehicle 101 and the intelligent driving vehicle 102 can be controlled by a background processor, and the background processor can be a server, computer, tablet computer, or other hardware device with processing capabilities.
  • FIG. 2 is a block diagram of an electronic device provided by an embodiment of the disclosure.
  • the electronic equipment can support the operation of the simulation system.
  • the simulation system can provide simulation scenarios and generate virtual vehicles and provide other functions for simulation.
  • the simulation system may be a simulation system based on a simulation engine.
  • the electronic device includes: at least one processor 201, at least one memory 202, and at least one communication interface 203.
  • the various components in the electronic device are coupled together through the bus system 204.
  • the communication interface 203 is used for information transmission with external devices.
  • the bus system 204 is used to implement communication connections between these components.
  • in addition to a data bus, the bus system 204 also includes a power bus, a control bus, and a status signal bus.
  • for clarity of illustration, the various buses are all labeled as the bus system 204 in FIG. 2.
  • the memory 202 in this embodiment may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the memory 202 stores the following elements, executable units or data structures, or a subset of them, or an extended set of them: operating systems and applications.
  • the operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, which are used to implement various basic services and process hardware-based tasks.
  • the application programs include various applications, such as a media player (Media Player), a browser (Browser), etc., and are used to implement various application services.
  • a program that implements the method for generating an image of a simulation scene provided by an embodiment of the present disclosure may be included in an application program.
  • the processor 201 calls a program or instruction stored in the memory 202 (specifically, a program or instruction stored in an application program), and the processor 201 is used to execute the steps of the embodiments of the method for generating an image of a simulation scene provided by the embodiments of the present disclosure.
  • the method for generating an image of a simulated scene may be applied to the processor 201 or implemented by the processor 201.
  • the processor 201 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 201 or instructions in the form of software.
  • the aforementioned processor 201 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method for generating an image of a simulated scene may be directly embodied as executed and completed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit may be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 202, and the processor 201 reads the information in the memory 202 and completes the steps of the method in combination with its hardware.
  • FIG. 3 is a block diagram of a simulation scene image generation system 300 provided by an embodiment of the disclosure.
  • the simulation scene image generation system 300 may be implemented as a system running in the electronic device shown in FIG. 2 or a part of a simulation system running in the electronic device.
  • the simulation scene image generation system may be stored in the memory 202 of the electronic device shown in FIG. 2.
  • the processor 201 in FIG. 2 implements the functions of the units included in the simulation scene image generation system 300 by calling the simulation scene image generation system 300 stored in the memory 202.
  • the simulation scene image generation system 300 may be applied to the processor 201 of the electronic device shown in FIG. 2 or implemented by the processor 201.
  • the units of the simulation scene image generation system 300 can be completed by hardware integrated logic circuits in the processor 201 or instructions in the form of software.
  • the simulation scene image generation system 300 may be divided into multiple units, for example, it may include but not limited to: an acquisition unit 301, a receiving unit 302, and a generating unit 303.
  • the acquiring unit 301 is configured to acquire semantic segmentation information and instance segmentation information of the scene white model.
  • the scene white model can be understood as a scene model without adding color, texture, lighting and other attribute information.
  • the scene white model is established by a simulation engine, and the semantic segmentation information and instance segmentation information of the scene white model are generated by the simulation engine based on the scene white model.
  • the scene white model is manually established in the simulation engine, and there is no need to manually add attribute information such as color, texture, and lighting; the simulation engine can automatically generate semantic segmentation information and instance segmentation information based on the scene white model.
  • the semantic segmentation information is used to distinguish or describe different categories of objects in the simulation scene, such as people, cars, animals, and buildings; the instance segmentation information is used to distinguish or describe each individual object in the simulation scene, such as different people, different cars, different animals, and different buildings. That is, for an object in the simulation scene, the semantic segmentation information indicates whether the object is a person or a car; if it is a car, the instance segmentation information indicates whether the car is an Audi or a Volkswagen, and the instance text information indicates whether the car is a white car or a black car.
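  • The following toy sketch, using made-up class IDs, instance IDs, and text strings, shows one plausible way the semantic segmentation map, the instance segmentation map, and the editable instance text information could be represented; the patent does not prescribe a concrete encoding.

```python
import numpy as np

# Toy 4x4 maps a simulation engine might emit for the scene white model.
# Class IDs, instance IDs, and text strings are made-up illustration values.
CLASSES = {0: "background", 1: "car", 2: "person"}

semantic_map = np.array([[0, 0, 1, 1],
                         [0, 0, 1, 1],
                         [2, 0, 1, 1],
                         [2, 0, 0, 0]])          # which category each pixel belongs to

instance_map = np.array([[0, 0, 7, 7],
                         [0, 0, 7, 7],
                         [3, 0, 8, 8],
                         [3, 0, 0, 0]])          # which individual object each pixel belongs to

# Editable instance text information describing per-instance attributes.
instance_text = {7: "a white Audi sedan under soft morning light",
                 8: "a black Volkswagen hatchback",
                 3: "a pedestrian wearing a red coat"}

# Pixels of class "car" split into two separate instances (IDs 7 and 8).
car_instances = np.unique(instance_map[semantic_map == 1])
print(car_instances, [instance_text[i] for i in car_instances if i in instance_text])
```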
  • the receiving unit 302 is configured to receive instance text information of the scene white model; the instance text information is editable information and used to describe the attributes of the instance. By changing the content of the instance text information, the editing of instance attributes is realized, and different instance attributes correspond to different instances.
  • the instance text information of the scene white model is manually input, and in the process of manually inputting the instance text information, the content of the instance text information can be edited, and the receiving unit 302 receives the manually input instance text information.
  • since the instance text information is used to describe the attributes of an instance, setting the instance text information as editable information makes the attributes of the instance editable.
  • the simulation scene is therefore a scene whose instance attributes are editable.
  • the attributes of the instance may include, but are not limited to, color, texture, lighting, and the like.
  • the generating unit 303 is configured to generate an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network (Generative Adversarial Network, GAN).
  • the instance text information is not directly used as an input to the generative adversarial network; instead, the generating unit 303 generates a feature map based on the instance text information and at least one real image corresponding to the scene white model. The real images are only provided during the training process.
  • the generating unit 303 then generates the image of the simulation scene through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map.
  • the generating unit 303 concatenates the semantic segmentation information, the instance segmentation information, and the feature map (essentially a vector concatenation, for example concatenation along the channel dimension, or element-wise addition) and then inputs the result into the pre-trained generative adversarial network to generate the image of the simulation scene.
  • the feature map input to the generative adversarial network is used to adjust attributes such as the color, texture, and lighting of the instances in the scene.
  • the image of the simulation scene generated by the generating unit 303 is a high-resolution image, so the simulation scene is a high-resolution scene, which facilitates technology exploration and verification testing during artificial intelligence research and development.
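  • A minimal sketch of the concatenation step described above, assuming a one-hot semantic map, an instance boundary channel derived from the instance IDs, and an arbitrary 16-channel feature map; the tensor sizes and the boundary-map encoding are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

# Shapes below (1 sample, 8 classes, 256x256, 16 feature channels) are arbitrary
# illustration values, not figures taken from the patent.
B, H, W, NUM_CLASSES, FEAT_C = 1, 256, 256, 8, 16

semantic_ids = torch.randint(0, NUM_CLASSES, (B, H, W))            # per-pixel class IDs
semantic_onehot = F.one_hot(semantic_ids, NUM_CLASSES).permute(0, 3, 1, 2).float()

instance_ids = torch.randint(0, 32, (B, 1, H, W)).float()          # per-pixel instance IDs
# One common way to feed instance segmentation to a GAN is an instance boundary map.
edges = (instance_ids[..., :, 1:] != instance_ids[..., :, :-1]).float()
instance_edges = F.pad(edges, (1, 0))                               # back to (B, 1, H, W)

feature_map = torch.randn(B, FEAT_C, H, W)                          # from the text/image branch

# "Concatenation" here means stacking along the channel dimension before the generator.
gan_input = torch.cat([semantic_onehot, instance_edges, feature_map], dim=1)
print(gan_input.shape)   # torch.Size([1, 8 + 1 + 16, 256, 256])
```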
  • the generating unit 303 generates the feature map based on the instance text information and at least one real image corresponding to the scene white model, specifically as follows: the instance text information is subjected to embedding processing and conditioning augmentation processing to obtain a processing result; the at least one real image corresponding to the scene white model is encoded to obtain a hidden variable corresponding to each real image, where a hidden variable can be understood as an intermediate variable and one image corresponds to one hidden variable; and the hidden variable corresponding to each real image is sampled to obtain a sampling result.
  • hidden-variable sampling is used to adjust the attribute information of the instances in the simulation scene, diversifying the images of the simulation scene; the processing result and the sampling result are then decoded to generate the feature map.
  • the generating unit 303 performs embedding processing and conditioning augmentation processing on the instance text information to obtain the processing result, specifically: the instance text information is input into a pre-trained embedding network, and the output of the embedding network is passed through a pre-trained conditioning augmentation (Conditioning Augmentation) network to obtain the processing result.
  • the embedding network and the conditioning augmentation network are both neural networks whose parameters are obtained through pre-training.
  • the generating unit 303 inputs the at least one real image corresponding to the scene white model into the encoder of a pre-trained self-encoding network for encoding, obtaining the hidden variable corresponding to each real image;
  • the self-encoding network samples the hidden variable corresponding to each real image to obtain the sampling result;
  • and the decoder of the self-encoding network decodes the processing result and the sampling result to generate the feature map.
  • the self-encoding network is a variational self-encoding network.
  • the architecture of the self-encoding network is shown in FIG. 5 and includes a convolutional layer and a deconvolutional layer, where the convolutional layer can be understood as the encoder of the self-encoding network and the deconvolutional layer can be understood as its decoder.
  • the input information of the self-encoding network is the at least one real image corresponding to the scene white model, that is, the input of the convolutional layer of the self-encoding network is the at least one real image corresponding to the scene white model.
  • the output information of the self-encoding network is the feature map, that is, the output of the deconvolutional layer of the self-encoding network is the feature map.
  • in FIG. 5, the instance text information is input into the pre-trained embedding network, and the output of the embedding network is a set of low-dimensional vectors.
  • the output of the embedding network is passed through the pre-trained conditioning augmentation network to obtain the processing result.
  • the self-encoding network samples the hidden variables corresponding to each real image to obtain the sampling results.
  • the processing result and the sampling result are concatenated (essentially a vector concatenation, for example along the channel dimension, or element-wise addition) and then input into the deconvolutional layer of the self-encoding network for decoding, generating the feature map.
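  • The following PyTorch sketch mirrors the FIG. 5 pipeline under stated assumptions: a linear conditioning augmentation layer that samples a conditioning vector from the embedded text, a convolutional encoder that produces a hidden variable per real image, sampling of that hidden variable, and a deconvolutional decoder that turns the concatenated conditioning vector and sample into a feature map. All layer sizes and the module names (ConditioningAugmentation, FeatureMapNet) are hypothetical.

```python
import torch
import torch.nn as nn

# Layer sizes are illustrative; the patent does not specify dimensions.
class ConditioningAugmentation(nn.Module):
    """Maps a text embedding to a Gaussian and samples a conditioning vector."""
    def __init__(self, embed_dim=128, cond_dim=64):
        super().__init__()
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, text_embed):
        mu, logvar = self.fc(text_embed).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class FeatureMapNet(nn.Module):
    """Conv encoder over real images + deconv decoder, conditioned on the text code."""
    def __init__(self, cond_dim=64, latent_ch=32):
        super().__init__()
        self.encoder = nn.Sequential(                     # 64x64 image -> 16x16 latent
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_ch * 2, 4, stride=2, padding=1))
        self.decoder = nn.Sequential(                     # 16x16 -> 64x64 feature map
            nn.ConvTranspose2d(latent_ch + cond_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1))

    def forward(self, real_image, cond_vec):
        mu, logvar = self.encoder(real_image).chunk(2, dim=1)      # hidden variable per image
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # sample the hidden variable
        cond = cond_vec[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
        return self.decoder(torch.cat([z, cond], dim=1))           # decode to the feature map

text_embed = torch.randn(1, 128)                 # stands in for the embedding-network output
cond_vec = ConditioningAugmentation()(text_embed)
feature_map = FeatureMapNet()(torch.randn(1, 3, 64, 64), cond_vec)
print(feature_map.shape)                         # torch.Size([1, 16, 64, 64])
```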
  • the generative adversarial network used by the generating unit 303 includes a generative network and a discriminator network.
  • the generative network is composed of multiple nested generators, and each generator includes a convolutional layer and a deconvolutional layer.
  • the last feature-map output of the deconvolutional layer of an inner (nested) generator is used as an input to the deconvolutional layer of the generator nested outside it.
  • since the discriminator network is mainly used to train the generative network, and the generative network independently generates the image of the simulation scene after its training is completed, the following description of the generative network's function refers to the generative adversarial network rather than to the generative network alone.
  • that is, where the image of the simulation scene is said to be generated by the generative adversarial network, those skilled in the art will understand that the image is generated by the generative network of the generative adversarial network.
  • mentioning the discriminator network separately when describing joint training does not mean that the discriminator network is not part of the generative adversarial network.
  • the generating unit 303 is specifically configured to: input the semantic segmentation information, the instance segmentation information, and the feature map into the convolutional layer of the outermost generator of the generative adversarial network; down-sample the semantic segmentation information, the instance segmentation information, and the feature map and input them into the convolutional layers of the inner generators of the generative adversarial network; and output the image of the simulation scene from the deconvolutional layer of the outermost generator of the generative adversarial network.
  • the down-sampling factors corresponding to different inner generators may differ.
  • the input of an inner (nested) generator is down-sampled, so its output resolution is reduced and the generator attends to the global information of the output.
  • the output of the deconvolutional layer of the outermost generator is the output of the generative adversarial network; it has a higher resolution and attends to the detailed information of the output. Overall, the images of the simulation scene output by the generative adversarial network attend to both the whole and the details.
  • the architecture of the generative adversarial network is shown in FIG. 6; it consists of N (N ≥ 3) nested generators, denoted from the inside out as generator 1, generator 2, ..., generator N.
  • each generator includes a convolutional layer and a deconvolutional layer.
  • the input information of the generative adversarial network is the semantic segmentation information, the instance segmentation information, and the feature map.
  • the output information of the generative adversarial network is the image of the simulation scene, that is, the output of the deconvolutional layer of generator N is the image of the simulation scene.
  • in FIG. 6, taking N = 3 as an example, the input information of the generative adversarial network is input into the convolutional layer of generator N.
  • the input information of the generative adversarial network is down-sampled and then input into the convolutional layer of generator 2.
  • the input information of the generative adversarial network is down-sampled again and then input into the convolutional layer of generator 1.
  • the purpose of down-sampling is to reduce the resolution, for example by a ratio of 1/2 × 1/2; if the output of generator N is at 200 × 200 resolution, then the output of generator 2 is at 100 × 100 resolution and the output of generator 1 is at 50 × 50 resolution. It can be seen that generator N has a high resolution and attends more to details, while generator 2 and generator 1 have lower resolutions and attend more to the whole. Therefore, the high-definition images output by the generative adversarial network are more plausible, attending to both the whole and the details.
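  • The sketch below is a minimal coarse-to-fine reading of FIG. 6 with N = 3 nested generators: each generator has a convolutional (down-sampling) stage and a deconvolutional (up-sampling) stage, the inner generator receives a down-sampled copy of the shared input, and its last deconvolutional feature map is fed into the outer generator's deconvolutional path. Channel counts, the average-pooling down-sampling, and the to_rgb output head are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal coarse-to-fine sketch; channel counts and the absence of residual
# blocks are simplifications, not patent specifics.
class NestedGenerator(nn.Module):
    def __init__(self, in_ch, feat=32, inner=None):
        super().__init__()
        self.inner = inner                                     # generator nested inside
        self.conv = nn.Sequential(nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU())
        dec_in = feat + (feat if inner is not None else 0)     # fuse the inner feature map
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(dec_in, feat, 4, stride=2, padding=1), nn.ReLU())
        self.to_rgb = nn.Conv2d(feat, 3, 3, padding=1)

    def forward(self, x):
        h = self.conv(x)
        if self.inner is not None:
            # the inner generator sees a down-sampled copy of the same input
            inner_feat, _ = self.inner(F.avg_pool2d(x, 2))
            h = torch.cat([h, inner_feat], dim=1)
        feat = self.deconv(h)                                  # last deconvolutional feature map
        return feat, self.to_rgb(feat)                         # features for the outer level + image

IN_CH = 25                                                     # e.g. 8 classes + 1 edge + 16 features
g1 = NestedGenerator(IN_CH)                                    # innermost, coarsest (50 x 50)
g2 = NestedGenerator(IN_CH, inner=g1)                          # middle level (100 x 100)
g3 = NestedGenerator(IN_CH, inner=g2)                          # outermost, full resolution

x = torch.randn(1, IN_CH, 200, 200)                            # concatenated GAN input
_, image = g3(x)
print(image.shape)                                             # torch.Size([1, 3, 200, 200])
```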
  • the generative adversarial network, the embedding network, the conditioning augmentation network, and the self-encoding network used by the generating unit 303 are obtained through joint training.
  • the joint training may include: acquiring semantic segmentation information, instance segmentation information, instance text information, and sample images of a sample scene; and then performing joint training based on the semantic segmentation information, the instance segmentation information, the instance text information, and the sample images.
  • the generative adversarial network, embedding network, conditioning augmentation network, and self-encoding network used by the generating unit 303 are jointly trained based on the semantic segmentation information, instance segmentation information, instance text information, and sample images, specifically as follows:
  • the instance text information is input into the embedding network, and the output of the embedding network is passed through the conditioning augmentation network to obtain the processing result; the sample images are input into the encoder of the self-encoding network for encoding to obtain the hidden variable corresponding to each sample image; the self-encoding network samples the hidden variable corresponding to each sample image to obtain the sampling result; the decoder of the self-encoding network decodes the processing result and the sampling result to generate the feature map; the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layer of the outermost generator of the generative adversarial network and, after down-sampling, into the convolutional layers of the inner generators; the deconvolutional layer of the outermost generator outputs a generated image; and the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discriminator network to complete the training.
  • the generated images output by the generative adversarial network are fake pictures used as training data.
  • their feature values are marked as "fake" to indicate that they are randomly generated pictures rather than real pictures.
  • in contrast, the sample images are real pictures, and their feature values can be marked as "real".
  • the generated image, sample images, semantic segmentation information, instance segmentation information, and feature map are passed through the discriminator network and iterated continuously, so that the discriminator network can judge real pictures and fake pictures more and more accurately, thereby providing feedback to the generative adversarial network and enabling it to generate pictures that pass for real.
  • since generative adversarial networks have been shown to converge, training can continue while the judgment probability values of the discriminators have not all converged to 0.5, until the training target is met after multiple iterations.
  • the "training target" may be a preset target specifying whether the images generated by the generative adversarial network meet the requirements.
  • owing to the convergence property of the objective function, the training target of the generative adversarial network may be, for example, that the predicted feature value of a picture meets a specified requirement, for example that it is close to 0.5; training is stopped once convergence to 0.5 is determined to be satisfied.
  • the discriminator network is composed of multiple cascaded discriminators; the input of the uppermost discriminator is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map; the generated image, sample image, semantic segmentation information, instance segmentation information, and feature map are down-sampled and then input into the lower-level discriminators; the down-sampling factors corresponding to discriminators at different levels may differ.
  • the architecture of the discriminator network is shown in FIG. 7; it is composed of N (N ≥ 3) cascaded discriminators, denoted from top to bottom as discriminator 1, discriminator 2, ..., discriminator N.
  • the input information of the discriminator network is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map.
  • the output information of the discriminator network is the judgment probability values.
  • the input information of the discriminator network is input into discriminator 1.
  • the input information of the discriminator network is down-sampled and then input into discriminator 2.
  • the input information of the discriminator network is down-sampled again and then input into discriminator N. Taking N = 3 as an example, if the judgment probability values output by discriminator 1, discriminator 2, and discriminator 3 all converge to 0.5, the joint training ends.
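  • A minimal sketch of the cascaded discriminators of FIG. 7, assuming N = 3 patch-style discriminators that score progressively down-sampled copies of a channel-concatenated input; in training, the real (sample) image and the generated image would each be concatenated with the conditioning channels and scored separately, as in the training-loop sketch further below. Layer sizes and the averaging of the patch map into a single judgment probability value are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of N = 3 cascaded discriminators on down-sampled copies of
# the same input; sizes are illustrative only.
class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, x):
        # averaging the patch map gives one judgment probability value per input
        return torch.sigmoid(self.net(x)).mean(dim=(1, 2, 3))

class CascadedDiscriminators(nn.Module):
    def __init__(self, in_ch, levels=3):
        super().__init__()
        self.discs = nn.ModuleList(PatchDiscriminator(in_ch) for _ in range(levels))

    def forward(self, x):
        probs = []
        for d in self.discs:              # discriminator 1 sees the full resolution,
            probs.append(d(x))            # lower levels see down-sampled copies
            x = F.avg_pool2d(x, 2)
        return probs

# Input = image (3) + semantic one-hot (8) + instance edges (1) + feature map (16).
x = torch.cat([torch.randn(1, c, 200, 200) for c in (3, 8, 1, 16)], dim=1)
print([p.item() for p in CascadedDiscriminators(in_ch=28)(x)])
```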
  • the simulation scene image generation system 300 may be a software system, a hardware system, or a combination of software and hardware.
  • the simulation scene image generation system 300 is a software system running on an operating system
  • the hardware system of an electronic device is a hardware system supporting the operation of the operating system.
  • the division of each unit in the simulation scene image generation system 300 is only a logical function division.
  • for example, the acquiring unit 301, the receiving unit 302, and the generating unit 303 may be implemented as one unit; any of the acquiring unit 301, the receiving unit 302, and the generating unit 303 may also be divided into multiple sub-units.
  • each unit or subunit can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Those skilled in the art can use different methods for each specific application to realize the described functions.
  • FIG. 4 is a flowchart of a method for generating an image of a simulation scene provided by an embodiment of the disclosure.
  • the execution body of the method is an electronic device.
  • the execution body of the method is a simulation scene image generation system running in the electronic device; or the execution body of the method is a simulation system running in the electronic device, where the simulation The scene image generation system can be a part of the simulation system.
  • the image generation method of the simulation scene may include but is not limited to the following steps 401 to 403:
  • in step 401, semantic segmentation information and instance segmentation information of the scene white model are acquired. The scene white model can be understood as a scene model to which no color, texture, lighting, or other attribute information has been added.
  • the scene white model is established by a simulation engine, and the semantic segmentation information and instance segmentation information of the scene white model are generated by the simulation engine based on the scene white model.
  • the scene white model is manually established in the simulation engine, and there is no need to manually add attribute information such as color, texture, and lighting; the simulation engine can automatically generate semantic segmentation information and instance segmentation information based on the scene white model.
  • the semantic segmentation information is used to distinguish or describe different categories of objects in the simulation scene, such as people, cars, animals, and buildings; the instance segmentation information is used to distinguish or describe each individual object in the simulation scene, such as different people, different cars, different animals, and different buildings. That is, for an object in the simulation scene, the semantic segmentation information indicates whether the object is a person or a car; if it is a car, the instance segmentation information indicates whether the car is an Audi or a Volkswagen, and the instance text information indicates whether the car is a white car or a black car.
  • in step 402, instance text information of the scene white model is received; the instance text information of the scene white model is editable information and is used to describe the attributes of an instance.
  • the instance text information of the scene white model is manually input, and in the process of manually inputting the instance text information, the content of the instance text information can be edited, and step 402 receives the manually input instance text information.
  • since the instance text information is used to describe the attributes of an instance, setting the instance text information as editable information makes the attributes of the instance editable.
  • the simulation scene is therefore a scene whose instance attributes are editable.
  • the attributes of the instance may include, but are not limited to, color, texture, lighting, and the like.
  • in step 403, an image of the simulation scene is generated based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network.
  • the instance text information is not directly used as an input to the generative adversarial network; instead, a feature map is generated based on the instance text information and at least one real image corresponding to the scene white model.
  • an image of the simulation scene is then generated through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map.
  • the semantic segmentation information, the instance segmentation information, and the feature map are concatenated (essentially a vector concatenation) and then input into the pre-trained generative adversarial network to generate the image of the simulation scene.
  • the feature map input to the generative adversarial network is used to adjust attributes such as the color, texture, and lighting of the instances in the scene.
  • the generated image of the simulation scene is a high-resolution image,
  • so the simulation scene is a high-resolution scene, which facilitates technology exploration and verification testing during artificial intelligence research and development.
  • the feature map is generated based on the instance text information and at least one real image corresponding to the scene white model, specifically as follows: the instance text information is subjected to embedding processing and conditioning augmentation processing to obtain a processing result; the at least one real image corresponding to the scene white model is encoded to obtain a hidden variable corresponding to each real image.
  • a hidden variable can be understood as an intermediate variable, and one image corresponds to one hidden variable; the hidden variable corresponding to each real image is sampled to obtain a sampling result, where hidden-variable sampling is used to adjust the attribute information of the instances in the simulation scene and thereby diversify the images of the simulation scene; the processing result and the sampling result are then decoded to generate the feature map.
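  • The snippet below illustrates, on toy dimensions, how repeatedly sampling the hidden variable from the same encoder output yields a different latent code on every draw, which is the mechanism the text relies on to diversify the instance attributes and hence the generated images; the sizes and the reparameterization-style sampling are assumptions.

```python
import torch

# Toy illustration: the same (mu, logvar) encoding of one real image yields a
# different hidden-variable sample, and hence a different feature map and
# instance appearance, on every draw.
torch.manual_seed(0)
mu = torch.zeros(1, 4)                      # encoder mean for one real image (toy size)
logvar = torch.full((1, 4), -1.0)           # encoder log-variance

def sample_hidden(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

for draw in range(3):
    z = sample_hidden(mu, logvar)           # each draw -> a different simulated-scene image
    print(draw, z.squeeze().tolist())
```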
  • the instance text information is subjected to embedding processing and conditioning augmentation processing to obtain the processing result, specifically: the instance text information is input into a pre-trained embedding network, and the output of the embedding network is passed through a pre-trained conditioning augmentation (Conditioning Augmentation) network to obtain the processing result.
  • the embedding network and the conditioning augmentation network are both neural networks whose parameters are obtained through pre-training.
  • the at least one real image corresponding to the scene white model is input into the encoder of a pre-trained self-encoding network for encoding, obtaining the hidden variable corresponding to each real image; the self-encoding network samples the hidden variable corresponding to each real image to obtain the sampling result; and the decoder of the self-encoding network decodes the processing result and the sampling result to generate the feature map.
  • the self-encoding network is a variational self-encoding network.
  • the architecture of the self-encoding network is shown in FIG. 5 and includes a convolutional layer and a deconvolutional layer, where the convolutional layer can be understood as the encoder of the self-encoding network and the deconvolutional layer can be understood as its decoder.
  • the input information of the self-encoding network is the at least one real image corresponding to the scene white model, that is, the input of the convolutional layer of the self-encoding network is the at least one real image corresponding to the scene white model.
  • the output information of the self-encoding network is the feature map, that is, the output of the deconvolutional layer of the self-encoding network is the feature map.
  • in FIG. 5, the instance text information is input into the pre-trained embedding network, and the output of the embedding network is a set of low-dimensional vectors.
  • the output of the embedding network is passed through the pre-trained conditioning augmentation network to obtain the processing result.
  • the self-encoding network samples the hidden variables corresponding to each real image to obtain the sampling results.
  • the processing result and the sampling result are concatenated (essentially a vector concatenation) and then input into the deconvolutional layer of the self-encoding network for decoding, generating the feature map.
  • the generative adversarial network includes a generative network and a discriminator network, where the generative network is composed of multiple nested generators, each generator includes a convolutional layer and a deconvolutional layer, and the last feature-map output of the deconvolutional layer of an inner (nested) generator is used as an input to the deconvolutional layer of the generator nested outside it.
  • since the discriminator network is mainly used to train the generative network, and the generative network independently generates the image of the simulation scene after its training is completed, the following description of the generative network's function refers to the generative adversarial network rather than to the generative network alone.
  • that is, where the image of the simulation scene is said to be generated by the generative adversarial network, those skilled in the art will understand that the image is generated by the generative network of the generative adversarial network.
  • mentioning the discriminator network separately when describing joint training does not mean that the discriminator network is not part of the generative adversarial network.
  • the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layer of the outermost generator of the generative adversarial network; the semantic segmentation information, the instance segmentation information, and the feature map are down-sampled and input into the convolutional layers of the inner generators of the generative adversarial network; and the deconvolutional layer of the outermost generator of the generative adversarial network outputs the image of the simulation scene.
  • the down-sampling factors corresponding to different inner generators may differ.
  • the input of an inner (nested) generator is down-sampled, so its output resolution is reduced and the generator attends to the global information of the output.
  • the output of the deconvolutional layer of the outermost generator is the output of the generative adversarial network; it has a higher resolution and attends to the detailed information of the output. Overall, the images of the simulation scene output by the generative adversarial network attend to both the whole and the details.
  • the architecture of the generative adversarial network is shown in FIG. 6; it consists of N (N ≥ 3) nested generators, denoted from the inside out as generator 1, generator 2, ..., generator N.
  • each generator includes a convolutional layer and a deconvolutional layer.
  • the input information of the generative adversarial network is the semantic segmentation information, the instance segmentation information, and the feature map.
  • the output information of the generative adversarial network is the image of the simulation scene, that is, the output of the deconvolutional layer of generator N is the image of the simulation scene.
  • in FIG. 6, taking N = 3 as an example, the input information of the generative adversarial network is input into the convolutional layer of generator N.
  • the input information of the generative adversarial network is down-sampled and then input into the convolutional layer of generator 2.
  • the input information of the generative adversarial network is down-sampled again and then input into the convolutional layer of generator 1.
  • the purpose of down-sampling is to reduce the resolution, for example by a ratio of 1/2 × 1/2; if the output of generator N is at 200 × 200 resolution, then the output of generator 2 is at 100 × 100 resolution and the output of generator 1 is at 50 × 50 resolution. It can be seen that generator N has a high resolution and attends more to details, while generator 2 and generator 1 have lower resolutions and attend more to the whole. Therefore, the high-definition images output by the generative adversarial network are more plausible, attending to both the whole and the details.
  • the generative adversarial network, the embedding network, the conditioning augmentation network, and the self-encoding network are obtained through joint training.
  • the joint training may include: acquiring semantic segmentation information, instance segmentation information, instance text information, and sample images of a sample scene; and then performing joint training based on the semantic segmentation information, the instance segmentation information, the instance text information, and the sample images.
  • the generative adversarial network, embedding network, conditioning augmentation network, and self-encoding network are jointly trained based on the semantic segmentation information, instance segmentation information, instance text information, and sample images, specifically as follows:
  • the instance text information is input into the embedding network, and the output of the embedding network is passed through the conditioning augmentation network to obtain the processing result; the sample images are input into the encoder of the self-encoding network for encoding to obtain the hidden variable corresponding to each sample image; the self-encoding network samples the hidden variable corresponding to each sample image to obtain the sampling result; the decoder of the self-encoding network decodes the processing result and the sampling result to generate the feature map; the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layer of the outermost generator of the generative adversarial network and, after down-sampling, into the convolutional layers of the inner generators; the deconvolutional layer of the outermost generator outputs a generated image; and the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discriminator network to complete the training.
  • the generated images output by the generative adversarial network are fake pictures used as training data.
  • their feature values are marked as "fake" to indicate that they are randomly generated pictures rather than real pictures.
  • in contrast, the sample images are real pictures, and their feature values can be marked as "real".
  • the generated image, sample images, semantic segmentation information, instance segmentation information, and feature map are passed through the discriminator network and iterated continuously, so that the discriminator network can judge real pictures and fake pictures more and more accurately, thereby providing feedback to the generative adversarial network and enabling it to generate pictures that pass for real.
  • since generative adversarial networks have been shown to converge, training can continue while the judgment probability values of the discriminators have not all converged to 0.5, until the training target is met after multiple iterations.
  • the "training target" may be a preset target specifying whether the images generated by the generative adversarial network meet the requirements.
  • owing to the convergence property of the objective function, the training target of the generative adversarial network may be, for example, that the predicted feature value of a picture meets a specified requirement, for example that it is close to 0.5; training is stopped once convergence to 0.5 is determined to be satisfied.
  • the discriminator network is composed of multiple cascaded discriminators; the input of the uppermost discriminator is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map; the generated image, sample image, semantic segmentation information, instance segmentation information, and feature map are down-sampled and then input into the lower-level discriminators; the down-sampling factors corresponding to discriminators at different levels may differ.
  • the architecture of the discriminator network is shown in FIG. 7; it is composed of N (N ≥ 3) cascaded discriminators, denoted from top to bottom as discriminator 1, discriminator 2, ..., discriminator N.
  • the input information of the discriminator network is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map.
  • the output information of the discriminator network is the judgment probability values.
  • the input information of the discriminator network is input into discriminator 1.
  • the input information of the discriminator network is down-sampled and then input into discriminator 2.
  • the input information of the discriminator network is down-sampled again and then input into discriminator N. Taking N = 3 as an example, if the judgment probability values output by discriminator 1, discriminator 2, and discriminator 3 all converge to 0.5, the joint training ends.
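  • To tie the training description together, the following compact sketch runs an adversarial update with tiny linear stand-ins for the generative network and the discriminator network and checks whether the discriminator's judgment probability on generated images approaches 0.5; the stand-in modules, input sizes, optimizer settings, and the 0.05 tolerance are illustrative assumptions, and a real implementation would use the generator and discriminator sketches above with full-resolution inputs.

```python
import torch
import torch.nn as nn

# Tiny stand-ins so the adversarial update and the 0.5 convergence check run
# end to end on 8x8 inputs; these are not the patent's networks.
G = nn.Sequential(nn.Flatten(), nn.Linear(25 * 8 * 8, 3 * 8 * 8), nn.Tanh())
D = nn.Sequential(nn.Flatten(), nn.Linear((3 + 25) * 8 * 8, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(cond, real):
    """cond = concatenated segmentation + feature map; real = sample image."""
    fake = G(cond).view(-1, 3, 8, 8)

    # discriminator step: sample images are labelled "real", generated ones "fake"
    d_real = D(torch.cat([real, cond], dim=1))
    d_fake = D(torch.cat([fake.detach(), cond], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # generator step: try to make the discriminator judge generated images as real
    d_fake = D(torch.cat([fake, cond], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return d_fake.mean().item()

cond = torch.randn(4, 25, 8, 8)   # stands in for semantic + instance + feature channels
real = torch.randn(4, 3, 8, 8)    # stands in for the sample images
prob = 0.0
for step in range(200):
    prob = training_step(cond, real)
    if abs(prob - 0.5) < 0.05:    # judgment probabilities converging toward 0.5
        break
print("last judgment probability on generated images:", round(prob, 3))
```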
  • the embodiments of the present disclosure also provide a non-transitory computer-readable storage medium that stores a program or instruction, and the program or instruction causes a computer to execute the steps of the embodiments of the image generation method of a simulation scene; to avoid repetition, the details are not described again here.
  • only the scene white model needs to be established; then, based on the semantic segmentation information and instance segmentation information of the scene white model, an image of the simulated scene can be generated without refining color, texture, and lighting attributes during scene creation,
  • which improves generation efficiency. Moreover, the instance text information is editable, and different instance text information describes the attributes of different instances and corresponds to different instances, which diversifies the simulation scene. The solution therefore has industrial applicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An image generation method for a simulation scene, an electronic device, and a storage medium. The method includes: acquiring semantic segmentation information and instance segmentation information of a scene white model (401); receiving instance text information of the scene white model, the instance text information being editable information used to describe attributes of instances (402); and generating an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network (403). In the method, only the scene white model needs to be built; an image of the simulation scene can then be generated based on the semantic segmentation information and instance segmentation information of the scene white model, without refining attributes such as color, texture, and lighting during scene construction, which improves generation efficiency. Moreover, the instance text information is editable, and different instance text information describes the attributes of different instances and corresponds to different instances, which diversifies the simulation scene.

Description

一种仿真场景的图像生成方法、电子设备和存储介质 技术领域
本公开实施例涉及技术领域,具体涉及一种仿真场景的图像生成方法、电子设备和存储介质。
背景技术
仿真模拟是目前智能驾驶、机器人等人工智能技术研发过程中技术探索和技术验证测试的重要环节,尤其是在目前的智能驾驶领域,仿真场景可以产生海量的训练数据训练计算机视觉算法(目标检测识别、分割、跟踪等)和决策算法(模仿学习和强化学习等),以及提供后期几乎无限制的算法验证测试场景。
对于计算机视觉算法进行仿真场景的训练和验证,需要搭建仿真场景,然而,目前仿真场景的搭建过程为:首先是花费大量的人力物力去现场测绘,然后根据测绘数据在仿真引擎中手工建立模型并细化颜色、纹理、光照等细节。可见,仿真场景的搭建过程繁琐,费时费力,效率低,并且搭建的仿真场景可扩展性差且仿真引擎渲染对设备硬件软件要求高。
上述对问题的发现过程的描述,仅用于辅助理解本公开的技术方案,并不代表承认上述内容是现有技术。
发明内容
为了解决现有技术存在的至少一个问题,本公开的至少一个实施例提供了一种仿真场景的图像生成方法、电子设备和存储介质。
第一方面,本公开实施例提出一种仿真场景的图像生成方法,所述方法包括:
获取场景白模的语义分割信息和实例分割信息;
接收所述场景白模的实例文本信息;所述实例文本信息为可编辑的信息,且用于描述实例的属性;
基于所述语义分割信息、所述实例分割信息、所述实例文本信息和预先训练的生成对抗网络,生成仿真场景的图像。
第二方面,本公开实施例还提出一种电子设备,包括:处理器和存储器;所述处 理器通过调用所述存储器存储的程序或指令,用于执行如第一方面所述方法的步骤。
第三方面,本公开实施例还提出一种非暂态计算机可读存储介质,用于存储程序或指令,所述程序或指令使计算机执行如第一方面所述方法的步骤
可见,本公开实施例的至少一个实施例中,只需建立场景白模,进而基于场景白模的语义分割信息和实例分割信息,可生成仿真场景的图像,无需在场景建立过程中细化颜色、纹理、光照等属性,提高生成效率;并且,实例文本信息可编辑,不同实例文本信息描述不同实例的属性,对应不同的实例,使得仿真场景多样化。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。
图1是本公开实施例提供的一种仿真场景示意图;
图2是本公开实施例提供的一种电子设备的框图;
图3是本公开实施例提供的一种仿真场景图像生成系统;
图4是本公开实施例提供的一种仿真场景的图像生成方法流程图;
图5是本公开实施例提供的一种自编码网络的架构图;
图6是本公开实施例提供的一种生成对抗网络的架构图;
图7是本公开实施例提供的一种判别网络的架构图。
具体实施方式
为了能够更清楚地理解本公开的上述目的、特征和优点,下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。此处所描述的具体实施例仅仅用于解释本公开,而非对本公开的限定。基于所描述的本公开的实施例,本领域普通技术人员所获得的所有其他实施例,都属于本公开保护的范围。
需要说明的是,在本文中,诸如“第一”和“第二”等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。
针对目前仿真场景的搭建过程为:首先是花费大量的人力物力去现场测绘,然后根据测绘数据在仿真引擎中手工建立模型并细化颜色、纹理、光照等细节。可见,仿真场景的搭建过程繁琐,费时费力,效率低,并且搭建的仿真场景可扩展性差且仿真引擎渲染对设备硬件软件要求高。本公开实施例提供一种仿真场景的图像生成方案,只需建立场景白模,进而基于场景白模的语义分割信息和实例分割信息,可生成仿真场景的图像,无需在场景建立过程中细化颜色纹理光照等属性,提高生成效率;并且,实例文本信息可编辑,不同实例文本信息描述不同实例的属性,对应不同的实例,使得仿真场景多样化。
在一些实施例中,本公开实施例提供的仿真场景的图像生成方案,可应用于电子设备。仿真场景例如为智能驾驶仿真场景,仿真场景例如为仿真引擎生成的仿真场景。在一些实施例中,仿真引擎可包括但不限于:虚幻引擎(Unreal Engine)、Unity等。
图1为本公开实施例提供的一种仿真场景示意图,如图1所示,仿真场景中可包括但不限于:绿化带、人行道、机动车道、路灯、树木以及真实环境中的其他设施等静态对象;以及至少一辆虚拟车辆101、智能驾驶车辆102、行人以及其他动态对象。
虚拟车辆101可包括:寻路系统以及其他用于行驶的系统。在一些实施例中,虚拟车辆101可包括:寻路系统、感知系统、决策系统、控制系统以及其他用于行驶的系统。
寻路系统用于构建路网拓扑结构,并基于构建的构建路网拓扑结构进行寻路。在一些实施例中,寻路系统用于获取高精度地图,并基于高精度地图,构建路网拓扑结构。其中,高精度地图为智能驾驶领域中使用的地理地图,且高精度地图为描述仿真场景的地图。高精度地图与传统地图相比,不同之处在于:1)高精度地图包括大量的驾驶辅助信息,例如依托道路网的精确三维表征:包括交叉路口局和路标位置等;2)高精度地图还包括大量的语义信息,例如报告交通灯上不同颜色的含义,又例如指示道路的速度限制,以及左转车道开始的位置;3)高精度地图能达到厘米级的精度,确保智能驾驶车辆的安全行驶。因此,寻路系统生成的寻路路径可以为决策系统提供更加丰富的规划决策依据,例如当前位置的车道数目,宽度,朝向,各种交通附属物的位置等。
感知系统用于进行碰撞检测(Collision Detection)。在一些实施例中,感知系统用 于感知仿真场景中的障碍物。
决策系统用于基于寻路系统生成的寻路路径、感知系统感知的障碍物和虚拟车辆101的运动学信息,通过预设的行为树(Behavior Tree),决策虚拟车辆101的驾驶行为。其中,运动学信息例如包括但不限于速度、加速度和其他与运动相关的信息。
控制系统用于基于决策系统决策的驾驶行为,控制虚拟车辆101行驶,并将虚拟车辆101的运动学信息反馈给决策系统。
在一些实施例中,虚拟车辆101中各系统的划分仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如,寻路系统的功能可集成到感知系统、决策系统或控制系统中;任意两个或两个以上系统也可以实现为一个系统;任意一个系统也可以划分为多个子系统。可以理解的是,各个系统或子系统能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能。
智能驾驶车辆102至少包括:传感器组和智能驾驶系统。传感器组用于采集车辆外界环境的数据和探测车辆的位置数据。在一些实施例中,传感器组还用于采集车辆的动力学数据。智能驾驶系统用于获取传感器组的数据,基于传感器组的数据进行环境感知和车辆定位,并基于环境感知信息和车辆定位信息进行路径规划和决策,以及基于规划的路径生成车辆控制指令,从而控制车辆按照规划路径行驶。
需要说明的是,由于虚拟车辆101和智能驾驶车辆102均为仿真场景中生成的,并非真实车辆,因此,虚拟车辆101和智能驾驶车辆102可由后台处理器来控制行驶,后台处理器可以是服务器、计算机、平板电脑等具有处理功能的硬件设备。
图2为本公开实施例提供的一种电子设备的框图。电子设备可支持仿真系统的运行。其中,仿真系统可提供仿真场景并生成虚拟车辆以及提供其他用于仿真的功能。仿真系统可以为基于仿真引擎的仿真系统。
如图2所示,电子设备包括:至少一个处理器201、至少一个存储器202和至少一个通信接口203。电子设备中的各个组件通过总线系统204耦合在一起。通信接口203,用于与外部设备之间的信息传输。可理解,总线系统204用于实现这些组件之间的通信连接。总线系统204除包括数据总线之外,还包括电源总线、控制总线和状态信号 总线。但为了清楚说明起见,在图2中将各种总线都标为总线系统204。
可以理解,本实施例中的存储器202可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。
在一些实施方式中,存储器202存储了如下的元素,可执行单元或者数据结构,或者他们的子集,或者他们的扩展集:操作系统和应用程序。
其中,操作系统,包含各种系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务。应用程序,包含各种应用程序,例如媒体播放器(Media Player)、浏览器(Browser)等,用于实现各种应用业务。实现本公开实施例提供的仿真场景的图像生成方法的程序可以包含在应用程序中。
在本公开实施例中,处理器201通过调用存储器202存储的程序或指令,具体的,可以是应用程序中存储的程序或指令,处理器201用于执行本公开实施例提供的仿真场景的图像生成方法各实施例的步骤。
本公开实施例提供的仿真场景的图像生成方法可以应用于处理器201中,或者由处理器201实现。处理器201可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器201中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器201可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
本公开实施例提供的仿真场景的图像生成方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件单元组合执行完成。软件单元可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器202,处理器201读取存储器202中的信息,结合其硬件完成方法的步骤。
图3为本公开实施例提供的一种仿真场景图像生成系统300的框图。在一些实施例中,仿真场景图像生成系统300可以实现为图2所示的电子设备中运行的系统,或者电子设备中运行的仿真系统的一部分。在一些实施例中,仿真场景图像生成系统可 存储在图2所示的电子设备的存储器202中。图2中处理器201通过调用存储器202存储的仿真场景图像生成系统300,实现仿真场景图像生成系统300包括的各单元的功能。在一些实施例中,仿真场景图像生成系统300可应用于图2所示的电子设备的处理器201中,或者由处理器201实现。仿真场景图像生成系统300的各单元可以通过处理器201中的硬件的集成逻辑电路或者软件形式的指令完成。
如图3所示,仿真场景图像生成系统300可划分为多个单元,例如可包括但不限于:获取单元301、接收单元302和生成单元303。
获取单元301,用于获取场景白模的语义分割信息和实例分割信息。其中,场景白模可以理解为没有添加颜色、纹理、光照等属性信息的场景模型。在一些实施例中,场景白模由仿真引擎建立,且场景白模的语义分割信息和实例分割信息由所述仿真引擎基于场景白模生成。例如,场景白模由人工在仿真引擎中建立,且无需人工添加颜色、纹理、光照等属性信息;仿真引擎基于场景白模可自动生成语义分割信息和实例分割信息。
在一些实施例中,语义分割信息用于区分或描述仿真场景中不同类别的物体:人、车、动物、建筑等;实例分割信息用于区分或描述仿真场景中每个物体:不同的人、不同的车、不同的动物、不同的建筑等。也即,对于仿真场景中的一个物体,语义分割信息表明该物体是人还是车;如果是车,实例分割信息表明该车是奥迪还是大众;实例文本信息表明该车是白车还是黑车。
The receiving unit 302 is configured to receive the instance text information of the scene white model; the instance text information is editable information and is used to describe the attributes of an instance. By changing the content of the instance text information, the attributes of the instance are edited, and different instance attributes correspond to different instances. In some embodiments, the instance text information of the scene white model is input manually, and its content can be edited during manual input; the receiving unit 302 receives the manually input instance text information. In this embodiment, since instance text information is used to describe the attributes of an instance, the instance text information is set as editable information, which makes the attributes of the instance editable; the simulation scene is thus a scene whose instance attributes are editable. In some embodiments, the attributes of an instance may include, but are not limited to, color, texture, and lighting.
The generation unit 303 is configured to generate an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network (GAN). In some embodiments, the instance text information is not fed to the generative adversarial network directly; instead, the generation unit 303 generates a feature map based on the instance text information and at least one real image corresponding to the scene white model, where real images are provided only during training. The generation unit 303 then generates the image of the simulation scene through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map. In some embodiments, the generation unit 303 concatenates the semantic segmentation information, the instance segmentation information, and the feature map (essentially a vector concatenation, for example along the channel dimension, or an element-wise addition) and feeds the result into the pre-trained generative adversarial network to generate the image of the simulation scene.
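The channel-wise concatenation step might look like the following PyTorch sketch; the tensor shapes, the one-hot encoding of the label map, and the toy generator are illustrative assumptions rather than the exact network of the present disclosure.

    import torch
    import torch.nn.functional as F

    B, H, W = 1, 256, 256
    num_classes = 20

    semantic_map = torch.randint(0, num_classes, (B, H, W))          # per-pixel class IDs
    semantic_onehot = F.one_hot(semantic_map, num_classes).permute(0, 3, 1, 2).float()
    instance_edges = torch.rand(B, 1, H, W)                          # instance boundary channel
    feature_map = torch.rand(B, 3, H, W)                             # from text + encoded real image

    # Channel-wise concatenation of the three inputs.
    gan_input = torch.cat([semantic_onehot, instance_edges, feature_map], dim=1)
    print(gan_input.shape)  # torch.Size([1, 24, 256, 256])

    # A toy stand-in for the pre-trained generator.
    toy_generator = torch.nn.Conv2d(num_classes + 1 + 3, 3, kernel_size=3, padding=1)
    image = toy_generator(gan_input)
    print(image.shape)      # torch.Size([1, 3, 256, 256])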
In this embodiment, the feature map input to the generative adversarial network is used to adjust attributes of the instances in the scene, such as color, texture, and lighting. In addition, the image of the simulation scene generated by the generation unit 303 is a high-resolution image, and the simulation scene is a high-resolution scene, which facilitates technology exploration and verification testing during the research and development of artificial-intelligence technologies.
In some embodiments, the generation unit 303 generates the feature map based on the instance text information and at least one real image corresponding to the scene white model as follows: the instance text information is subjected to embedding processing and conditioning-augmentation processing to obtain a processing result; the at least one real image corresponding to the scene white model is encoded to obtain a latent variable corresponding to each real image, where a latent variable can be understood as an intermediate variable and one image corresponds to one latent variable; the latent variable corresponding to each real image is sampled to obtain a sampling result, where sampling of the latent variables adjusts the attribute information of the instances in the simulation scene and thereby diversifies the images of the simulation scene; and the processing result and the sampling result are decoded to generate the feature map.
In some embodiments, the generation unit 303 performs the embedding processing and conditioning-augmentation processing on the instance text information as follows: the instance text information is input into a pre-trained embedding network, and the output of the embedding network is passed through a pre-trained conditioning augmentation network to obtain the processing result. Both the embedding network and the conditioning augmentation network are neural networks whose parameters are obtained through pre-training.
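Conditioning augmentation is commonly implemented as a reparameterized Gaussian around the text embedding; the following PyTorch sketch shows that general idea, with the layer sizes and the TextConditioner module being assumptions for illustration rather than the networks of the present disclosure.

    import torch
    import torch.nn as nn

    class TextConditioner(nn.Module):
        """Embed instance text tokens, then apply conditioning augmentation."""
        def __init__(self, vocab_size=1000, embed_dim=128, cond_dim=64):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            # Conditioning augmentation: predict a mean and log-variance,
            # then sample a conditioning vector via the reparameterization trick.
            self.to_mu = nn.Linear(embed_dim, cond_dim)
            self.to_logvar = nn.Linear(embed_dim, cond_dim)

        def forward(self, token_ids):
            text_feat = self.embedding(token_ids).mean(dim=1)   # (B, embed_dim)
            mu = self.to_mu(text_feat)
            logvar = self.to_logvar(text_feat)
            eps = torch.randn_like(mu)
            return mu + eps * torch.exp(0.5 * logvar)           # (B, cond_dim)

    tokens = torch.randint(0, 1000, (2, 6))   # e.g. "white car", "black car" as token IDs
    cond = TextConditioner()(tokens)
    print(cond.shape)                          # torch.Size([2, 64])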
In some embodiments, the generation unit 303 inputs the at least one real image corresponding to the scene white model into the encoder of a pre-trained autoencoder network for encoding, so as to obtain the latent variable corresponding to each real image; the autoencoder network samples the latent variable corresponding to each real image to obtain the sampling result; and the decoder of the autoencoder network decodes the processing result and the sampling result to generate the feature map. In some embodiments, the autoencoder network is a variational autoencoder network.
In some embodiments, the architecture of the autoencoder network is shown in FIG. 5 and includes convolutional layers and deconvolutional layers, where the convolutional layers can be understood as the encoder of the autoencoder network and the deconvolutional layers as its decoder. The input of the autoencoder network is the at least one real image corresponding to the scene white model, i.e. the input of the convolutional layers of the autoencoder network is the at least one real image corresponding to the scene white model. The output of the autoencoder network is the feature map, i.e. the output of the deconvolutional layers of the autoencoder network is the feature map.
In FIG. 5, the instance text information is input into the pre-trained embedding network, whose output is a set of low-dimensional vectors; the output of the embedding network is passed through the pre-trained conditioning augmentation network to obtain the processing result. The autoencoder network samples the latent variable corresponding to each real image to obtain the sampling result. The processing result and the sampling result are concatenated (essentially a vector concatenation, for example along the channel dimension, or an element-wise addition) and then input into the deconvolutional layers of the autoencoder network for decoding, generating the feature map.
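A minimal sketch of such an encoder/decoder, assuming a variational formulation and fusing the sampled latent with the text conditioning vector before decoding, is given below; the layer sizes and the FeatureMapVAE module are assumptions for illustration, not the architecture of FIG. 5.

    import torch
    import torch.nn as nn

    class FeatureMapVAE(nn.Module):
        """Encode a real image to a latent, sample it, fuse it with the text
        conditioning vector, and decode to a feature map."""
        def __init__(self, cond_dim=64, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(                       # "convolutional layers"
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            )
            self.to_mu = nn.Conv2d(64, latent_dim, 1)
            self.to_logvar = nn.Conv2d(64, latent_dim, 1)
            self.decoder = nn.Sequential(                       # "deconvolutional layers"
                nn.ConvTranspose2d(latent_dim + cond_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
            )

        def forward(self, real_image, cond_vec):
            h = self.encoder(real_image)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sampled latent
            # Broadcast the text conditioning vector spatially and concatenate channel-wise.
            cond = cond_vec[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
            return self.decoder(torch.cat([z, cond], dim=1))          # feature map

    real = torch.rand(2, 3, 64, 64)
    cond = torch.rand(2, 64)
    feat = FeatureMapVAE()(real, cond)
    print(feat.shape)   # torch.Size([2, 3, 64, 64])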
In some embodiments, the generative adversarial network used in the generation unit 303 includes a generation network and a discrimination network, where the generation network is composed of multiple nested generators; each generator includes convolutional layers and deconvolutional layers, and the feature-map output of the last deconvolutional layer of an inner nested generator serves as an input to the deconvolutional layers of the generator nested outside it.
It should be noted that, since the discrimination network is mainly used for training the generation network and the trained generation network generates images of the simulation scene on its own, the term "generative adversarial network" is used in place of "generation network" when describing the functions of the generation network hereinafter; that is, where the generative adversarial network is said to generate an image of the simulation scene, those skilled in the art will understand that the image is generated by the generation network of the generative adversarial network. Mentioning the discrimination network separately when describing the joint training does not mean that the discrimination network is not part of the generative adversarial network.
In some embodiments, the generation unit 303 is specifically configured to: input the semantic segmentation information, the instance segmentation information, and the feature map into the convolutional layers of the outermost generator of the generative adversarial network; downsample the semantic segmentation information, the instance segmentation information, and the feature map and input them into the convolutional layers of the inner generators of the generative adversarial network; and output the image of the simulation scene from the deconvolutional layers of the outermost generator of the generative adversarial network. The downsampling factors corresponding to different inner generators may differ. In this embodiment, the inputs of the inner nested generators are downsampled so that their outputs have reduced resolution and thus focus on the global information of the output. The output of the deconvolutional layers of the outermost generator, i.e. the output of the generative adversarial network, has a higher resolution and focuses on the details of the output. Taken together, the image of the simulation scene output by the generative adversarial network attends to both the global structure and the details.
In some embodiments, the architecture of the generative adversarial network is shown in FIG. 6 and consists of N (N≥3) nested generators, denoted from the inside out as generator 1, generator 2, ..., generator N. Each generator includes convolutional layers and deconvolutional layers. The input of the generative adversarial network is the semantic segmentation information, the instance segmentation information, and the feature map. The output of the generative adversarial network is the image of the simulation scene, i.e. the output of the deconvolutional layers of generator N is the image of the simulation scene.
In FIG. 6, taking N=3 as an example, the input of the generative adversarial network is fed into the convolutional layers of generator N; it is downsampled and fed into the convolutional layers of generator 2; and it is downsampled again and fed into the convolutional layers of generator 1. The purpose of downsampling is to reduce the resolution, for example by a factor of 1/2×1/2: if the output of generator N has a resolution of 200×200, then the output of generator 2 has a resolution of 100×100 and the output of generator 1 has a resolution of 50×50. It can be seen that generator N has a high resolution and focuses more on details, while generators 2 and 1 have lower resolutions and focus more on the global structure. The high-definition image output by the generative adversarial network is therefore more plausible, attending to both the whole and the details.
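A minimal PyTorch sketch of a coarse-to-fine, nested generator in the spirit of the N=3 example above follows; the layer widths and the way the inner feature map is fused into the outer generator are illustrative assumptions, not the exact architecture of FIG. 6.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScaleGenerator(nn.Module):
        """One generator: convolutional (encoding) and deconvolutional (decoding) layers."""
        def __init__(self, in_ch, inner_ch=0):
            super().__init__()
            self.conv = nn.Sequential(nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU())
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(64 + inner_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1),
            )

        def forward(self, x, inner_feat=None):
            h = self.conv(x)
            if inner_feat is not None:
                # Fuse the inner generator's output feature map into this generator's deconv input.
                inner_feat = F.interpolate(inner_feat, size=h.shape[2:], mode="nearest")
                h = torch.cat([h, inner_feat], dim=1)
            return self.deconv(h)

    in_ch = 24                                   # semantic + instance + feature-map channels
    gen1 = ScaleGenerator(in_ch)                 # innermost, coarsest
    gen2 = ScaleGenerator(in_ch, inner_ch=3)
    gen3 = ScaleGenerator(in_ch, inner_ch=3)     # outermost, finest

    x = torch.rand(1, in_ch, 200, 200)           # full-resolution GAN input
    x_half = F.avg_pool2d(x, 2)                  # 100x100 input for generator 2
    x_quarter = F.avg_pool2d(x, 4)               # 50x50 input for generator 1

    out1 = gen1(x_quarter)                       # 50x50
    out2 = gen2(x_half, inner_feat=out1)         # 100x100
    out3 = gen3(x, inner_feat=out2)              # 200x200 simulation-scene image
    print(out1.shape, out2.shape, out3.shape)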
In some embodiments, the generative adversarial network, the embedding network, the conditioning augmentation network, and the autoencoder network used by the generation unit 303 are obtained through joint training. In some embodiments, the joint training may include: obtaining the semantic segmentation information, instance segmentation information, instance text information, and sample images of a sample scene; and then performing the joint training based on the semantic segmentation information, instance segmentation information, instance text information, and sample images.
In some embodiments, the joint training of the generative adversarial network, the embedding network, the conditioning augmentation network, and the autoencoder network used by the generation unit 303, based on the semantic segmentation information, instance segmentation information, instance text information, and sample images, is specifically as follows:
the instance text information is input into the embedding network, and the output of the embedding network is passed through the conditioning augmentation network to obtain the processing result; the sample images are input into the encoder of the autoencoder network for encoding to obtain the latent variable corresponding to each sample image; the autoencoder network samples the latent variable corresponding to each sample image to obtain the sampling result; the decoder of the autoencoder network decodes the processing result and the sampling result to generate the feature map; the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layers of the outermost generator of the generative adversarial network; the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and input into the convolutional layers of the inner generators of the generative adversarial network; the deconvolutional layers of the outermost generator of the generative adversarial network output a generated image; and the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discrimination network to complete the training.
In some embodiments, the generated images output by the generative adversarial network are fake pictures; as training data, their feature values are labeled "fake" to indicate that they are randomly generated pictures rather than real ones. In contrast, the sample images are actually captured pictures, whose feature values may be labeled "real". The generated images, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discrimination network iteratively, so that the discrimination network distinguishes real pictures from fake ones more and more accurately, thereby providing feedback to the generative adversarial network so that it generates pictures that can pass for real.
Since generative adversarial networks have been shown to converge, training can continue while the discrimination probability output by each discriminator has not yet converged to 0.5, until the training objective is met after many iterations. The "training objective" may be a preset objective on whether the pictures generated by the generative adversarial network meet the requirements. In one embodiment, owing to the convergence property of the objective function, the training objective of the generative adversarial network may be, for example, that the predicted feature values of the pictures meet a specified requirement, e.g. are close to 0.5. Training stops once convergence to 0.5 is determined.
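For illustration, a standard adversarial objective with "real"/"fake" labels can be written in a few lines of PyTorch; the stand-in generator and discriminator below are toy modules, and the binary cross-entropy loss is one common choice assumed here rather than taken from the present disclosure.

    import torch
    import torch.nn as nn

    # Toy stand-ins: a generator mapping a conditioning tensor to an image,
    # and a discriminator mapping an image to a probability of being real.
    G = nn.Sequential(nn.Conv2d(8, 3, 3, padding=1))
    D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid())
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    cond = torch.rand(4, 8, 32, 32)          # segmentation + feature-map condition
    real = torch.rand(4, 3, 32, 32)          # sample images labeled "real"

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(cond).detach()
    loss_d = bce(D(real), torch.ones(4, 1)) + bce(D(fake), torch.zeros(4, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: push D(G(cond)) toward 1, i.e. fool the discriminator.
    loss_g = bce(D(G(cond)), torch.ones(4, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # At equilibrium the discriminator outputs probabilities near 0.5 for both real and fake inputs.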
In some embodiments, the discrimination network is composed of multiple cascaded discriminators. The input of the topmost discriminator is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map; the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and then input into the lower-level discriminators, where the downsampling factors of discriminators at different levels may differ.
In some embodiments, the architecture of the discrimination network is shown in FIG. 7 and consists of N (N≥3) cascaded discriminators, denoted from top to bottom as discriminator 1, discriminator 2, ..., discriminator N. The input of the discrimination network is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map. The output of the discrimination network is the discrimination probability.
In FIG. 7, taking N=3 as an example, the input of the discrimination network is fed into discriminator 1; it is downsampled and fed into discriminator 2; and it is downsampled again and fed into discriminator N. When the discrimination probabilities output by discriminator 1, discriminator 2, and discriminator 3 all converge to 0.5, the joint training ends.
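A minimal sketch of such a cascade of discriminators operating on progressively downsampled inputs might look as follows; the layer sizes and the PatchDiscriminator module are illustrative assumptions, not the discriminators of FIG. 7.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchDiscriminator(nn.Module):
        """One discriminator level mapping its input to a probability of being real."""
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 4, stride=2, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid(),
            )
        def forward(self, x):
            return self.net(x)   # (B, 1) probability

    in_ch = 3 + 24          # image channels + segmentation/feature-map condition channels
    discriminators = nn.ModuleList([PatchDiscriminator(in_ch) for _ in range(3)])

    image = torch.rand(1, 3, 200, 200)
    condition = torch.rand(1, 24, 200, 200)
    x = torch.cat([image, condition], dim=1)

    probs = []
    for level, d in enumerate(discriminators):
        # Discriminator 1 sees the full resolution; lower levels see downsampled copies.
        probs.append(d(F.avg_pool2d(x, 2 ** level) if level > 0 else x))
    print([p.shape for p in probs])   # three (1, 1) probabilities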
In some embodiments, the simulation-scene image generation system 300 may be a software system, a hardware system, or a combination of software and hardware. For example, the simulation-scene image generation system 300 is a software system running on an operating system, and the hardware system of the electronic device is a hardware system supporting the running of the operating system.
In some embodiments, the division of the units in the simulation-scene image generation system 300 is merely a division of logical functions; other divisions are possible in an actual implementation. For example, the obtaining unit 301, the receiving unit 302, and the generation unit 303 may be implemented as a single unit; or the obtaining unit 301, the receiving unit 302, or the generation unit 303 may be divided into multiple subunits. It can be understood that each unit or subunit can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or by software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application.
FIG. 4 is a flowchart of an image generation method for a simulation scene provided by an embodiment of the present disclosure. The method is executed by an electronic device; in some embodiments, the method is executed by a simulation-scene image generation system running in the electronic device, or by a simulation system running in the electronic device, where the simulation-scene image generation system may be part of the simulation system.
As shown in FIG. 4, the image generation method for a simulation scene may include, but is not limited to, the following steps 401 to 403:
401. Obtain the semantic segmentation information and instance segmentation information of a scene white model. A scene white model can be understood as a scene model to which attribute information such as color, texture, and lighting has not been added. In some embodiments, the scene white model is built by a simulation engine, and the semantic segmentation information and instance segmentation information of the scene white model are generated by the simulation engine based on the scene white model. For example, the scene white model is built manually in the simulation engine without manually adding attribute information such as color, texture, and lighting, and the simulation engine can automatically generate the semantic segmentation information and instance segmentation information from the scene white model.
In some embodiments, the semantic segmentation information is used to distinguish or describe objects of different categories in the simulation scene, such as persons, vehicles, animals, and buildings; the instance segmentation information is used to distinguish or describe each individual object in the simulation scene, such as different persons, different vehicles, different animals, and different buildings. That is, for an object in the simulation scene, the semantic segmentation information indicates whether the object is a person or a vehicle; if it is a vehicle, the instance segmentation information indicates whether the vehicle is an Audi or a Volkswagen; and the instance text information indicates whether the vehicle is white or black.
402. Receive the instance text information of the scene white model; the instance text information is editable information and is used to describe the attributes of an instance. By changing the content of the instance text information, the attributes of the instance are edited, and different instance attributes correspond to different instances. In some embodiments, the instance text information of the scene white model is input manually, and its content can be edited during manual input; in step 402, the manually input instance text information is received. In this embodiment, since instance text information is used to describe the attributes of an instance, the instance text information is set as editable information, which makes the attributes of the instance editable; the simulation scene is thus a scene whose instance attributes are editable. In some embodiments, the attributes of an instance may include, but are not limited to, color, texture, and lighting.
403. Generate an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network. In some embodiments, the instance text information is not fed to the generative adversarial network directly; instead, a feature map is generated based on the instance text information and at least one real image corresponding to the scene white model. The image of the simulation scene is then generated through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map. In some embodiments, the semantic segmentation information, the instance segmentation information, and the feature map are concatenated (essentially a vector concatenation) and then input into the pre-trained generative adversarial network to generate the image of the simulation scene.
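Tying steps 401 to 403 together, an end-to-end call sequence could look like the following sketch; get_white_model_segmentation, receive_instance_text, build_feature_map, and the toy pretrained_generator are hypothetical placeholders introduced purely for illustration.

    import torch
    import torch.nn as nn

    # Hypothetical placeholders for the three steps; their signatures are assumptions.
    def get_white_model_segmentation():
        # Step 401: semantic one-hot map and instance boundary map from the simulation engine.
        return torch.rand(1, 20, 256, 256), torch.rand(1, 1, 256, 256)

    def receive_instance_text():
        # Step 402: editable instance text, e.g. edited from "white car" to "black car".
        return ["black car", "wet asphalt road"]

    def build_feature_map(instance_text):
        # Step 403 (first half): text embedding + conditioning augmentation + autoencoder decoding.
        return torch.rand(1, 3, 256, 256)

    pretrained_generator = nn.Conv2d(20 + 1 + 3, 3, 3, padding=1)   # toy stand-in

    semantic, instance = get_white_model_segmentation()             # step 401
    instance_text = receive_instance_text()                         # step 402
    feature_map = build_feature_map(instance_text)                  # step 403
    image = pretrained_generator(torch.cat([semantic, instance, feature_map], dim=1))
    print(image.shape)   # torch.Size([1, 3, 256, 256])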
In this embodiment, the feature map input to the generative adversarial network is used to adjust attributes of the instances in the scene, such as color, texture, and lighting. In addition, the generated image of the simulation scene is a high-resolution image, and the simulation scene is a high-resolution scene, which facilitates technology exploration and verification testing during the research and development of artificial-intelligence technologies.
In some embodiments, the feature map is generated based on the instance text information and at least one real image corresponding to the scene white model as follows: the instance text information is subjected to embedding processing and conditioning-augmentation processing to obtain a processing result; the at least one real image corresponding to the scene white model is encoded to obtain a latent variable corresponding to each real image, where a latent variable can be understood as an intermediate variable and one image corresponds to one latent variable; the latent variable corresponding to each real image is sampled to obtain a sampling result, where sampling of the latent variables adjusts the attribute information of the instances in the simulation scene and thereby diversifies the images of the simulation scene; and the processing result and the sampling result are decoded to generate the feature map.
In some embodiments, the embedding processing and conditioning-augmentation processing of the instance text information are performed as follows: the instance text information is input into a pre-trained embedding network, and the output of the embedding network is passed through a pre-trained conditioning augmentation network to obtain the processing result. Both the embedding network and the conditioning augmentation network are neural networks whose parameters are obtained through pre-training.
In some embodiments, the at least one real image corresponding to the scene white model is input into the encoder of a pre-trained autoencoder network for encoding to obtain the latent variable corresponding to each real image; the autoencoder network samples the latent variable corresponding to each real image to obtain the sampling result; and the decoder of the autoencoder network decodes the processing result and the sampling result to generate the feature map. In some embodiments, the autoencoder network is a variational autoencoder network.
In some embodiments, the architecture of the autoencoder network is shown in FIG. 5 and includes convolutional layers and deconvolutional layers, where the convolutional layers can be understood as the encoder of the autoencoder network and the deconvolutional layers as its decoder. The input of the autoencoder network is the at least one real image corresponding to the scene white model, i.e. the input of the convolutional layers of the autoencoder network is the at least one real image corresponding to the scene white model. The output of the autoencoder network is the feature map, i.e. the output of the deconvolutional layers of the autoencoder network is the feature map.
In FIG. 5, the instance text information is input into the pre-trained embedding network, whose output is a set of low-dimensional vectors; the output of the embedding network is passed through the pre-trained conditioning augmentation network to obtain the processing result. The autoencoder network samples the latent variable corresponding to each real image to obtain the sampling result. The processing result and the sampling result are concatenated (essentially a vector concatenation) and then input into the deconvolutional layers of the autoencoder network for decoding, generating the feature map.
In some embodiments, the generative adversarial network includes a generation network and a discrimination network, where the generation network is composed of multiple nested generators; each generator includes convolutional layers and deconvolutional layers, and the feature-map output of the last deconvolutional layer of an inner nested generator serves as an input to the deconvolutional layers of the generator nested outside it.
It should be noted that, since the discrimination network is mainly used for training the generation network and the trained generation network generates images of the simulation scene on its own, the term "generative adversarial network" is used in place of "generation network" when describing the functions of the generation network hereinafter; that is, where the generative adversarial network is said to generate an image of the simulation scene, those skilled in the art will understand that the image is generated by the generation network of the generative adversarial network. Mentioning the discrimination network separately when describing the joint training does not mean that the discrimination network is not part of the generative adversarial network.
In some embodiments, the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layers of the outermost generator of the generative adversarial network; the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and input into the convolutional layers of the inner generators of the generative adversarial network; and the deconvolutional layers of the outermost generator of the generative adversarial network output the image of the simulation scene. The downsampling factors corresponding to different inner generators may differ. In this embodiment, the inputs of the inner nested generators are downsampled so that their outputs have reduced resolution and thus focus on the global information of the output. The output of the deconvolutional layers of the outermost generator, i.e. the output of the generative adversarial network, has a higher resolution and focuses on the details of the output. Taken together, the image of the simulation scene output by the generative adversarial network attends to both the global structure and the details.
In some embodiments, the architecture of the generative adversarial network is shown in FIG. 6 and consists of N (N≥3) nested generators, denoted from the inside out as generator 1, generator 2, ..., generator N. Each generator includes convolutional layers and deconvolutional layers. The input of the generative adversarial network is the semantic segmentation information, the instance segmentation information, and the feature map. The output of the generative adversarial network is the image of the simulation scene, i.e. the output of the deconvolutional layers of generator N is the image of the simulation scene.
In FIG. 6, taking N=3 as an example, the input of the generative adversarial network is fed into the convolutional layers of generator N; it is downsampled and fed into the convolutional layers of generator 2; and it is downsampled again and fed into the convolutional layers of generator 1. The purpose of downsampling is to reduce the resolution, for example by a factor of 1/2×1/2: if the output of generator N has a resolution of 200×200, then the output of generator 2 has a resolution of 100×100 and the output of generator 1 has a resolution of 50×50. It can be seen that generator N has a high resolution and focuses more on details, while generators 2 and 1 have lower resolutions and focus more on the global structure. The high-definition image output by the generative adversarial network is therefore more plausible, attending to both the whole and the details.
In some embodiments, the generative adversarial network, the embedding network, the conditioning augmentation network, and the autoencoder network are obtained through joint training. In some embodiments, the joint training may include: obtaining the semantic segmentation information, instance segmentation information, instance text information, and sample images of a sample scene; and then performing the joint training based on the semantic segmentation information, instance segmentation information, instance text information, and sample images.
In some embodiments, the joint training of the generative adversarial network, the embedding network, the conditioning augmentation network, and the autoencoder network based on the semantic segmentation information, instance segmentation information, instance text information, and sample images is specifically as follows:
the instance text information is input into the embedding network, and the output of the embedding network is passed through the conditioning augmentation network to obtain the processing result; the sample images are input into the encoder of the autoencoder network for encoding to obtain the latent variable corresponding to each sample image; the autoencoder network samples the latent variable corresponding to each sample image to obtain the sampling result; the decoder of the autoencoder network decodes the processing result and the sampling result to generate the feature map; the semantic segmentation information, the instance segmentation information, and the feature map are input into the convolutional layers of the outermost generator of the generative adversarial network; the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and input into the convolutional layers of the inner generators of the generative adversarial network; the deconvolutional layers of the outermost generator of the generative adversarial network output a generated image; and the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discrimination network to complete the training.
In some embodiments, the generated images output by the generative adversarial network are fake pictures; as training data, their feature values are labeled "fake" to indicate that they are randomly generated pictures rather than real ones. In contrast, the sample images are actually captured pictures, whose feature values may be labeled "real". The generated images, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are passed through the discrimination network iteratively, so that the discrimination network distinguishes real pictures from fake ones more and more accurately, thereby providing feedback to the generative adversarial network so that it generates pictures that can pass for real.
Since generative adversarial networks have been shown to converge, training can continue while the discrimination probability output by each discriminator has not yet converged to 0.5, until the training objective is met after many iterations. The "training objective" may be a preset objective on whether the pictures generated by the generative adversarial network meet the requirements. In one embodiment, owing to the convergence property of the objective function, the training objective of the generative adversarial network may be, for example, that the predicted feature values of the pictures meet a specified requirement, e.g. are close to 0.5. Training stops once convergence to 0.5 is determined.
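A compact sketch of what one iteration of such a joint training loop might look like is given below; the placeholder modules (embed_and_augment, autoencoder_dec, generator, discriminator), the fixed toy tensors, and the stopping check against 0.5 are assumptions made for illustration only, not the training procedure of the present disclosure.

    import torch
    import torch.nn as nn

    # Toy placeholders for the jointly trained networks; real architectures differ.
    embed_and_augment = nn.Sequential(nn.Linear(16, 8))             # embedding + conditioning augmentation
    autoencoder_dec = nn.Sequential(nn.Linear(8 + 8, 3 * 32 * 32))  # decoder producing a feature map
    generator = nn.Conv2d(20 + 1 + 3, 3, 3, padding=1)
    discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                                  nn.Flatten(), nn.Sigmoid())
    bce = nn.BCELoss()
    params = (list(embed_and_augment.parameters()) + list(autoencoder_dec.parameters())
              + list(generator.parameters()))
    opt_g = torch.optim.Adam(params, lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    text_vec = torch.rand(1, 16)                 # instance text (already vectorised)
    latent = torch.rand(1, 8)                    # sampled latent of the sample image
    semantic = torch.rand(1, 20, 32, 32)
    instance = torch.rand(1, 1, 32, 32)
    sample_image = torch.rand(1, 3, 32, 32)

    for step in range(100):
        cond = embed_and_augment(text_vec)                                    # processing result
        feat = autoencoder_dec(torch.cat([cond, latent], dim=1)).view(1, 3, 32, 32)
        fake = generator(torch.cat([semantic, instance, feat], dim=1))

        # Discriminator update: real -> 1, fake -> 0.
        loss_d = bce(discriminator(sample_image), torch.ones(1, 1)) + \
                 bce(discriminator(fake.detach()), torch.zeros(1, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator-side update (embedding, autoencoder decoder and generator together).
        loss_g = bce(discriminator(fake), torch.ones(1, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()

        # Stop once the discriminator's output hovers around 0.5 for generated images.
        if abs(discriminator(fake.detach()).item() - 0.5) < 0.01:
            break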
In some embodiments, the discrimination network is composed of multiple cascaded discriminators. The input of the topmost discriminator is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map; the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and then input into the lower-level discriminators, where the downsampling factors of discriminators at different levels may differ.
In some embodiments, the architecture of the discrimination network is shown in FIG. 7 and consists of N (N≥3) cascaded discriminators, denoted from top to bottom as discriminator 1, discriminator 2, ..., discriminator N. The input of the discrimination network is the generated image, the sample image, the semantic segmentation information, the instance segmentation information, and the feature map. The output of the discrimination network is the discrimination probability.
In FIG. 7, taking N=3 as an example, the input of the discrimination network is fed into discriminator 1; it is downsampled and fed into discriminator 2; and it is downsampled again and fed into discriminator N. When the discrimination probabilities output by discriminator 1, discriminator 2, and discriminator 3 all converge to 0.5, the joint training ends.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of action combinations; however, those skilled in the art will understand that the embodiments of the present disclosure are not limited by the described order of actions, because according to the embodiments of the present disclosure, some steps may be performed in other orders or simultaneously (for example, "obtaining the semantic segmentation information and instance segmentation information of the scene white model" and "receiving the instance text information of the scene white model" may be performed simultaneously, or the instance text information of the scene white model may be received first and the semantic segmentation information and instance segmentation information of the scene white model obtained afterwards). In addition, those skilled in the art will understand that the embodiments described in the specification are all optional embodiments.
The embodiments of the present disclosure further provide a non-transitory computer-readable storage medium storing programs or instructions, where the programs or instructions cause a computer to execute the steps of the embodiments of the image generation method for a simulation scene; to avoid repetition, they are not described here again.
It should be noted that, as used herein, the terms "comprise", "include", or any of their variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
Those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present disclosure and form different embodiments.
Those skilled in the art will understand that the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
Although the embodiments of the present disclosure have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present disclosure, and such modifications and variations all fall within the scope defined by the appended claims.
Industrial Applicability
In the embodiments of the present disclosure, only a white model of the scene needs to be built, and an image of the simulation scene can then be generated based on the semantic segmentation information and instance segmentation information of the scene white model, without refining attributes such as color, texture, and lighting during scene construction, which improves generation efficiency. Furthermore, the instance text information is editable, and different instance text information describes the attributes of different instances and corresponds to different instances, so that the simulation scene is diversified. The present disclosure therefore has industrial applicability.

Claims (14)

  1. An image generation method for a simulation scene, characterized in that the method comprises:
    obtaining semantic segmentation information and instance segmentation information of a scene white model;
    receiving instance text information of the scene white model, the instance text information being editable information and being used to describe attributes of an instance;
    generating an image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and a pre-trained generative adversarial network.
  2. The method according to claim 1, characterized in that the scene white model is built by a simulation engine, and the semantic segmentation information and instance segmentation information of the scene white model are generated by the simulation engine based on the scene white model.
  3. The method according to claim 1, characterized in that generating the image of the simulation scene based on the semantic segmentation information, the instance segmentation information, the instance text information, and the pre-trained generative adversarial network comprises:
    generating a feature map based on the instance text information and at least one real image corresponding to the scene white model;
    generating the image of the simulation scene through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map.
  4. The method according to claim 3, characterized in that generating the feature map based on the instance text information and the at least one real image corresponding to the scene white model comprises:
    performing embedding processing and conditioning-augmentation processing on the instance text information to obtain a processing result;
    encoding the at least one real image corresponding to the scene white model to obtain a latent variable corresponding to each real image;
    sampling the latent variable corresponding to each real image to obtain a sampling result;
    decoding the processing result and the sampling result to generate the feature map.
  5. The method according to claim 4, characterized in that performing the embedding processing and conditioning-augmentation processing on the instance text information to obtain the processing result comprises:
    inputting the instance text information into a pre-trained embedding network, and passing the output of the embedding network through a pre-trained conditioning augmentation network to obtain the processing result.
  6. The method according to claim 5, characterized in that:
    the at least one real image corresponding to the scene white model is input into an encoder of a pre-trained autoencoder network for encoding to obtain the latent variable corresponding to each real image;
    the autoencoder network samples the latent variable corresponding to each real image to obtain the sampling result;
    a decoder of the autoencoder network decodes the processing result and the sampling result to generate the feature map.
  7. The method according to claim 3, characterized in that the generative adversarial network is composed of a plurality of nested generators, wherein each generator comprises convolutional layers and deconvolutional layers, and a last-layer feature-map output of the deconvolutional layers of an inner nested generator serves as an input of the deconvolutional layers of an outer nested generator.
  8. The method according to claim 7, characterized in that generating the image of the simulation scene through the pre-trained generative adversarial network based on the semantic segmentation information, the instance segmentation information, and the feature map comprises:
    inputting the semantic segmentation information, the instance segmentation information, and the feature map into the convolutional layers of an outermost generator of the generative adversarial network;
    downsampling the semantic segmentation information, the instance segmentation information, and the feature map and inputting them into the convolutional layers of inner generators of the generative adversarial network, wherein the downsampling factors corresponding to different inner generators are different;
    outputting the image of the simulation scene from the deconvolutional layers of the outermost generator of the generative adversarial network.
  9. The method according to claim 6, characterized in that the generative adversarial network, the embedding network, the conditioning augmentation network, and the autoencoder network are obtained through joint training.
  10. The method according to claim 9, characterized in that the joint training comprises:
    obtaining semantic segmentation information, instance segmentation information, instance text information, and sample images of a sample scene;
    performing the joint training based on the semantic segmentation information, instance segmentation information, instance text information, and sample images.
  11. The method according to claim 10, characterized in that performing the joint training based on the semantic segmentation information, instance segmentation information, instance text information, and sample images comprises:
    inputting the instance text information into the embedding network, and passing the output of the embedding network through the conditioning augmentation network to obtain a processing result;
    inputting the sample images into the encoder of the autoencoder network for encoding to obtain a latent variable corresponding to each sample image;
    sampling, by the autoencoder network, the latent variable corresponding to each sample image to obtain a sampling result;
    decoding, by the decoder of the autoencoder network, the processing result and the sampling result to generate a feature map;
    inputting the semantic segmentation information, the instance segmentation information, and the feature map into the convolutional layers of the outermost generator of the generative adversarial network;
    downsampling the semantic segmentation information, the instance segmentation information, and the feature map and inputting them into the convolutional layers of the inner generators of the generative adversarial network;
    outputting a generated image from the deconvolutional layers of the outermost generator of the generative adversarial network;
    passing the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map through a discrimination network to complete the training.
  12. The method according to claim 11, characterized in that the discrimination network is composed of a plurality of cascaded discriminators;
    the input of the topmost discriminator is the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map;
    the generated image, the sample images, the semantic segmentation information, the instance segmentation information, and the feature map are downsampled and then input into lower-level discriminators, wherein the downsampling factors corresponding to discriminators at different levels are different.
  13. An electronic device, characterized by comprising: a processor and a memory;
    wherein the processor is configured to execute the steps of the method according to any one of claims 1 to 12 by calling programs or instructions stored in the memory.
  14. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores programs or instructions, and the programs or instructions cause a computer to execute the steps of the method according to any one of claims 1 to 12.
PCT/CN2019/120408 2019-11-22 2019-11-22 Image generation method for simulation scene, electronic device and storage medium WO2021097845A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980002612.5A CN110998663B (zh) 2019-11-22 2019-11-22 Image generation method for simulation scene, electronic device and storage medium
PCT/CN2019/120408 WO2021097845A1 (zh) 2019-11-22 2019-11-22 Image generation method for simulation scene, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/120408 WO2021097845A1 (zh) 2019-11-22 2019-11-22 Image generation method for simulation scene, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2021097845A1 true WO2021097845A1 (zh) 2021-05-27

Family

ID=70080461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120408 WO2021097845A1 (zh) 2019-11-22 2019-11-22 Image generation method for simulation scene, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN110998663B (zh)
WO (1) WO2021097845A1 (zh)


Also Published As

Publication number Publication date
CN110998663A (zh) 2020-04-10
CN110998663B (zh) 2023-12-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19953638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19953638

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2022)