CN116563423A - Scene rendering method and device for fine-grained semantic control


Info

Publication number
CN116563423A
CN116563423A (Application CN202310346851.0A)
Authority
CN
China
Prior art keywords
network
feature
region
node
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310346851.0A
Other languages
Chinese (zh)
Inventor
余建兴
王世祺
董晓
张宇锋
崔岩
印鉴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
4Dage Co Ltd
Original Assignee
Sun Yat Sen University
4Dage Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, 4Dage Co Ltd filed Critical Sun Yat Sen University
Publication of CN116563423A

Classifications

    • G06T 11/60 Editing figures and text; Combining figures or text (2D [Two Dimensional] image generation)
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/094 Adversarial learning
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of scene rendering and discloses a scene rendering method with fine-grained semantic control. A dependency syntax tree is constructed in the semantic understanding unit to describe the fine-grained semantics of the requirement text, so that fine-grained information in the text is identified, the image region to be edited is located accurately, and the rendering result closely matches the requirement text. Because the requirement text is converted into a dependency syntax tree and reasoning modules are assembled on that tree, the association between scene graph features and text features is strengthened and localization becomes more accurate. The invention also designs a regularizer to constrain the scene generation model so that it modifies the visual information of the target region precisely without affecting other, unrelated regions. The method does not require the editing region to be delineated manually: it understands the text requirement directly and performs fine-grained local editing on the original scene graph. The user can state editing requirements in text form and add, delete, or modify content in the scene graph, achieving convenient and controllable rendering.

Description

Scene rendering method and device for fine-grained semantic control
Technical Field
The invention relates to the technical field of scene rendering, and in particular to a scene rendering method and device with fine-grained semantic control.
Background
With the rapid development of artificial intelligence and the continuous rise in living standards, demand for the spatial design of indoor and outdoor scenes keeps growing. Visual scene design has become an important application, and the corresponding rendering technology has gradually attracted wide attention from researchers. Whether in art, science, or engineering, computer technology is increasingly applied to scene design. The traditional scene design workflow first analyzes the user's requirements; a designer then assembles and constructs the scene according to those requirements, and the visualized result is finally presented to the user through rendering. Rendering involves a large number of computational steps, requires high-performance machines, and consumes a large amount of running time; a complex scene can even take days. Moreover, in spatial design the user's requirements are often ambiguous, and some users cannot even state specific design requirements. They tend to view an effect image and request modifications in order to settle the design. For example, in an indoor scene design the user may be dissatisfied with a certain wall and ask for it to be changed into a window to admit more light. For such a request, the designer has to manually delineate the area to be modified and adjust the relevant scene graph components, for instance replacing a small patch of the wall with a window, and then re-render the global scene for output. Faced with constantly changing user requirements, global rendering multiplies the design cost and holds back the move toward simpler, more accessible scene design. One solution is to provide interactive scene design and quick preview, making indoor design intelligent and accelerating scene rendering. The user only needs to submit a modification requirement described in text; the machine understands the requirement through natural language processing, automatically locates the region to be modified, edits it, and renders the result. Such local rendering, rather than global rendering, can effectively reduce cost. However, this convenient approach involves two kinds of challenges: how to understand the user's text description so as to locate the operation region accurately, and how to keep the locally rendered result compatible with the surrounding region.
Conventional rendering techniques fall into two classical categories: rasterization and ray tracing. Rasterization is mainly used for real-time rendering. It first projects each vertex of the three-dimensional scene onto the two-dimensional screen, then rasterizes the projected primitives into individual fragments, and shades each fragment based on lighting calculations. Finally, after sorting and discarding, the shaded fragments are written to the color buffer and become colors on the screen through the swap-chain operation of the graphics interface. Compared with rasterization, ray-tracing-based rendering is more realistic, but the computation time grows greatly with the geometric increase in the number of rays. Such methods trace rays cast from the camera into the scene, trying to find possible paths between objects and light sources so as to simulate the real behavior of light in the medium and output a rendered image. Rendering of this kind typically updates the entire scene graph and is difficult to modify locally. This global update is costly when dealing with the user's changing modification requirements.
To address this problem, a straightforward option is locally editable rendering. For this task, the currently prevailing methods fall into two categories. The first requires the user to manually delineate the area to be updated and then edit it. Editing may be done by searching for suitable replacement content through a search engine. Alternatively, a neural network can learn the associations between the visual information of regions, i.e. map them into a high-dimensional latent space, so as to find relevant content for editing. However, replacement-by-retrieval makes it difficult to keep the edited region compatible with its neighbors, and problems such as uncoordinated or overly abrupt seams easily arise. Neural-network-based methods, in turn, often rely on posterior knowledge to capture the correspondence between latent codes and scene graph regions. Because the process of identifying latent semantics depends heavily on the analyzed samples, different samples lead to different identified structures, which makes the model unstable. Moreover, finding semantically rich subspaces in a given high-dimensional latent space is a very challenging task, and existing steering models are typically linear, which limits the diversity of the editing results. Besides this manual delineation of regions, a second class of editable rendering techniques allows the user to locate the editing region automatically merely by providing a textual description of the requirement. This simplifies the process and lowers the barrier to using the technique, thereby improving the user experience. Achieving such functionality requires text understanding techniques and support from image object detection. The mainstream object detection techniques can be divided into single-stage and two-stage methods. Single-stage detection models use dense sampling to classify targets directly in a single pass and locate them with predefined boxes or keypoints of various scales and aspect ratios. Two-stage detection models first generate a number of object proposals in the image and then classify and localize them in a second stage.
At present, semantically steerable rendering techniques remain very weak; only a few related techniques have been studied, such as image editing techniques that modify an image according to text. To achieve this goal, CN114612290A matches the requirement text against preset text templates to determine an instruction feature code, which is added to the feature code of the image to be edited to complete the modification. CN114612290A designs a dedicated image editing network for each text template, which accelerates model training. Such methods support only a limited set of modification operations and scale poorly, because the text templates must be fixed in advance. For this reason, CN115035213A directly fuses the feature code of the image to be edited with the feature code of the requirement text through residual blocks to obtain a hidden-layer image feature code, which is then added to the original image feature code to complete the modification of the image information. However, these methods generally feed the text and the image into the model as raw features and generate the result directly. Because the details are neither analyzed nor understood, they struggle to support fine-grained editing and easily produce poor editing results or images that do not match the requirement text. For example, without understanding terms such as "right" and "above", it is difficult to determine whether the user wants to modify the "sofa" or the "wall". Moreover, generating the result requires learning the mapping between the text, the image, and the modified output, which demands a large amount of labeled training data that is costly to obtain.
Disclosure of Invention
The invention aims to provide a scene rendering method and device with fine-grained semantic control that can understand the fine-grained requirements in a user's description, automatically identify the objects in the scene graph that need to be changed, generate suitable new content, and render the result.
In order to achieve the above object, the present invention provides a scene rendering method with fine-grained semantic control, comprising the following steps:
S1: input the scene graph I into a target detection unit; the target detection unit locates each entity object in the scene graph I, encodes the visual information of the region in which each entity lies, and outputs a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions and each element v_i ∈ R^{d_v} of V describes the visual information of the entity object in the i-th region and the entity category it belongs to, d_v being the region feature code dimension;
S2: input the requirement text Q = {q_1, …, q_m} into a semantic understanding unit; the semantic understanding unit uses a syntax parser to convert the requirement text Q = {q_1, …, q_m} into a dependency syntax tree whose node feature set H = {h_1, …, h_m}, h_t ∈ R^{d_q}, describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q ∈ R^{d_q} of the text, understands the user's intention, and generates the corresponding manipulation instruction code op ∈ R^{d_v}, where m is the length of the requirement text, d_q is the text feature code dimension, and the dimension of op is consistent with the region feature code dimension d_v output by the scene graph target detection unit; the dependency syntax tree node feature set H serves as reference information to help the downstream localization unit locate the region to be updated accurately;
S3: input the region feature code set V = {v_1, …, v_M} from the target detection unit and the dependency syntax tree node feature code set H = {h_1, …, h_m} of the requirement text from the semantic understanding unit into a localization inference unit, which computes for every image region i a localization score S(v_i, H) with respect to H and takes the region with the highest score as the editing region;
the localization inference unit uses a tree-structured modular network that maps the visual grounding process onto the dependency syntax tree; a neural module network that computes temporary localization scores for the regions is assembled at every node of the tree, the final localization score of each region is obtained by integrating the temporary scores bottom-up, the region l with the highest localization score is selected as the region to be edited, and the feature code of the region to be edited is v_l;
S4: input the editing operation op identified from the text by the semantic understanding unit and the feature code v_l of the region to be edited l determined by the localization inference unit into a content rendering unit; the content rendering unit modifies the feature code v_l of the region to be edited l according to the editing operation op identified from the text and feeds the modified features into a generator for rendering; the content rendering unit uses a GAN network as the generator and designs a regularizer to train it, i.e. the manipulation instruction code op is added to the feature code v_l of the region to be edited as v_l + α·op, and the modified region feature code set is fed into the GAN network to output the rendered scene, where α is a preset parameter.
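For orientation, the sketch below traces the data flow of steps S1 to S4 in Python. The interfaces detector, semantic_unit, localizer, and generator are hypothetical placeholders for the units described in the following sections, and the additive injection v_l + alpha * op follows the reconstruction given above; none of these names comes from the patent itself.

```python
def render_with_text(scene_image, requirement_text,
                     detector, semantic_unit, localizer, generator, alpha=0.1):
    # S1: detect entity regions in the scene graph and encode their visual features
    V = detector(scene_image)                    # list of M region codes v_i (dimension d_v)

    # S2: parse the requirement text into a dependency syntax tree, encode its
    #     nodes, and derive the manipulation instruction code
    H, op = semantic_unit(requirement_text)      # node codes h_t (dimension d_q) and op (dimension d_v)

    # S3: score every region against the tree and pick the region to be edited
    scores = [localizer(v_i, H) for v_i in V]    # S(v_i, H)
    l = max(range(len(V)), key=lambda i: scores[i])

    # S4: inject the instruction code into the selected region code and re-render
    V[l] = V[l] + alpha * op
    return generator(V)                          # rendered scene
```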
preferably, in step S1, the object detection unit comprises a backbone feature extraction network, a high-order feature modeling network and an object detector,
the trunk feature extraction network is formed by combining a CSPNet network and a DarkNet53 full convolution neural network; CSPNet inputs original input feature V 0 Divided into two parts V' 0 And V' 0 And input into a cross-phase hierarchy with dual paths for merging; the dark net53 full convolutional neural network contains 53 convolutional layers, each followed by a batch regularization layer and an activation layer; the DarkNet53 full convolution neural network has no pooling layer, and a convolution layer with a stride of 2 is used for replacing the pooling layer to carry out the down sampling process of the feature map; the DarkNet53 full convolution neural network also comprises 5 CSPNet modules;
The high-order characteristic modeling network consists of an SPP network and a PANet network; the SPP network carries out maximum pooling operation under multiple scales on the feature map, improves the receptive field of the network, and discovers important context information in the scene map; the PANet network adds a bottom-up shortcut on the basis of the feature pyramid network FPN, so that fine-grained local information can be directly used for the top layer;
the target detector adopts an anchor frame detection algorithm to predict a target entity, consists of three YOLO heads, uses a K-means algorithm to cluster sample targets to obtain the prior frame size, and further calculates the size and the position of a prediction frame where a target object is positioned by using a relative offset; and the CIOU error is used instead, the real frame where the target object is located is set as a, the calculated prediction frame is set as b, and the calculation of the CIOU error is as follows:
wherein IOU is provided with (a,b) For the intersection ratio, ρ, of the real frame and the predicted frame 2 (a, b) is the euclidean distance between the center points of the real and predicted frames, d is the diagonal distance of the smallest frame containing the real and predicted frames, (gw, gh) and (pw, ph) are the width and height of the real and predicted frames, respectively.
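The CIOU expression itself appears only as an image in the source. Assuming it follows the standard CIoU definition, which uses exactly the quantities named above (the IoU, the squared center distance, the enclosing-box diagonal, and the two boxes' widths and heights), a minimal sketch is:

```python
import math

def ciou_loss(a, b):
    """CIoU between a ground-truth box a and a predicted box b, each (cx, cy, w, h).
    Standard CIoU is assumed; the patent's exact formula is not reproduced here."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)

    # IOU_(a,b): intersection-over-union of the two boxes
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    iou = inter / union

    # rho^2(a, b): squared distance between the box centers
    rho2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # d^2: squared diagonal of the smallest box enclosing both
    d2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2

    # aspect-ratio consistency term built from (gw, gh) and (pw, ph)
    gw, gh, pw, ph = a[2], a[3], b[2], b[3]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + 1e-9)

    return 1 - iou + rho2 / d2 + alpha * v
```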
As a preferred scheme, a depthwise separable convolution network replaces the standard 3×3, stride-1 convolutional layers in the CSPNet modules. The depthwise separable convolution consists of a depthwise convolution and a pointwise convolution: the depthwise convolution processes the input feature map with kernels of size k×k, the number of kernels being kept equal to the number of feature map channels c_1; the pointwise convolution uses c_2 kernels of size 1×1 to integrate the results of the depthwise convolution and change the number of channels of the final output feature map.
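A minimal PyTorch sketch of such a replacement block follows; PyTorch itself and the batch-norm/leaky-ReLU pairing after the pointwise stage are assumptions, since the patent does not name a framework.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Stand-in for a 3x3, stride-1 standard convolution inside a CSPNet residual
    block: a k x k depthwise convolution (one kernel per input channel, c1 kernels
    in total) followed by a 1x1 pointwise convolution with c2 kernels that mixes
    channels and sets the output channel count."""
    def __init__(self, c1, c2, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c1, c1, k, stride=1, padding=k // 2, groups=c1, bias=False)
        self.pointwise = nn.Conv2d(c1, c2, 1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```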
In the SPP network, the input feature map with c channels is fed into the optimized network in two parts, one of which is concatenated with the multi-scale features through a skip connection; in the PANet network, two consecutive residual blocks wrapped by a skip connection replace the 5 consecutive convolutional layers of the original network.
In step S2, the semantic understanding unit obtains the syntax tree of the requirement text with an open-source dependency syntax parser and then represents the features of the nodes in the tree with a bidirectional tree LSTM neural network. The bidirectional tree LSTM network can capture the dependency relations between nodes in the tree and encodes the contextual semantic information of parent and child nodes. Specifically, for a node t in the tree, the unit feeds the representation vector W_emb·e_t of the corresponding word q_t into the bidirectional tree LSTM network and finally obtains the feature code h_t of node t from the bottom-up output h_t↑ and the top-down output h_t↓ of the network, where W_emb is a trainable word embedding matrix and e_t is the one-hot encoding of the corresponding word. Finally, the mean of all node feature codes in the tree is taken as the global feature code q of the requirement text, and q is mapped to the corresponding manipulation instruction code op = W_op·q + b_op, where W_op and b_op are a trainable weight and bias term, respectively.
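A compact sketch of this output stage is given below. The tree LSTM itself is abstracted into the per-node states h_t↑ and h_t↓ it produces; combining them by concatenation and producing op with a single linear layer are assumptions consistent with, but not spelled out in, the description above.

```python
import torch
import torch.nn as nn

class TreeNodeEncoder(nn.Module):
    """Turns bidirectional tree-LSTM states into node codes h_t, the global text
    code q, and the manipulation instruction code op (assumed shapes)."""
    def __init__(self, d_hidden, d_q, d_v):
        super().__init__()
        self.proj = nn.Linear(2 * d_hidden, d_q)   # h_t from [h_t_up ; h_t_down]
        self.to_op = nn.Linear(d_q, d_v)           # op = W_op q + b_op

    def forward(self, h_up, h_down):
        # h_up, h_down: (num_nodes, d_hidden) bottom-up / top-down tree-LSTM states
        H = self.proj(torch.cat([h_up, h_down], dim=-1))   # node feature codes h_t
        q = H.mean(dim=0)                                  # global text feature code
        op = self.to_op(q)                                 # manipulation instruction code
        return H, q, op
```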
Preferably, in step S3 the localization inference unit comprises a scoring module, an integration module, and a relation reasoning module, each of which updates a temporary localization score for every region of the scene graph at its node t and passes the result to the parent node.
Scoring module: it evaluates the similarity between the feature code of each scene graph region and the feature code of the target node in the tree, and it is assembled only at leaf nodes and the root node. It first computes the similarity between each region feature code v_i and the node feature code h_t as S_vh(v_i, h_t) = fc(L2norm(fc(v_i) ⊙ h_t)), where fc is a fully connected layer. If the module sits at a leaf node, the computed similarity is used directly as the temporary localization score and is passed to the module assembled at the parent node; if it sits at the root node, the similarity is added to the temporary scores of the other child nodes for that region to give the final localization score, where N_t is the set of child nodes under node t.
Integration module: as shown in the following equation, this module integrates the temporary localization scores passed to node t by its child nodes and passes the result to the parent node,
Relation reasoning module: this module performs compositional reasoning according to the relation word between two syntactic constituents in the requirement text, as follows:
it first aggregates the region feature codes according to the similarity between each region feature code and the node feature code to obtain a context region feature code, which characterizes the region information used to locate the scene graph according to the textual relation; this context region feature code is then used to compute the temporary localization scores of the individual regions, as shown in the following equation:
where v_i denotes the feature code of each visual region and h_t denotes the feature code of a node in the syntax tree. L2norm denotes the L2 norm, i.e. the square root of the sum of the squared elements of a vector; keeping the L2 regularization term ||W||_2 small makes every element of W small, close to but not equal to zero, which gives the resulting model strong resistance to interference. The Softmax function maps the outputs of multiple neurons into the interval (0, 1) to support multi-class prediction.
The scoring, integration, and relation reasoning modules are assembled onto the nodes of the dependency syntax tree according to the following rule: the scoring module is first assembled at all leaf nodes and at the root node; for the other nodes, whether the integration module or the relation reasoning module is assembled is determined from the node feature code, as shown in the following formula. Finally, the unit selects the region l with the highest localization score at the root node module as the region to be edited.
EditArea ← argmax softmax(fc(h_t)).
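The sketch below assembles the three module types in PyTorch. Layer sizes and the exact form of the relation-reasoning score are assumptions (the corresponding equations are only images in the source); fc denotes a fully connected layer and ⊙ an element-wise product, as in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeModules(nn.Module):
    """Scoring, integration, and relation-reasoning modules assembled on the tree."""
    def __init__(self, d):
        super().__init__()
        self.fc_v = nn.Linear(d, d)     # fc applied to each region code v_i
        self.fc_s = nn.Linear(d, 1)     # fc producing the scalar similarity
        self.fc_sel = nn.Linear(d, 2)   # assembly rule: integrate vs. relate

    def score(self, V, h_t):
        # S_vh(v_i, h_t) = fc(L2norm(fc(v_i) ⊙ h_t)), computed for every region i
        x = F.normalize(self.fc_v(V) * h_t, dim=-1)
        return self.fc_s(x).squeeze(-1)            # temporary scores, one per region

    def integrate(self, child_scores):
        # sum the temporary scores handed up by the child nodes of t
        return torch.stack(child_scores, dim=0).sum(dim=0)

    def relate(self, V, h_t):
        # aggregate region codes into a context code, weighted by their similarity
        # to the node, then score every region against that context
        w = F.softmax(self.score(V, h_t), dim=0)
        ctx = (w.unsqueeze(-1) * V).sum(dim=0)
        x = F.normalize(self.fc_v(V) * ctx, dim=-1)
        return self.fc_s(x).squeeze(-1)

    def choose(self, h_t):
        # module-selection rule for internal nodes: argmax softmax(fc(h_t))
        return int(torch.argmax(F.softmax(self.fc_sel(h_t), dim=-1)))
```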
In step S4, the content rendering unit uses a GAN network as the generator and designs a regularizer to constrain the model so that it modifies only the visual information of the target region without affecting other, unrelated regions; in addition, adversarial learning maximizes the compatibility between the editing region and its neighboring regions.
As a preferred scheme, for the scene region feature code set V = {v_1, …, v_M}, a noise vector ε_1 is sampled from a standard Gaussian distribution N(0, I) to perturb the feature code v_l of the region to be edited, and a second sample ε_2 is drawn to perturb the feature codes of the other regions. Correspondingly, the GAN network outputs two perturbed scene images, Ĩ_1 and Ĩ_2.
On the one hand, perturbing the feature code of the region to be edited should not affect the information of the other regions, as shown in the following formula, where M_1 is a binary mask that covers the region to be edited so that only the changes in the other regions are considered.
On the other hand, perturbing the feature codes of the other regions should not affect the information of the region to be edited, as shown in the following formula, where M_2 is a binary mask that covers the other regions so that only the changes in the region to be edited are considered. Finally, the training objective of the GAN network is the sum of the adversarial generation loss and the two perturbation regularization losses, where λ_1 and λ_2 are preset parameters.
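The regularization terms themselves are images in the source. The sketch below instantiates them as masked squared distances between re-rendered scenes, which is one plausible reading of the description; the tensor shapes, the masked-L2 form, and the way the adversarial loss is passed in are all assumptions.

```python
import torch

def rendering_training_loss(G, adv_loss, V, l, M1, M2, op, alpha, lam1, lam2):
    """Regularized training objective for the generator G; M1 masks out the
    editing region l, M2 masks out every other region (assumed semantics)."""
    V_mod = [v.clone() for v in V]
    V_mod[l] = V_mod[l] + alpha * op               # injected manipulation code
    I_hat = G(V_mod)                               # rendered scene

    # perturb only the code of the region to be edited ...
    V1 = [v.clone() for v in V_mod]
    V1[l] = V1[l] + torch.randn_like(V1[l])
    # ... and, separately, only the codes of the other regions
    V2 = [v + torch.randn_like(v) if i != l else v.clone()
          for i, v in enumerate(V_mod)]

    # perturbing v_l must leave the unedited area unchanged (region l is masked out)
    loss_other = ((M1 * (G(V1) - I_hat)) ** 2).mean()
    # perturbing the other codes must leave the edited area unchanged
    loss_edit = ((M2 * (G(V2) - I_hat)) ** 2).mean()

    # overall objective: adversarial generation loss plus the weighted regularizers
    return adv_loss + lam1 * loss_other + lam2 * loss_edit
```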
the invention also provides a scene rendering device for fine granularity semantic manipulation, which comprises:
the target detection unit is used for positioning each entity object in the scene graph I, coding visual information of the region where the entity is located, and outputting a region feature coding set V= { V 1 ,…,v M M is the number of detected regions, the elements in VVisual information describing the physical object in the ith region and the physical category to which it belongs, where d v Encoding dimensions for the region features;
semantic understanding unit for parsing requirements using a grammar parserText q= { Q 1 ,…,q m Conversion into dependency syntax tree, feature set of corresponding tree nodes isIt is used to describe the fine-grained semantics of the demand text; on the basis, the global feature code +. >Understanding user intent to generate corresponding manipulation instruction codesWherein m is the required text length, d q For the demanded text feature coding dimension, the op dimension and the region feature coding dimension d output by the scene graph target detection unit v Consistent, the dependency syntax tree node characteristic set H is used as reference information to assist a post-positioned positioning unit to accurately position the area to be updated;
a positioning reasoning unit for coding the set V= { V according to the region characteristics from the target detection unit 1 ,…,v M Dependency syntax tree node feature code set for } and demand text from semantic understanding unit Positioning the area to be edited, calculating each image area i about +.>Location score S (v) i H) so as to determine the one with the highest score as the edit area,
the positioning reasoning unit adopts a tree-shaped modularized network to normalize the visual positioning processIn the dependency syntax tree, assembling a neural module network for calculating temporary location scores of areas for each node in the tree, finally obtaining the location score of each area by integrating the temporary scores from bottom to top, selecting the area l with the highest location score as the area to be edited, and encoding the characteristics of the area l to be edited as v l
A content rendering unit for encoding v the characteristics of the region to be edited l according to the editing operation op identified from the text l Modifying, and inputting the modified characteristics into a generator for rendering; the content rendering unit takes a GAN network as a generator, and designs a regularizer to train the generator, namely, adds a control instruction code op into a feature code v of a region to be edited l In the method, the modified regional characteristic code set is input into a GAN network to output a rendered sceneWherein alpha is a preset parameter, and the alpha is a preset parameter,
as a preferred solution, the target detection unit includes a backbone feature extraction network, a high-order feature modeling network, and a target detector; the trunk feature extraction network is formed by combining a CSPNet network and a DarkNet53 full-convolution neural network, the DarkNet53 full-convolution neural network comprises 53 convolution layers, each layer is followed by a batch regularization layer and an activation layer, no pooling layer exists, a step-2 convolution layer is used for replacing the pooling layer to carry out a downsampling process of a feature map, the DarkNet53 full-convolution neural network also comprises 5 CSPNet modules, and depth separable convolution is introduced to optimize residual modules in the CSPNet structure; the high-order characteristic modeling network consists of an SPP network and a PANet network; the target detector consists of three YOLO heads;
the semantic understanding unit uses an open-source dependency grammar parser to obtain a grammar tree of the required text. Then, the characteristics of the nodes in the tree are represented by a bidirectional tree-shaped LSTM neural network;
The positioning reasoning unit comprises a scoring module, an integrating module and a relation reasoning module; the scoring module is used for evaluating the similarity of the regional feature codes of each scene graph and the feature codes of the target nodes in the tree, and the scoring module is only assembled to leaf nodes and root nodes; the integration module is used for integrating temporary positioning scores transmitted to the node t by the child node and transmitting the result to the father node; the relation reasoning module is used for executing compound machine reasoning according to the relation words between the two grammar components in the demand text.
Compared with the prior art, the invention has the following beneficial effects:
According to the invention, a dependency syntax tree is constructed in the semantic understanding unit, so that the fine-grained semantics of the requirement text can be described and the fine-grained information in the text is identified, which locates the image region accurately and makes the rendering result match the requirement text closely. Because the requirement text is converted into a dependency syntax tree and reasoning modules are assembled on that tree, the association between scene graph features and text features is strengthened and localization becomes more accurate. The invention also designs a regularizer to constrain the scene generation model so that it modifies the visual information of the target region precisely without affecting other, unrelated regions. The method does not require the editing region to be delineated manually: it understands the text requirement directly and performs fine-grained local editing on the original scene graph. The user can state editing requirements in text form and add, delete, or modify content in the scene graph, achieving convenient and controllable rendering.
Drawings
FIG. 1 is a flow chart of the scene rendering method with fine-grained semantic control according to an embodiment of the present invention.
Fig. 2 is a block diagram of the scene rendering method with fine-grained semantic control according to an embodiment of the present invention.
Fig. 3 is a block diagram of a backbone feature extraction network according to an embodiment of the invention.
Fig. 4 is a depth separable convolutional architecture diagram of a dark net53 full convolutional neural network in accordance with an embodiment of the present invention.
FIG. 5 is a SPP network block diagram of a high-level feature modeling network of an embodiment of the present invention.
Fig. 6 is a diagram of a PANet network architecture of a high-order feature modeling network in accordance with an embodiment of the present invention.
Fig. 7 is a structural diagram of a semantic understanding unit of an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
Example 1
As shown in figs. 1 to 7, a scene rendering method with fine-grained semantic control according to a preferred embodiment of the present invention comprises the following steps:
S1: input the scene graph I into a target detection unit; the target detection unit locates each entity object in the scene graph I, encodes the visual information of the region in which each entity lies, and outputs a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions and each element v_i ∈ R^{d_v} of V describes the visual information of the entity object in the i-th region and the entity category it belongs to, d_v being the region feature code dimension;
S2: input the requirement text Q = {q_1, …, q_m} into a semantic understanding unit; the semantic understanding unit uses a syntax parser to convert the requirement text Q = {q_1, …, q_m} into a dependency syntax tree whose node feature set H = {h_1, …, h_m}, h_t ∈ R^{d_q}, describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q ∈ R^{d_q} of the text, understands the user's intention, and generates the corresponding manipulation instruction code op ∈ R^{d_v}, where m is the length of the requirement text, d_q is the text feature code dimension, and the dimension of op is consistent with the region feature code dimension d_v output by the scene graph target detection unit; the dependency syntax tree node feature set H serves as reference information to help the downstream localization unit locate the region to be updated accurately;
S3: input the region feature code set V = {v_1, …, v_M} from the target detection unit and the dependency syntax tree node feature code set H = {h_1, …, h_m} of the requirement text from the semantic understanding unit into a localization inference unit, which computes for every image region i a localization score S(v_i, H) with respect to H and takes the region with the highest score as the editing region;
the localization inference unit uses a tree-structured modular network that maps the visual grounding process onto the dependency syntax tree; a neural module network that computes temporary localization scores for the regions is assembled at every node of the tree, the final localization score of each region is obtained by integrating the temporary scores bottom-up, the region l with the highest localization score is selected as the region to be edited, and the feature code of the region to be edited is v_l;
S4: input the editing operation op identified from the text by the semantic understanding unit and the feature code v_l of the region to be edited l determined by the localization inference unit into a content rendering unit; the content rendering unit modifies the feature code v_l of the region to be edited l according to the editing operation op identified from the text and feeds the modified features into a generator for rendering; the content rendering unit uses a GAN network as the generator and designs a regularizer to train it, i.e. the manipulation instruction code op is added to the feature code v_l of the region to be edited as v_l + α·op, and the modified region feature code set is fed into the GAN network to output the rendered scene, where α is a preset parameter.
according to the embodiment, the dependency syntax tree is built in the semantic understanding unit, the fine granularity semantics of the required text can be described, fine granularity information in the required text is further identified, the picture area is accurately positioned, and high matching between the rendering result and the required text is achieved. In addition, the demand text is converted into a dependency syntax tree, and an inference module is assembled on the tree, so that the association between scene graph features and text features is enhanced, and the positioning is more accurate. The embodiment also designs a regularization device to restrict the scene generation model, so that the visual information of the target area can be accurately modified without affecting other irrelevant areas. According to the embodiment, the editing area is not required to be manually defined, the text requirement can be directly understood, and fine-granularity local editing can be performed on the original scene graph. The user can put forward the editing requirement in a text form, and add, delete and modify the scene graph, so that convenient and controllable rendering is realized.
Example 2
This embodiment differs from the first embodiment in that, building on the first embodiment, it describes each step of the rendering method in more detail.
The scene rendering method with fine-grained semantic control of this embodiment comprises the following steps:
S1: input the scene graph I into a target detection unit; the target detection unit locates each entity object in the scene graph I, encodes the visual information of the region in which each entity lies, and outputs a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions and each element v_i ∈ R^{d_v} of V describes the visual information of the entity object in the i-th region and the entity category it belongs to, d_v being the region feature code dimension.
The target detection unit identifies the things present in the scene graph, such as a wall or a table, and encodes the visual information of the regions in which these entities lie, thereby obtaining the scene graph region features. Indoor scenes have complex structures and rich, diverse target elements, and under the influence of factors such as lighting, viewing angle, and mutual occlusion of targets, traditional target detection algorithms struggle to meet practical requirements in detection accuracy and response speed. For this reason, the unit detects the target entities in the scene graph based on the YOLOv4 framework. Given an input scene graph I, it locates the entity objects in the graph, such as "sofa", "window", and "table".
Specifically, in step S1 the target detection unit comprises a backbone feature extraction network, a high-order feature modeling network, and a target detector.
The backbone feature extraction network combines a CSPNet network with a DarkNet53 fully convolutional neural network. As shown in fig. 3, CSPNet splits the original input feature V_0 into two parts V'_0 and V''_0 and feeds them into a cross-stage hierarchy with dual paths for merging. The DarkNet53 fully convolutional network contains 53 convolutional layers, each followed by a batch normalization layer and an activation layer (leaky ReLU). DarkNet53 has no pooling layers; convolutional layers with stride 2 replace pooling for downsampling the feature maps, which effectively prevents the loss of low-level features that pooling layers would cause. The DarkNet53 network also contains 5 CSPNet modules. This structure not only reduces the number of parameters but also extracts rich and comprehensive deep-level features, overcomes the redundant gradient information generated during back-propagation, and strengthens the learning ability of the network. To reduce the computational cost and obtain a lightweight network, this module introduces depthwise separable convolutions to optimize the residual modules in the CSPNet structure: the standard 3×3, stride-1 convolutional layers in the CSPNet modules are replaced by depthwise separable convolutions, which effectively reduces the network parameters and improves the response speed. As shown in fig. 4, the depthwise separable convolution consists of a depthwise convolution and a pointwise convolution: the depthwise convolution processes the input feature map with kernels of size k×k, the number of kernels being kept equal to the number of feature map channels c_1; the pointwise convolution uses c_2 kernels of size 1×1 to integrate the results of the depthwise convolution and change the number of channels of the final output feature map. In the original network, the number of parameters A_1 and the computation B_1 required by a standard k×k convolution are given by formula (1), and the number of parameters A_2 and the computation B_2 required by a depthwise separable convolution with the same kernel size are given by formula (2). Combining formulas (1) and (2) shows that the parameters and computation of the improved convolution operation are reduced to 1/c_2 + 1/k² of the original convolution.
The high-order feature modeling network consists of an SPP network and a PANet network. The SPP network applies max pooling at multiple scales to the feature map, enlarging the receptive field of the network and uncovering important context information in the scene graph. The PANet network adds a bottom-up shortcut on top of the feature pyramid network FPN, so that fine-grained local information can be used directly at the top layer, strengthening the network's ability to detect targets of different sizes. Given the benefit of CSPNet for model training, this module optimizes the structure of the SPP and PANet networks. Fig. 5 compares the SPP structure before and after optimization: the input feature map with c channels is fed into the optimized network in two parts, one of which is concatenated with the multi-scale features through a skip connection. As shown in fig. 6, in the PANet network two consecutive residual blocks wrapped by a skip connection replace the 5 consecutive convolutional layers of the original network.
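A minimal sketch of the SPP block follows. The kernel sizes are an assumption (5/9/13 is the usual YOLOv4 choice), and the CSP-style split and skip connection described above are omitted for brevity.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Parallel max pooling of the same feature map at several scales,
    concatenated with the input along the channel axis."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # multi-scale context: the original map plus its pooled versions
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```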
The target detector predicts target entities with an anchor-box detection algorithm and consists of three YOLO heads. A K-means algorithm clusters the sample targets to obtain the prior box sizes, from which the size and position of the prediction box containing a target object are computed using relative offsets. The module uses the CIOU error, which evaluates the loss through the Euclidean distance between boxes even when the detection box and the real box do not overlap, and which additionally takes the width and height loss of the boxes into account, ensuring that the prediction box fits the real box more closely. Let the real box of the target object be a and the prediction box be b; the CIOU error is given by formula (3):
where IOU_(a,b) is the intersection-over-union of the real box and the prediction box, ρ²(a,b) is the Euclidean distance between the center points of the real box and the prediction box, d is the diagonal length of the smallest box containing both, and (gw, gh) and (pw, ph) are the width and height of the real box and the prediction box, respectively.
S2: input the requirement text Q = {q_1, …, q_m} into a semantic understanding unit; the semantic understanding unit uses a syntax parser to convert the requirement text Q = {q_1, …, q_m} into a dependency syntax tree whose node feature set H = {h_1, …, h_m}, h_t ∈ R^{d_q}, describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q ∈ R^{d_q} of the text, understands the user's intention, and generates the corresponding manipulation instruction code op ∈ R^{d_v}, where m is the length of the requirement text, d_q is the text feature code dimension, and the dimension of op is consistent with the region feature code dimension d_v output by the scene graph target detection unit; the dependency syntax tree node feature set H serves as reference information to help the downstream localization unit locate the region to be updated accurately, as shown in fig. 7.
The semantic understanding unit is responsible for understanding the semantics of the user's modification requirement description. Conventional approaches typically encode the whole text description into a single feature vector and then compute the similarity between that vector and each visual region feature code in order to locate the image region to be edited. Because the fine-grained semantic information in the text, such as the entities and the relations between them, is not modeled, these methods struggle to understand the fine-grained information contained in complex requirement texts; they find it hard to locate the image region accurately, leading to large deviations between the rendering result and the requirement text. To address this, the present embodiment adopts the approach of step S2 above, converting the requirement text Q = {q_1, …, q_m} into a dependency syntax tree to achieve fine-grained analysis of the requirement text and to improve text semantic understanding and reasoning ability.
Specifically, in step S2 the semantic understanding unit obtains the syntax tree of the requirement text with an open-source dependency syntax parser (i.e. the spaCy tool) and then represents the features of the nodes in the tree with a bidirectional tree LSTM neural network. Unlike an ordinary LSTM network, the bidirectional tree LSTM network can capture the dependency relations between nodes in the tree and encodes the contextual semantic information of parent and child nodes. Specifically, for a node t in the tree, the unit feeds the representation vector W_emb·e_t of the corresponding word q_t into the bidirectional tree LSTM network and finally obtains the feature code h_t of the node from the bottom-up output h_t↑ and the top-down output h_t↓ of the network, where W_emb is a trainable word embedding matrix and e_t is the one-hot encoding of the corresponding word. Finally, the mean of all node feature codes in the tree is taken as the global feature code q of the requirement text, and q is mapped to the corresponding manipulation instruction code op = W_op·q + b_op, where W_op and b_op are a trainable weight and bias term, respectively.
S3: region feature code set v= { V from target detection unit 1 ,…,v M Dependency syntax tree node feature code set for } and demand text from semantic understanding unitInput a localization inference unit, as shown in formula (4), the localization inference unit calculates each image region i about +. >Location score S (v) i H) so as to determine the one with the highest score as the edit area,
the positioning reasoning unit adopts a tree-shaped modularized network to normalize the visual positioning process into a dependency syntax tree, assembles a neural module network for calculating temporary positioning scores of areas for each node in the tree, finally obtains the positioning score of each area by integrating the temporary scores from bottom to top, and selects the area with the highest positioning scoreThe domain l is used as a region to be edited, and the characteristic code of the region to be edited is v l
The positioning reasoning unit jointly considers the region feature coding set V= { V from the pre-unit 1 ,…,v M Dependency syntax tree node feature code set for } and demand textAnd positioning the area to be edited. In order to achieve accurate positioning, the model also calculates fine granularity association of a given scene graph and a required text on the basis of understanding the two. Therefore, the unit designs a tree-shaped modularized network, the visual positioning process is normalized to a dependency syntax tree, a neural module network for calculating temporary positioning scores of areas is assembled for each node in the tree, and the positioning scores of each area are finally obtained by integrating the temporary scores from bottom to top. Specifically, the designed tree-shaped modular network consists of three types of modules, including a scoring module, an integrating module and a relation reasoning module. Each module updates a temporary location score ++for each region of the scene graph at the corresponding node t > And delivers the result to the parent node. The three modules are described below:
In step S3, the localization inference unit comprises a scoring module, an integration module, and a relation reasoning module, each of which updates a temporary localization score for every region of the scene graph at its node t.
Scoring module: it evaluates the similarity between the feature code of each scene graph region and the feature code of the target node in the tree, and it is assembled only at leaf nodes and the root node. It first computes the similarity between each region feature code v_i and the node feature code h_t as S_vh(v_i, h_t) = fc(L2norm(fc(v_i) ⊙ h_t)), where fc is a fully connected layer. If the module sits at a leaf node, the computed similarity is used directly as the temporary localization score and is passed to the module assembled at the parent node; if it sits at the root node, the similarity is added to the temporary scores of the other child nodes for that region to give the final localization score, where N_t is the set of child nodes under node t.
Integration module: as shown in formula (5), this module integrates the temporary localization scores passed to node t by its child nodes and passes the result to the parent node.
Relation reasoning module: this module performs compositional reasoning according to the relation word between two syntactic constituents in the requirement text, as shown in formula (6).
It first aggregates the region feature codes according to the similarity between each region feature code and the node feature code to obtain a context region feature code, which characterizes the region information used to locate the scene graph according to the textual relation, such as "the wall above the sofa". This context region feature code is then used to compute the temporary localization scores of the individual regions, as shown in formula (7):
where v_i denotes the feature code of each visual region and h_t denotes the feature code of a node in the syntax tree. L2norm denotes the L2 norm, i.e. the square root of the sum of the squared elements of a vector; keeping the L2 regularization term ||W||_2 small makes every element of W small, close to but not equal to zero, which gives the resulting model strong resistance to interference. The Softmax function maps the outputs of multiple neurons into the interval (0, 1) to support multi-class prediction.
The scoring, integration, and relation reasoning modules are assembled onto the nodes of the dependency syntax tree according to the following rule: the scoring module is first assembled at all leaf nodes and at the root node; for the other nodes, whether the integration module or the relation reasoning module is assembled is determined from the node feature code, as shown in formula (8). Finally, the unit selects the region l with the highest localization score at the root node module as the region to be edited.
EditArea ← argmax softmax(fc(h_t)) (8).
S4: input the editing operation op identified from the text by the semantic understanding unit and the feature code v_l of the region to be edited l determined by the localization inference unit into a content rendering unit; the content rendering unit modifies the feature code v_l of the region to be edited l according to the editing operation op identified from the text and feeds the modified features into a generator for rendering. In recent years of artificial intelligence research, generative adversarial networks (GAN) have shown great potential in content generation. Considering that a GAN can generate high-quality, high-resolution images, the content rendering unit uses a GAN network as the generator and designs a regularizer to train it, ensuring that modifying the feature code of the target region does not affect other, unrelated regions, strengthening the compatibility between the modified region and its neighboring regions, and avoiding abrupt seams. As shown in formula (9), the unit adds the manipulation instruction code op to the feature code v_l of the region to be edited as v_l + α·op, and the modified region feature code set is fed into the GAN network to output the rendered scene, where α is a preset parameter.
In step S4, the content rendering unit uses a GAN network as the generator and, to train the GAN network, designs a regularizer that constrains the model to modify only the visual information of the target region without affecting other, unrelated regions; moreover, adversarial learning maximizes the compatibility between the editing region and its neighboring regions and avoids abrupt seams. Specifically, for the scene region feature code set V = {v_1, …, v_M}, a noise vector ε_1 is sampled from a standard Gaussian distribution N(0, I) to perturb the feature code v_l of the region to be edited, and a second sample ε_2 is drawn to perturb the feature codes of the other regions. Correspondingly, the GAN network outputs two perturbed scene images, Ĩ_1 and Ĩ_2.
On the one hand, perturbing the feature code of the region to be edited should not affect the information of the other regions, as shown in formula (10), where M_1 is a binary mask that covers the region to be edited so that only the changes in the other regions are considered.
On the other hand, perturbing the feature codes of the other regions should not affect the information of the region to be edited, as shown in formula (11), where M_2 is a binary mask that covers the other regions so that only the changes in the region to be edited are considered. Finally, the training objective of the GAN network is the sum of the adversarial generation loss and the perturbation regularization losses, where λ_1 and λ_2 are preset parameters, as shown in formula (12).
the embodiment provides a scene rendering method for fine-grained semantic manipulation. Unlike traditional rendering methods, the method can directly understand the text requirement without manually delineating an editing area, and can perform fine-granularity local editing on an original scene graph. The user can put forward the editing requirement in a text form, and add, delete and modify the scene graph, so that convenient and controllable rendering is realized. The text is more deeply understood, and the semantic reasoning capability is provided. In addition, the demand text is converted into a dependency syntax tree, and an inference module is assembled on the tree, so that the association between scene graph features and text features is enhanced, and the positioning is more accurate. The embodiment designs a regularization device to restrict the scene generation model, and can accurately modify the visual information of the target area without affecting other irrelevant areas. Compared with the prior art, the target detection network designed by the embodiment has the advantages of less parameter quantity, faster response speed, better discovery of deep features of the scene graph and higher indoor target detection accuracy.
Example III
This embodiment provides a scene rendering device implementing the fine-grained semantically manipulated scene rendering method of the first embodiment or the second embodiment, comprising:
a target detection unit, used for positioning each entity object in the scene graph I, encoding the visual information of the region where each entity is located, and outputting a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions; the element v_i ∈ R^{d_v} in V describes the visual information of the entity object in the i-th region and the entity category to which it belongs, where d_v is the region feature encoding dimension;
a semantic understanding unit, used for converting the requirement text Q = {q_1, …, q_m} into a dependency syntax tree with a grammar parser; the feature set of the corresponding tree nodes is H = {h_1, …, h_m}, which describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q of the requirement text and understands the user intent to generate the corresponding manipulation instruction code op, where m is the requirement text length and d_q is the requirement text feature encoding dimension; the dimension of op is consistent with the region feature encoding dimension d_v output by the scene graph target detection unit, and the dependency syntax tree node feature set H is used as reference information to assist the downstream positioning unit in accurately locating the region to be updated;
a positioning reasoning unit, used for locating the region to be edited according to the region feature code set V = {v_1, …, v_M} from the target detection unit and the dependency syntax tree node feature code set H of the requirement text from the semantic understanding unit, calculating for each image region i the localization score S(v_i, H) and determining the region with the highest score as the editing region,
the positioning reasoning unit adopts a tree-shaped modularized network to normalize the visual localization process onto the dependency syntax tree; a neural module network for calculating temporary localization scores of regions is assembled at each node in the tree, the final localization score of each region is obtained by integrating the temporary scores from bottom to top, the region l with the highest localization score is selected as the region to be edited, and the feature code of the region to be edited l is v_l;
a content rendering unit, used for modifying the feature code v_l of the region to be edited l according to the editing operation op identified from the text, and inputting the modified feature into a generator for rendering; the content rendering unit takes a GAN network as the generator and designs a regularizer to train the generator, namely, it adds the manipulation instruction code op to the feature code v_l of the region to be edited and inputs the modified region feature code set into the GAN network to output the rendered scene Î = G({v_1, …, v_l + α·op, …, v_M}), where α is a preset parameter.
Further, the target detection unit comprises a trunk feature extraction network, a high-order feature modeling network and a target detector; the trunk feature extraction network is formed by combining a CSPNet network and a DarkNet53 fully convolutional neural network; the DarkNet53 fully convolutional neural network comprises 53 convolution layers, each followed by a batch regularization layer and an activation layer, has no pooling layer, and uses convolution layers with stride 2 instead of pooling layers to downsample the feature map; it also comprises 5 CSPNet modules, and depth separable convolution is introduced to optimize the residual modules in the CSPNet structure; the high-order feature modeling network consists of an SPP network and a PANet network; the target detector consists of three YOLO heads;
the semantic understanding unit uses an open-source dependency grammar parser to obtain the grammar tree of the requirement text, and then represents the features of the nodes in the tree with a bidirectional tree-shaped LSTM neural network;
the positioning reasoning unit comprises a scoring module, an integration module and a relation reasoning module; the scoring module is used for evaluating the similarity between the feature code of each scene graph region and the feature code of the target node in the tree, and is only assembled at leaf nodes and the root node; the integration module is used for integrating the temporary localization scores passed to node t by its child nodes and transmitting the result to the parent node; the relation reasoning module is used for performing compound machine reasoning according to the relation word between two grammar components in the requirement text.
For other aspects of the apparatus of this embodiment, please refer to embodiment two.
In summary, the embodiment of the invention provides a fine-grained semantically manipulated scene rendering method. It is an interactive scene rendering method that can locally edit a scene graph; the editing operation is simple, can be triggered by a language description, and does not require manually delineating the editing position as in traditional manual editing. The user can state the editing description in text form and add, delete or modify part of the picture, realizing convenient and controllable rendering. The device first identifies the objects in the scene graph, then analyzes the user's textual requirement description to finely locate the scene region and operation type to be edited, and finally renders a high-quality scene graph based on an adversarial neural network. The method understands the user's requirement through natural language processing technology and can automatically locate the region to be edited and modified without manual delineation; the result is then rendered by editing that region. Unlike traditional approaches, this incremental rather than full update can significantly reduce operating cost and enables interactive rendering. The manipulation is data driven, and typical semantic features can be extracted from a large-scale image dataset. The user can specify the structure of the scene and modify the scene content; the method edits the scene according to the requirement and reduces synthetic operation artifacts, so that the rendering result is smoother. Specifically, the method first identifies the entities and relations in the scene graph, then analyzes the user's requirement description text to locate the region to be edited and the operation type. Then, an adversarial neural renderer is developed to edit the region, and through adversarial learning the rendered result is made to approximate the real image as closely as possible, solving the compatibility problem between the generated scene and the real result. Compared with traditional methods, this technique does not require the user to manually delineate the editing region, and is more flexible and user friendly. The visually adjustable scene design can effectively improve efficiency in application fields such as architectural space design. In addition, the invention also provides a fine-grained semantically manipulated scene rendering device, which has the above beneficial effects and is not repeated here.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present invention, and these modifications and substitutions should also be considered as being within the scope of the present invention.

Claims (10)

1. A scene rendering method for fine-grained semantic manipulation, characterized by comprising the following steps:
S1: inputting the scene graph I into a target detection unit, which positions each entity object in the scene graph I, encodes the visual information of the region where each entity is located, and outputs a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions; the element v_i ∈ R^{d_v} in V describes the visual information of the entity object in the i-th region and the entity category to which it belongs, where d_v is the region feature encoding dimension;
S2: inputting the requirement text Q = {q_1, …, q_m} into a semantic understanding unit, which converts the requirement text into a dependency syntax tree with a grammar parser; the feature set of the corresponding tree nodes is H = {h_1, …, h_m}, which describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q of the requirement text and understands the user intent to generate the corresponding manipulation instruction code op, where m is the requirement text length and d_q is the requirement text feature encoding dimension; the dimension of op is consistent with the region feature encoding dimension d_v output by the scene graph target detection unit, and the dependency syntax tree node feature set H is used as reference information to assist the downstream positioning unit in accurately locating the region to be updated;
S3: inputting the region feature code set V = {v_1, …, v_M} from the target detection unit and the dependency syntax tree node feature code set H of the requirement text from the semantic understanding unit into a positioning reasoning unit, which calculates for each image region i the localization score S(v_i, H) and determines the region with the highest score as the editing region,
the positioning reasoning unit adopts a tree-shaped modularized network to normalize the visual localization process onto the dependency syntax tree; a neural module network for calculating temporary localization scores of regions is assembled at each node in the tree, the final localization score of each region is obtained by integrating the temporary scores from bottom to top, the region l with the highest localization score is selected as the region to be edited, and the feature code of the region to be edited l is v_l;
S4: inputting the editing operation op identified from the text by the semantic understanding unit and the feature code v_l of the region to be edited l determined by the positioning reasoning unit into a content rendering unit, which modifies the feature code v_l of the region to be edited based on the editing operation op identified from the text and inputs the modified feature into a generator for rendering; the content rendering unit takes a GAN network as the generator and designs a regularizer to train the generator, namely, it adds the manipulation instruction code op to the feature code v_l of the region to be edited and inputs the modified region feature code set into the GAN network to output the rendered scene Î = G({v_1, …, v_l + α·op, …, v_M}), where α is a preset parameter.
2. The scene rendering method for fine-grained semantic manipulation according to claim 1, wherein in step S1, the target detection unit comprises a trunk feature extraction network, a high-order feature modeling network and a target detector,
the trunk feature extraction network is formed by combining a CSPNet network and a DarkNet53 fully convolutional neural network; CSPNet divides the original input feature V_0 into two parts V_0' and V_0'', which are input into two paths of a cross-stage hierarchical structure and then merged; the DarkNet53 fully convolutional neural network contains 53 convolution layers, each followed by a batch regularization layer and an activation layer; the DarkNet53 fully convolutional neural network has no pooling layer, and uses convolution layers with stride 2 instead of pooling layers to downsample the feature map; the DarkNet53 fully convolutional neural network also comprises 5 CSPNet modules;
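To make the cross-stage split-and-merge of CSPNet concrete, the following is a minimal PyTorch sketch; the class name, the channel-wise split and the 1×1 fusion convolution are illustrative assumptions rather than the exact layer configuration of the backbone.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Split the input feature map V0 into two channel halves, pass only one
    half through the residual stage, then concatenate and fuse the two paths."""
    def __init__(self, channels, stage):
        super().__init__()
        # `stage` is e.g. a stack of residual blocks mapping channels//2 -> channels//2
        self.stage = stage
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)            # V0 -> V0', V0''
        x2 = self.stage(x2)                          # only one path goes through the stage
        return self.fuse(torch.cat([x1, x2], dim=1)) # merge the two paths
```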
The high-order characteristic modeling network consists of an SPP network and a PANet network; the SPP network carries out maximum pooling operation under multiple scales on the feature map, improves the receptive field of the network, and discovers important context information in the scene map; the PANet network adds a bottom-up shortcut on the basis of the feature pyramid network FPN, so that fine-grained local information can be directly used for the top layer;
the target detector adopts an anchor-frame detection algorithm to predict target entities and consists of three YOLO heads; a K-means algorithm is used to cluster the sample targets to obtain the prior frame sizes, and the size and position of the prediction frame where the target object is located are further calculated from relative offsets; the CIoU error is used in place of the ordinary IoU error, the real frame where the target object is located is denoted a and the calculated prediction frame is denoted b, and the CIoU error is calculated as follows:

CIoU(a, b) = 1 − IoU(a, b) + ρ²(a, b)/d² + β·ν,  ν = (4/π²)·(arctan(gw/gh) − arctan(pw/ph))²,  β = ν/((1 − IoU(a, b)) + ν)

where IoU(a, b) is the intersection-over-union of the real frame and the prediction frame, ρ²(a, b) is the squared Euclidean distance between the center points of the real frame and the prediction frame, d is the diagonal distance of the smallest frame containing the real frame and the prediction frame, and (gw, gh) and (pw, ph) are the width and height of the real frame and the prediction frame, respectively.
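The quantities named above match the standard CIoU definition; the following is a small Python sketch under that assumption, with boxes given as (cx, cy, w, h) and the aspect-ratio trade-off weight taken as ν/((1 − IoU) + ν).

```python
import math

def ciou_error(a, b):
    """CIoU error between a ground-truth box a and a predicted box b,
    each given as (cx, cy, w, h)."""
    (ax, ay, gw, gh), (bx, by, pw, ph) = a, b

    # intersection-over-union
    ix = max(0.0, min(ax + gw / 2, bx + pw / 2) - max(ax - gw / 2, bx - pw / 2))
    iy = max(0.0, min(ay + gh / 2, by + ph / 2) - max(ay - gh / 2, by - ph / 2))
    inter = ix * iy
    iou = inter / (gw * gh + pw * ph - inter + 1e-9)

    # squared centre distance and squared diagonal of the smallest enclosing box
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    cw = max(ax + gw / 2, bx + pw / 2) - min(ax - gw / 2, bx - pw / 2)
    ch = max(ay + gh / 2, by + ph / 2) - min(ay - gh / 2, by - ph / 2)
    d2 = cw ** 2 + ch ** 2 + 1e-9

    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    beta = v / (1 - iou + v + 1e-9)

    return 1 - iou + rho2 / d2 + beta * v
```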
3. The scene rendering method for fine-grained semantic manipulation according to claim 2, wherein a depth separable convolution network is used to replace the 3×3 standard convolution layers with stride 1 in the CSPNet modules; the depth separable convolution consists of a depth-wise convolution and a point-wise convolution; the depth-wise convolution processes the input feature map with convolution kernels of size k×k, the number of convolution kernels being kept consistent with the number of feature map channels c_1; the point-wise convolution uses c_2 convolution kernels of size 1×1 to integrate the results of the depth-wise convolution and change the number of channels of the final output feature map.
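A minimal PyTorch sketch of the depth-separable replacement described above; the class and layer names are illustrative.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A k x k depth-wise convolution (one kernel per input channel, groups = c1)
    followed by a 1 x 1 point-wise convolution with c2 kernels that mixes channels
    and sets the output channel count."""
    def __init__(self, c1, c2, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c1, c1, kernel_size=k, padding=k // 2, groups=c1)
        self.pointwise = nn.Conv2d(c1, c2, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```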
4. The scene rendering method for fine-grained semantic manipulation according to claim 2, wherein in the SPP network, the input feature map with c channels is fed into the optimized network in two parts, and one part of the features is spliced with the multi-scale features through a skip connection; in the PANet network, two consecutive residual blocks are wrapped by a skip connection to replace the 5 consecutive convolution layers in the original network.
5. The scene rendering method for fine-grained semantic manipulation according to claim 1, wherein in step S2, the semantic understanding unit uses an open-source dependency grammar parser to obtain the grammar tree of the requirement text, and then represents the features of the nodes in the tree with a bidirectional tree-shaped LSTM neural network; the bidirectional tree-shaped LSTM neural network can capture the dependency relations among the nodes in the tree and encodes the contextual semantic information of parent and child nodes; specifically, for a node t in the tree, the unit inputs the representation vector W_emb·e_t of the corresponding word q_t into the bidirectional tree-shaped LSTM network and finally obtains the feature code h_t of the node by concatenating the bottom-up and top-down outputs of the tree-shaped LSTM network, where W_emb is a trainable word embedding matrix and e_t is the one-hot encoding of the corresponding word; the average of all node feature codes in the tree is finally taken as the global feature code q of the requirement text, and q is mapped into the corresponding manipulation instruction code op = W_op·q + b_op, where W_op and b_op are a trainable weight and bias term, respectively.
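The last steps above (concatenating the two tree-LSTM directions, averaging the node codes into the global text code q, and mapping q to the manipulation instruction op) could look like the following PyTorch sketch; the tree-LSTM states themselves are assumed to be given, and the class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class InstructionHead(nn.Module):
    """Concatenate the bottom-up and top-down tree-LSTM states of each node,
    average all node codes to get the global text code q, and map q to the
    manipulation instruction op with a trainable linear layer (W_op, b_op)."""
    def __init__(self, d_q, d_v):
        super().__init__()
        self.to_op = nn.Linear(d_q, d_v)        # W_op, b_op

    def forward(self, h_up, h_down):
        # h_up, h_down: (m, d_q // 2) bottom-up and top-down states per node
        H = torch.cat([h_up, h_down], dim=-1)   # per-node feature codes h_t
        q = H.mean(dim=0)                       # global feature code of the text
        op = self.to_op(q)                      # manipulation instruction code
        return H, op
```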
6. The scene rendering method for fine-grained semantic manipulation according to claim 2, characterized in that in step S3, the positioning reasoning unit comprises a scoring module, an integration module and a relation reasoning module, each module updating at the corresponding node t the temporary localization score s_t(v_i) of each region of the scene graph, wherein:
scoring module: this module is used to evaluate the similarity between the feature code of each scene graph region and the feature code of the target node in the tree, and is only assembled at leaf nodes and the root node; first, the similarity S_vh(v_i, h_t) = fc(L2norm(fc(v_i) ⊙ h_t)) between the feature code v_i of each region and the node feature code h_t is calculated, where fc is a fully connected layer; if the module is at a leaf node, the calculated similarity is directly used as the temporary localization score s_t(v_i) and passed to the module assembled at the parent node; if at the root node, the similarity is added to the temporary scores passed up by the child nodes for that region, S(v_i, H) = S_vh(v_i, h_t) + Σ_{t'∈N_t} s_{t'}(v_i), as the final localization score, where N_t is the set of child nodes under node t;
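The scoring module's similarity S_vh(v_i, h_t) = fc(L2norm(fc(v_i) ⊙ h_t)) can be written as a small PyTorch module; the hidden sizes and layer names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ScoringModule(nn.Module):
    """Score each region feature code against the feature code of a tree node."""
    def __init__(self, d_v, d_q):
        super().__init__()
        self.fc_in = nn.Linear(d_v, d_q)
        self.fc_out = nn.Linear(d_q, 1)

    def forward(self, V, h_t):
        # V: (M, d_v) region codes, h_t: (d_q,) node code
        fused = self.fc_in(V) * h_t                 # element-wise product per region
        fused = F.normalize(fused, p=2, dim=-1)     # L2 normalisation
        return self.fc_out(fused).squeeze(-1)       # temporary score per region
```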
integration module: as shown in the following formula, this module integrates the temporary localization scores passed by the child nodes to node t and transmits the result to the parent node, s_t(v_i) = Σ_{t'∈N_t} s_{t'}(v_i);
and a relation reasoning module: the module performs compound machine reasoning according to the relation words between two grammar components in the demand text, as follows:
this module first aggregates the region feature codes according to the similarity between each region feature code and the node feature code, v_ctx = Σ_i softmax_i(S_vh(v_i, h_t))·v_i, to obtain the context region feature code v_ctx; this feature characterizes the region information of the scene graph located according to the textual relation; this context region feature code is then used to calculate the temporary localization score of each region, as shown in the following formula:
s_t(v_i) = fc(L2norm(fc(v_i) ⊙ v_ctx))

where v_i denotes the feature code of each visual region and h_t denotes the feature code of the node in the syntax tree; L2norm denotes the L2 norm, that is, the square root of the sum of squares of the elements of the vector. Making the L2 regularization term ‖W‖_2 as small as possible drives every element of W to be very small, close to but not equal to 0, so the obtained model has strong anti-interference capability. The Softmax function maps the outputs of multiple neurons into the (0, 1) interval to support multi-class prediction.
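A hedged PyTorch sketch of the relation reasoning module: the region-node similarities are turned into softmax attention weights over the regions, aggregated into a context region code, and each region is then scored against that context with the same fc(L2norm(fc(·) ⊙ ·)) pattern as the scoring module; layer sizes and names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoningModule(nn.Module):
    """Aggregate region codes into a context code via attention, then score
    every region against that context code."""
    def __init__(self, d_v, d_q):
        super().__init__()
        self.fc_sim = nn.Linear(d_v, d_q)     # region-node similarity branch
        self.fc_sim_out = nn.Linear(d_q, 1)
        self.fc_ctx = nn.Linear(d_v, d_v)     # region-context scoring branch
        self.fc_ctx_out = nn.Linear(d_v, 1)

    def forward(self, V, h_t):
        # similarity of every region code to the node code h_t
        sims = self.fc_sim_out(F.normalize(self.fc_sim(V) * h_t, dim=-1)).squeeze(-1)
        attn = F.softmax(sims, dim=0)                    # weights in (0, 1)
        v_ctx = (attn.unsqueeze(-1) * V).sum(dim=0)      # context region code
        # temporary score of each region against the context code
        fused = F.normalize(self.fc_ctx(V) * v_ctx, dim=-1)
        return self.fc_ctx_out(fused).squeeze(-1)
```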
the scoring module, the integration module and the relation reasoning module are assembled onto each node in the dependency syntax tree according to the following rule: the scoring module is first assembled at all leaf nodes and the root node; for the other nodes, whether to assemble the integration module or the relation reasoning module is determined according to the node feature codes, as shown in the following formula; finally, the unit selects the region l with the highest localization score at the root node module as the region to be edited:
EditArea ← argmax softmax(fc(h_t)).
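Putting the modules together, the bottom-up evaluation over the dependency syntax tree and the selection of the edit region at the root can be sketched as a short recursion; the node and module attribute names are illustrative.

```python
def localization_scores(node, V):
    """Evaluate the tree-modular network bottom-up: every node's assembled
    module (scoring, integration or relation reasoning) turns its children's
    temporary scores into this node's temporary scores over the M regions."""
    child_scores = [localization_scores(child, V) for child in node.children]
    return node.module(V, child_scores)

# the region with the highest score at the root is the region to be edited:
# scores = localization_scores(root, V)      # shape (M,)
# edit_region = int(scores.argmax())
```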
7. The fine-grained semantically manipulated scene rendering method according to claim 1, wherein in step S4, the content rendering unit takes a GAN network as the generator, and a regularizer is designed to constrain the model, when training the GAN network, to modify only the visual information of the target region without affecting other irrelevant regions; and through adversarial learning the compatibility between the edited region and its adjacent regions is maximized.
8. The fine-grained semantically manipulated scene rendering method according to claim 7, wherein for the scene region feature code set V = {v_1, …, v_M}, a sample z_1 is drawn from a standard Gaussian distribution N(0, I) to perturb the feature code v_l of the region to be edited, and a sample z_2 is drawn to perturb the feature codes of the other regions; correspondingly, the GAN network outputs two perturbed scene images, Î_1 and Î_2; on the one hand, perturbing the feature code of the region to be edited should not affect the information of the other regions, as shown in the following formula,
L_1 = ‖M_1 ⊙ (Î_1 − Î)‖

where M_1 is a binary mask covering the region to be edited, so that attention is paid to changes in the other regions; on the other hand, perturbing the feature codes of the other regions should not affect the information of the region to be edited, as shown in the following formula,
L_2 = ‖M_2 ⊙ (Î_2 − Î)‖

where M_2 is a binary mask covering the other regions, so that attention is paid to changes in the region to be edited; finally, as shown below, the training objective of the GAN network is the sum of the perturbation (interference) generation loss L_1 + L_2 and the adversarial generation loss L_gen, where λ_1 and λ_2 are preset parameters,

L = λ_1·(L_1 + L_2) + λ_2·L_gen.
9. A fine-grained semantically manipulated scene rendering device, characterized by comprising:
a target detection unit, used for positioning each entity object in the scene graph I, encoding the visual information of the region where each entity is located, and outputting a region feature code set V = {v_1, …, v_M}, where M is the number of detected regions; the element v_i ∈ R^{d_v} in V describes the visual information of the entity object in the i-th region and the entity category to which it belongs, where d_v is the region feature encoding dimension;
a semantic understanding unit, used for converting the requirement text Q = {q_1, …, q_m} into a dependency syntax tree with a grammar parser; the feature set of the corresponding tree nodes is H = {h_1, …, h_m}, which describes the fine-grained semantics of the requirement text; on this basis, the unit computes the global feature code q of the requirement text and understands the user intent to generate the corresponding manipulation instruction code op, where m is the requirement text length and d_q is the requirement text feature encoding dimension; the dimension of op is consistent with the region feature encoding dimension d_v output by the scene graph target detection unit, and the dependency syntax tree node feature set H is used as reference information to assist the downstream positioning unit in accurately locating the region to be updated;
a positioning reasoning unit, used for locating the region to be edited according to the region feature code set V = {v_1, …, v_M} from the target detection unit and the dependency syntax tree node feature code set H of the requirement text from the semantic understanding unit, calculating for each image region i the localization score S(v_i, H) and determining the region with the highest score as the editing region,
the positioning reasoning unit adopts a tree-shaped modularized network to normalize the visual localization process onto the dependency syntax tree; a neural module network for calculating temporary localization scores of regions is assembled at each node in the tree, the final localization score of each region is obtained by integrating the temporary scores from bottom to top, the region l with the highest localization score is selected as the region to be edited, and the feature code of the region to be edited l is v_l;
a content rendering unit, used for modifying the feature code v_l of the region to be edited l according to the editing operation op identified from the text, and inputting the modified feature into a generator for rendering; the content rendering unit takes a GAN network as the generator and designs a regularizer to train the generator, namely, it adds the manipulation instruction code op to the feature code v_l of the region to be edited and inputs the modified region feature code set into the GAN network to output the rendered scene Î = G({v_1, …, v_l + α·op, …, v_M}), where α is a preset parameter.
10. The fine-grained semantically manipulated scene rendering device according to claim 9, wherein:
the target detection unit comprises a trunk feature extraction network, a high-order feature modeling network and a target detector; the trunk feature extraction network is formed by combining a CSPNet network and a DarkNet53 fully convolutional neural network; the DarkNet53 fully convolutional neural network comprises 53 convolution layers, each followed by a batch regularization layer and an activation layer, has no pooling layer, and uses convolution layers with stride 2 instead of pooling layers to downsample the feature map; it also comprises 5 CSPNet modules, and depth separable convolution is introduced to optimize the residual modules in the CSPNet structure; the high-order feature modeling network consists of an SPP network and a PANet network; the target detector consists of three YOLO heads;
the semantic understanding unit uses an open-source dependency grammar parser to obtain the grammar tree of the requirement text, and then represents the features of the nodes in the tree with a bidirectional tree-shaped LSTM neural network;
the positioning reasoning unit comprises a scoring module, an integration module and a relation reasoning module; the scoring module is used for evaluating the similarity between the feature code of each scene graph region and the feature code of the target node in the tree, and is only assembled at leaf nodes and the root node; the integration module is used for integrating the temporary localization scores passed to node t by its child nodes and transmitting the result to the parent node; the relation reasoning module is used for performing compound machine reasoning according to the relation word between two grammar components in the requirement text.
CN202310346851.0A 2023-03-30 2023-04-03 Scene rendering method and device for fine-grained semantic control Pending CN116563423A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310330199 2023-03-30
CN2023103301993 2023-03-30

Publications (1)

Publication Number Publication Date
CN116563423A true CN116563423A (en) 2023-08-08

Family

ID=87502729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310346851.0A Pending CN116563423A (en) 2023-03-30 2023-04-03 Scene rendering method and device for fine-grained semantic control

Country Status (1)

Country Link
CN (1) CN116563423A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 Multi-mode scene generation method based on relation and style perception
CN117496025B (en) * 2023-10-19 2024-06-04 四川大学 Multi-mode scene generation method based on relation and style perception
CN117593304A (en) * 2024-01-19 2024-02-23 山东山科数字经济研究院有限公司 Semi-supervised industrial product surface defect detection method based on cross local global features
CN117593304B (en) * 2024-01-19 2024-04-26 山东山科数字经济研究院有限公司 Semi-supervised industrial product surface defect detection method based on cross local global features

Similar Documents

Publication Publication Date Title
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
US11386303B2 (en) Procedural language and content generation environment for use in augmented reality/mixed reality systems to support laboratory and related operations
Ghosh et al. Synthesis of compositional animations from textual descriptions
Zhan et al. Multimodal image synthesis and editing: A survey and taxonomy
Frolov et al. Adversarial text-to-image synthesis: A review
Uppal et al. Multimodal research in vision and language: A review of current and emerging trends
CN116563423A (en) Scene rendering method and device for fine-grained semantic control
CN110377686A (en) A kind of address information Feature Extraction Method based on deep neural network model
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
CN108986186A (en) The method and system of text conversion video
CN109544524A (en) A kind of more attribute image aesthetic evaluation systems based on attention mechanism
CN111488734A (en) Emotional feature representation learning system and method based on global interaction and syntactic dependency
CN112734881B (en) Text synthesized image method and system based on saliency scene graph analysis
CN105844292A (en) Image scene labeling method based on conditional random field and secondary dictionary study
CN113779220B (en) Mongolian multi-hop question-answering method based on three-channel cognitive map and graph annotating semantic network
CN115797606B (en) 3D virtual digital human interaction action generation method and system based on deep learning
CN111985205A (en) Aspect level emotion classification model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Park et al. Visual language navigation: A survey and open challenges
Parente et al. Integration of convolutional and adversarial networks into building design: A review
Zhan et al. Multimodal image synthesis and editing: A survey
CN116563422A (en) Fine granularity editing method for text image style
CN116415021A (en) Appearance patent image retrieval method and system based on customizable semantics
Orhei Urban landmark detection using computer vision
Zermatten et al. Text as a richer source of supervision in semantic segmentation tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination