CN116542859B - Intelligent generation method of building structure column image thumbnail for intelligent construction - Google Patents


Info

Publication number
CN116542859B
CN116542859B (application CN202310825541.7A)
Authority
CN
China
Prior art keywords
thumbnail
depth
grid
difference
thumbnail images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310825541.7A
Other languages
Chinese (zh)
Other versions
CN116542859A (en)
Inventor
陈世宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Shipbuilding Technology
Original Assignee
Wuhan Institute of Shipbuilding Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Shipbuilding Technology filed Critical Wuhan Institute of Shipbuilding Technology
Priority to CN202310825541.7A
Publication of CN116542859A
Application granted
Publication of CN116542859B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/10: Geometric CAD
    • G06F30/13: Architectural design, e.g. computer-aided architectural design [CAAD] related to design of buildings, bridges, landscapes, production plants or roads
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/10: Segmentation; Edge detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/60: Analysis of geometric attributes
    • G06T7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/90: Determination of colour characteristics
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10004: Still image; Photographic image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Civil Engineering (AREA)
  • Architecture (AREA)
  • Structural Engineering (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of image processing, in particular to an intelligent generation method of building structure column image thumbnails for intelligent construction. The method inputs the obtained thumbnail image of a building structure column, the corresponding depth map and the normal map into a thumbnail generation network, which outputs the corresponding optimal thumbnail image. The training process of the thumbnail generation network is as follows: determine view descriptors by combining the normal map, the depth map and the thumbnail images under each color channel; construct a semantic descriptor from the thumbnail image; construct a depth descriptor from the depth map; screen out a subset of the thumbnail images, by combining the semantic descriptor, the depth descriptor and the view descriptor, to generate a training set; and train the thumbnail generation network according to the view descriptor, the depth descriptor, the semantic descriptor and the training set. The invention improves the accuracy of thumbnail generation for building structural columns.

Description

Intelligent generation method of building structure column image thumbnail for intelligent construction
Technical Field
The invention relates to the technical field of image processing, in particular to an intelligent generation method of a building structure column image thumbnail for intelligent construction.
Background
In the field of building design and construction, Revit is a specialized Building Information Modeling (BIM) software for creating and managing building information models. However, when using Revit for building design and construction, manual screenshots are often required to view the appearance of structural columns, a process that is time-consuming, laborious and inefficient. Because Revit's rendering style is uniform, a designer or constructor who wants to inspect the appearance of several structural columns must identify each column from the information in the drawing, which makes the process even slower and more tedious, so the thumbnails cannot serve their purpose well.
Today, screenshots are commonly used directly as thumbnails of structural columns. The problem with such screenshots is their low accuracy: visual perception is limited, and many interfering factors arise during manual capture, so it is difficult to guarantee that the screenshot of every structural column is accurate. An inaccurate structural screenshot can cause errors or delays in the design or construction process and bring extra cost and pressure to the project.
Disclosure of Invention
In order to solve the technical problem of low accuracy when a screenshot is used directly as a thumbnail, the invention provides an intelligent generation method for building structure column image thumbnails for intelligent construction. The adopted technical scheme is as follows:
Acquiring a thumbnail image, a corresponding depth map and a normal map of a building structure column; inputting the thumbnail images, the corresponding depth maps and the normal maps into a thumbnail generation network to output corresponding optimal thumbnail images;
the training process of the thumbnail generation network is as follows:
dividing the thumbnail image, the corresponding depth map and the corresponding normal map into a plurality of grids according to the same proportion; determining a view descriptor by combining the change characteristics of the normal values of the pixel points in each grid of the normal map, the depth values of the pixel points in the corresponding grid of the depth map, and the contrast of the corresponding grid of the thumbnail image under each color channel; segmenting the thumbnail image to obtain at least two segmentation categories, and constructing a semantic descriptor from the area ratio of each segmentation category; constructing a depth descriptor from the depth map, and constructing a discrimination index by combining the semantic descriptor and the depth descriptor; screening out a subset of the thumbnail images according to the differences in discrimination index, semantic descriptor and view descriptor between thumbnail images, to generate a training set; and training the thumbnail generation network according to the view descriptor, the depth descriptor, the semantic descriptor and the training set.
Preferably, the determining the view descriptor by combining the change feature of the normal value corresponding to each pixel point in the grid in the normal map, the depth value corresponding to each pixel point in the corresponding grid in the depth map, and the contrast of the corresponding grid in the thumbnail image under each color channel includes:
combining the change characteristics of the normal values corresponding to the pixel points in each grid in the normal map and the depth values corresponding to the pixel points in the corresponding grid in the depth map to obtain index values of each grid; constructing index value vectors by index values of all grids in the normal map;
obtaining a joint value of each grid under each color channel according to the contrast of each grid in the thumbnail image under each color channel and the contrast of the corresponding grid in the depth map; constructing a joint value vector corresponding to each color channel by the joint values of all grids under each color channel;
and constructing a view descriptor by the index value vector and the joint value vector under each color channel.
Preferably, the step of obtaining the index value of each grid by combining the change feature of the normal value corresponding to each pixel point in each grid in the normal map and the depth value corresponding to each pixel point in the corresponding grid in the depth map includes:
Obtaining kurtosis of each grid according to the change characteristics of the normal values corresponding to the pixel points of each grid in the normal map; taking the product of the kurtosis of each grid and the average value of the depth values corresponding to the pixels in the depth map of the corresponding grid as an index value of each grid.
Preferably, the obtaining the joint value of each grid under each color channel according to the contrast of each grid in the thumbnail image under each color channel and the contrast of the corresponding grid in the depth map includes:
selecting any color channel as a target color channel, selecting any grid in the thumbnail image as a target structure grid, and taking the grid of the target structure grid at the corresponding position in the depth map as a target depth grid;
calculating the Michelson contrast of the color channel values of the pixel points of the target structure grid in the target color channel, and taking it as the color contrast; calculating the Michelson contrast of the depth values of the pixel points of the target depth grid in the depth map, and taking it as the depth contrast;
and taking the product of the color contrast corresponding to the target structure grid and the depth contrast corresponding to the target depth grid as the joint value of the target structure grid under the target color channel.
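A sketch of the joint-value computation described above; the Michelson contrast is (max - min) / (max + min), and the function names are illustrative:

```python
import numpy as np

def michelson_contrast(values):
    """Michelson contrast of a grid: (max - min) / (max + min)."""
    v = np.asarray(values, dtype=float)
    vmax, vmin = v.max(), v.min()
    return 0.0 if vmax + vmin == 0 else (vmax - vmin) / (vmax + vmin)

def joint_value(color_grid, depth_grid):
    """Joint value of a grid under one color channel: the product of
    its color contrast and the corresponding depth-grid contrast."""
    return michelson_contrast(color_grid) * michelson_contrast(depth_grid)
```

Computed over all grids of a channel, the joint values form the joint value vector for that channel.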
Preferably, the segmenting of the thumbnail image to obtain at least two segmentation categories and the constructing of the semantic descriptor from the area ratio of each segmentation category include:
performing semantic segmentation on the thumbnail image with a semantic segmentation network to obtain at least two segmentation categories; obtaining the area ratio of each segmentation category, applying range (min-max) normalization to the area ratios to obtain the ratio feature value of each segmentation category, and constructing the corresponding semantic descriptor from the ratio feature values of all the segmentation categories.
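A sketch of the semantic descriptor construction, assuming the range standardization is min-max normalization of the per-category area ratios:

```python
import numpy as np

def semantic_descriptor(category_areas):
    """Ratio feature values of the segmentation categories: the area
    ratios min-max normalized over all categories of one image."""
    a = np.asarray(category_areas, dtype=float)
    ratios = a / a.sum()                   # area ratio per category
    span = ratios.max() - ratios.min()
    if span == 0:                          # all categories equal-sized
        return np.zeros_like(ratios)
    return (ratios - ratios.min()) / span
```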
Preferably, the constructing a depth descriptor according to the depth map includes:
taking the pixel point in the segmentation category with the largest area as a main body pixel point, and mapping the main body pixel point into a depth map to obtain a depth main body pixel point in the depth map;
and constructing a depth descriptor according to the occurrence frequency of the depth value of each depth main body pixel point in the depth map.
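A sketch of the depth descriptor; the description only states that it is built from the occurrence frequencies of the subject pixels' depth values, so the histogram bin count and the [0, 1] depth range here are assumptions:

```python
import numpy as np

def depth_descriptor(depth_map, subject_mask, bins=16):
    """Normalized frequency histogram of the depth values at the
    subject (largest segmentation category) pixel positions."""
    depths = np.asarray(depth_map, float)[np.asarray(subject_mask, bool)]
    hist, _ = np.histogram(depths, bins=bins, range=(0.0, 1.0))
    total = hist.sum()
    return hist / total if total else hist.astype(float)
```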
Preferably, the screening out of part of the thumbnail images according to the difference of the discrimination indexes, the difference of the semantic descriptors and the difference of the view descriptors to generate the training set includes:
determining local abnormality factors of the thumbnail images according to the difference of the discrimination indexes between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors; and screening out part of the thumbnail images according to the local abnormal factors to generate a training set.
Preferably, the determining the local abnormality factor of each thumbnail image according to the difference of the discrimination index between thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors includes:
determining the reachable distance between the thumbnail images according to the difference of the discrimination indexes between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors; and determining the local abnormality factor corresponding to each thumbnail image according to the reachable distances between the thumbnail images.
Preferably, the determining the reachable distance between the thumbnail images according to the difference of the discrimination index between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors includes:
for any two thumbnail images, calculating the similarity of their discrimination indexes as an initial similarity, and applying a negative-correlation mapping to the initial similarity to obtain the corresponding discrimination difference; calculating the difference of the semantic descriptors between the two thumbnail images as a semantic distance; calculating the difference of the view descriptors between the two thumbnail images as a view distance; and taking the product of the discrimination difference, the semantic distance and the view distance as the reachable distance between the two thumbnail images.
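A sketch of the reachable-distance computation; cosine similarity for the discrimination indexes, 1 - similarity as the negative-correlation mapping, and Euclidean distances for the descriptor differences are assumptions, since the claim does not fix the concrete measures:

```python
import numpy as np

def reachable_distance(disc_a, disc_b, sem_a, sem_b, view_a, view_b):
    """Reachable distance between two thumbnails: the product of the
    discrimination difference, semantic distance and view distance."""
    da, db = np.asarray(disc_a, float), np.asarray(disc_b, float)
    sim = da @ db / (np.linalg.norm(da) * np.linalg.norm(db))
    disc_diff = 1.0 - sim  # negative-correlation mapping of similarity
    sem_dist = np.linalg.norm(np.asarray(sem_a, float) - np.asarray(sem_b, float))
    view_dist = np.linalg.norm(np.asarray(view_a, float) - np.asarray(view_b, float))
    return disc_diff * sem_dist * view_dist
```

Identical discrimination indexes make the discrimination difference, and hence the whole product, vanish, matching the intent that similar images end up close together.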
Preferably, the filtering the partial thumbnail images according to the local anomaly factors to generate a training set includes:
and sorting the thumbnail images in descending order of their local abnormality factors to construct a local abnormality factor sequence, and taking the thumbnail images corresponding to the first half of the sequence as the images in the training set of the thumbnail generation network.
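The screening step can be sketched as follows; computing the local abnormality (outlier) factors themselves, e.g. with the standard LOF algorithm over the reachable distances, is assumed to happen beforehand:

```python
def select_training_set(thumbnails, anomaly_factors):
    """Keep the half of the thumbnails with the largest local
    abnormality factors, i.e. the most mutually distinct images."""
    order = sorted(range(len(thumbnails)),
                   key=lambda i: anomaly_factors[i], reverse=True)
    return [thumbnails[i] for i in order[:len(thumbnails) // 2]]

print(select_training_set(['a', 'b', 'c', 'd'], [0.1, 2.0, 0.5, 1.5]))
# prints ['b', 'd']
```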
The embodiment of the invention has at least the following beneficial effects:
The method inputs the obtained thumbnail images of the building structure column, the corresponding depth maps and the normal maps into the thumbnail generation network, which outputs the corresponding optimal thumbnail image. Obtaining the optimal thumbnail via a deep learning network shortens judgment time and improves the accuracy of thumbnail generation for building structural columns. The training process of the thumbnail generation network is as follows. The view descriptor is determined by combining the features of the pixel points within grids of the thumbnail image, the normal map and the depth map; because the complexity of the background structure of the building structure column is an important indicator of the amount of background information, the column is analyzed jointly through the depth map and the normal map, and the pixel features under each color channel of the thumbnail image are further analyzed to improve the accuracy of the view description. Since elements other than the building structural column may be present in the image, the thumbnail image is segmented into at least two categories, and a semantic descriptor is constructed from the area ratio of each category; the semantic descriptor thus also reflects the area proportions of the objects in the thumbnail image. Since the human eye understands depth relationships three-dimensionally when viewing a bearing column, describing the depth map features with respect to the main subject effectively captures the most salient depth information a viewer perceives, so the depth descriptor is constructed from the depth map.
The semantic descriptor and the depth descriptor are combined to construct a discrimination index: the more the content of two images differs, the larger the difference between their discrimination indexes, and the more similar two images are, the smaller that difference. A subset of the thumbnail images is then screened out to generate the training set according to the differences in discrimination index, semantic descriptor and view descriptor between thumbnail images. When the images in the training set are diverse, the optimal thumbnails output by the thumbnail generation network better reflect the corresponding building structural columns; screening the training set from the candidate thumbnails therefore avoids the adverse effect of many similar images on the training process. The thumbnail generation network is trained according to the view descriptor, the depth descriptor, the semantic descriptor and the training set. By inputting the obtained thumbnail images, the corresponding depth maps and the normal maps into the trained thumbnail generation network to output the corresponding optimal thumbnail image, the invention improves the accuracy of thumbnail generation for building structural columns.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for intelligently generating a thumbnail image of a building structure column for intelligent construction according to an embodiment of the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve its intended aim, a detailed description of the specific implementation, structure, features and effects of the intelligent generation method for building structure column image thumbnails for intelligent construction is given below, in combination with the accompanying drawings and preferred embodiments. In the following description, different occurrences of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The embodiment of the invention provides a concrete implementation of the intelligent generation method for building structure column image thumbnails for intelligent construction, suitable for scenarios in which the optimal thumbnail of a building structure column image is to be generated. It addresses the technical problem of low accuracy when a screenshot is used directly as a thumbnail: the obtained thumbnail images, the corresponding depth maps and the normal maps are input into the trained thumbnail generation network, which outputs the corresponding optimal thumbnail image, improving the accuracy of thumbnail generation for building structural columns.
The following specifically describes a specific scheme of the intelligent generation method of the building structure column image thumbnail for intelligent construction, which is provided by the invention with reference to the accompanying drawings. The method comprises the following specific contents:
acquiring a thumbnail image, a corresponding depth map and a corresponding normal map of a building structure column; inputting the thumbnail images, the depth maps and the corresponding normal maps into a trained thumbnail generation network to output the corresponding optimal thumbnail images.
First, regarding the building structural column: in the embodiment of the invention, the subsequent analysis takes a bearing column as the building structural column. According to the camera height established for the thumbnail, rotary sampling is performed around the central axis of the bearing column to obtain a plurality of depth maps of the column. The rendering resolution of the thumbnail image, the corresponding depth map and the corresponding normal map of the building structural column is 768×512.
The camera height for the thumbnail is determined first. For an indoor bearing column 3-5 m high, the camera is placed at a height of 2 m in the design stage; in other embodiments the camera height can be determined according to the actual column height.
In a three-dimensional rendering environment, Revit can expose the depth map through a multimedia graphics API such as DirectX or another framework such as OpenGL. The specific method is as follows. First, the practitioner adjusts the camera position to face the bearing column for previewing. The practitioner then obtains the depth map from DirectX by injection, i.e. without modifying the Revit source code. This approach suits depth map acquisition when the source code cannot be accessed or the application cannot be recompiled, as with Revit. Using the Microsoft Detours library, the IDirect3DDevice9::SetRenderTarget function, which sets the render target and depth buffer, is intercepted. Callback code is injected into SetRenderTarget so that a reference to the depth buffer can be retrieved when it is set. Specifically: the surface of the depth buffer is obtained with IDirect3DDevice9::GetDepthStencilSurface; the surface is locked through the IDirect3DSurface9 interface and the contents of the depth buffer are copied into system memory; finally the surface is unlocked, the reference is released, and the depth image data is returned. By intercepting the DirectX API functions in this way and performing the additional processing, the depth map of the bearing column preview is obtained directly, yielding the depth map D of the overall picture of the bearing column relative to the other building elements and structures in the view.
Here the depth map is a gray-scale image in which each pixel holds a floating-point value corresponding to that pixel's depth, or distance. This value is typically stored in a single channel of the image, and its range depends on the implementation; for example, in OpenGL and DirectX depth maps can be normalized directly to [0,1]. Note that the distance unit represented by a pixel in the depth map need not be defined: the larger a pixel's value, the farther the corresponding distance.
The normal map corresponding to the building structure column is acquired by the same method as the depth map, based on one Revit view and the IDirect3DDevice9::SetRenderTarget function. Callback code is injected and a reference to the normal buffer is obtained when it is set. Specifically: the normal buffer surface is obtained with IDirect3DDevice9::GetRenderTarget; the surface of the normal buffer is processed with an API function provided by DirectX, such as D3DXComputeNormalMap, to obtain the normal image data; finally the normal image data is copied into system memory, the surface is unlocked and the reference is released.
Since the bearing column is a cylindrical body, it can be viewed from many angles; it is therefore necessary to adjust the camera view in Revit and perform rotation sampling around the central axis of the bearing column. Specifically: the practitioner obtains the camera object of the current view in the plug-in code. The view camera is first acquired using the GetCamera method of the Autodesk.Revit.UI.View class. The geometric information of the bearing column is acquired so that the camera can be moved near the column and its position and orientation adjusted; specifically, this geometry is obtained with the Autodesk.Revit.DB.FilteredElementCollector and Autodesk.Revit.DB.GeometryInstance classes. Since GeometryInstance is a public class describing the geometric features of an object, the central axis is obtained from it; the details are omitted in the embodiment of the invention.
Rotation and translation of the camera are performed with the Autodesk.Revit.DB.Transform class, realizing one positional movement of the camera at a time. Specifically, the camera is moved in a loop at a fixed angle using the SetCamera method of the Autodesk.Revit.UI.View class. For each movement, the normal map N under the corresponding view is obtained. The normal map is also a gray-scale image in which each pixel holds a floating-point value corresponding to the direction of the normal vector; this value is stored in the alpha channel of the image, typically in the range [0,1].
For one building structure column, i.e. one bearing column, a plurality of depth maps can thus be obtained by circling it. In this embodiment the surrounding angle is divided into 12 steps, so 12 normal maps N and 12 depth maps D are obtained.
Images are then generated by a latent diffusion model constrained by the normal and depth maps through ControlNet. The latent diffusion model is a generative model based on iterative diffusion steps that converts a noise signal into a high-quality image; it does so by performing many diffusion steps on the noise signal, in each step combining the current input with the previous output to reduce noise and enhance detail. Low-Rank Adaptation (LoRA) is a training technique for fine-tuning a latent diffusion model; when modifying a Stable Diffusion model, LoRA mainly makes small changes to the cross-attention layers, thereby controlling the generated texture style, so content with a uniform style can be obtained with a LoRA-based latent diffusion model. Thus, with the previously obtained normal map and depth map, ControlNet can control the generated content and spatial relationships, highlighting the features of the bearing column as seen from a given viewing angle. The basic idea of ControlNet is to add extra input conditions during image generation so as to better control the image properties: ControlNet combines the external conditions with the main image generation network by injecting them into different layers of the generator. The external conditions passed to ControlNet are automatically encoded into low-dimensional vectors, which then enter the corresponding layers of the generator network and adjust the image generation process. Specifically, with the normal map and depth map obtained earlier and the LoRA model designated by the practitioner loaded, images of the bearing column in a uniform style can be obtained even without fixed prompt words.
Thus, 12 thumbnail images of the building structural column are obtained.
Inputting the thumbnail images, the depth maps and the corresponding normal maps into a trained thumbnail generation network to output the corresponding optimal thumbnail images.
Referring to FIG. 1, a flowchart of training steps for a thumbnail generation network is shown. The training process of the thumbnail generation network is as follows:
step S100, dividing the thumbnail image, the corresponding depth map and the corresponding normal map into a plurality of grids according to the same proportion; and determining the view descriptor by combining the change characteristics of the normal values of the pixel points in each grid of the normal map, the depth values of the pixel points in the corresponding grids of the depth map, and the contrast of the grids in the thumbnail image under each color channel.
And constructing a view descriptor of the building structure column based on the thumbnail image, the corresponding depth map and the corresponding normal map of the building structure column.
According to the shooting paradigm of the bearing column, the column is the object closest to the camera. To distinguish the bearing column, the analysis must start from its background; the complexity of the background structure is an important indicator of the amount of background information, so the analysis combines the depth map and the normal map.
First, the thumbnail image, the corresponding depth map and the corresponding normal map are each divided into a plurality of grids according to the same scale. In the embodiment of the invention, the images are divided into 24×16 grids. Since the thumbnail image, the depth map and the normal map all have the same size, dividing them with grids of the same size places the grids at identical positions in all three images. For example, for any grid c in the thumbnail image, the grid of the same position and size in the corresponding depth map is taken as the corresponding grid of c in the depth map, and likewise the grid of the same position and size in the corresponding normal map is taken as the corresponding grid of c in the normal map.
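The grid division can be sketched as follows; the 768×512 resolution and the 24×16 division come from the embodiment, while the function name and the row-major grid order are illustrative assumptions:

```python
import numpy as np

def split_into_grids(image, rows=16, cols=24):
    """Split an (H, W) image into rows x cols equally sized grids,
    returned in row-major order (an illustrative ordering)."""
    h, w = image.shape[:2]
    assert h % rows == 0 and w % cols == 0, "image must divide evenly"
    gh, gw = h // rows, w // cols
    return [image[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            for r in range(rows) for c in range(cols)]

# A 768x512 rendering is stored as an (H, W) = (512, 768) array:
grids = split_into_grids(np.zeros((512, 768)))
print(len(grids), grids[0].shape)  # 384 grids of 32x32 pixels each
```

Because the thumbnail, depth map and normal map share the same resolution, grid i in one image corresponds positionally to grid i in the others.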
Further, the view descriptor is determined by combining the variation characteristics of the normal values of the pixel points in each grid of the normal map, the depth values of the pixel points of the corresponding grid in the depth map, and the contrast of the corresponding grid in the thumbnail image under each color channel. Specifically:
step one, combining the change characteristics of the normal values corresponding to the pixel points in each grid in the normal map and the depth values corresponding to the pixel points of the corresponding grid in the depth map to obtain index values of each grid; and constructing an index value vector from index values of all grids in the normal map.
The index value of each grid is obtained by combining the variation characteristics of the normal values of each grid's pixel points in the normal map with the depth values of the corresponding grid's pixel points in the depth map. Specifically: the kurtosis of each grid is obtained from the variation characteristics of the normal values of the grid's pixel points in the normal map; the product of the kurtosis of each grid and the mean of the depth values of the corresponding grid's pixels in the depth map is taken as the index value of that grid.
Kurtosis is an index used to describe the sharpness of the peak of a probability distribution; it measures the peakedness of a particular distribution relative to a standard normal distribution. A value of 0 indicates the same sharpness as a standard normal distribution, a positive kurtosis indicates that the distribution has a sharper peak than a standard normal distribution, and a negative kurtosis indicates that the distribution has a flatter peak than a standard normal distribution.
The kurtosis of each grid is calculated as:

$$K_a = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{4}}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}\right)^{2}} - 3$$

where $K_a$ is the kurtosis of the a-th grid; $n$ is the number of pixel points in the grid; $x_i$ is the normal value of the i-th pixel point in the grid; and $\bar{x}$ is the mean of the normal values of the pixel points in the grid.
A constant is subtracted in the formula because the raw definition of kurtosis contains a constant offset: a standard normal distribution has raw kurtosis 3, so subtracting 3 yields an excess kurtosis of 0 for it. The kurtosis reflects the sharpness of the peak; if the raw kurtosis is greater than 3 (i.e. the excess kurtosis is positive), the peak is sharper than that of a normal distribution. The method for obtaining kurtosis is well known to those skilled in the art and is not described in detail here.
After the kurtosis of each grid is obtained, taking the product of the kurtosis of each grid and the average value of the depth values corresponding to the pixels of the corresponding grid in the depth map as an index value of each grid.
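The index value computation above can be sketched as follows. This is an illustrative implementation of the stated formula (excess kurtosis of the normal values times the mean depth of the corresponding depth cell), not code from the patent.

```python
import numpy as np

def grid_kurtosis(values):
    """Excess kurtosis of the normal values in one grid cell (a standard
    normal distribution yields 0, matching the description above)."""
    x = np.asarray(values, dtype=float).ravel()
    mu = x.mean()
    m2 = ((x - mu) ** 2).mean()   # second central moment
    m4 = ((x - mu) ** 4).mean()   # fourth central moment
    return m4 / m2 ** 2 - 3.0

def grid_index_value(normal_cell, depth_cell):
    # Index value = kurtosis of the cell's normal values multiplied by
    # the mean depth value of the corresponding depth-map cell.
    return grid_kurtosis(normal_cell) * float(np.mean(depth_cell))
```

For example, the two-point values {-1, 1} give an excess kurtosis of -2 (a maximally flat distribution), so with a mean depth of 2 the index value is -4.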
Step two, obtaining a joint value of each grid under each color channel according to the contrast of each grid in the thumbnail image under each color channel and the contrast of the corresponding grid in the depth map; and constructing a joint value vector corresponding to each color channel by the joint values of all grids under each color channel.
Local contrast is analysed based on the thumbnail image and the depth map. Since the generated thumbnail image is an RGB image, in the embodiment of the present invention the analysis is performed separately on the red, green and blue channels of each grid. The Michelson contrast is used to describe the local contrast, and the joint value combines the contrast of the thumbnail image with that of the depth map.
And selecting any color channel as a target color channel, selecting any grid in the thumbnail image as a target structure grid, and taking the grid of the target structure grid at the corresponding position in the depth map as a target depth grid. And calculating the contrast of the color channel value corresponding to each pixel point in the target color channel by using a Michelson Contrast algorithm, and taking the contrast as the color contrast.
Taking the red channel as the target color channel and the i-th grid in the thumbnail image as the target structure grid, the color contrast of the color channel values of the target structure grid in the target color channel is:

$$C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}}$$

where $C$ is the color contrast of the color channel values of the target structure grid in the target color channel; $I_{\max}$ is the maximum color channel value of the target structure grid in the target color channel; and $I_{\min}$ is the minimum color channel value of the target structure grid in the target color channel. It should be noted that obtaining contrast with the Michelson contrast is a technique well known to those skilled in the art and is not described here.
And calculating the contrast of the depth value corresponding to each pixel point in the target depth grid in the depth map by using a Michelson Contrast algorithm, and taking the contrast as the depth contrast.
And taking the product of the color contrast of the color channel value corresponding to the target structure grid in the target color channel and the depth contrast corresponding to the target depth grid as the joint value of the target structure grid under the target color channel. And constructing a joint value vector corresponding to each color channel by the joint values of all grids under each color channel.
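The Michelson contrast and the joint value can be sketched as follows; this is an illustrative rendering of the two steps above, with a guard for an all-zero cell added as an assumption (the formula is undefined when the maximum and minimum sum to zero).

```python
import numpy as np

def michelson_contrast(values):
    """Michelson contrast (I_max - I_min) / (I_max + I_min) over one cell."""
    v = np.asarray(values, dtype=float)
    vmax, vmin = v.max(), v.min()
    denom = vmax + vmin
    return 0.0 if denom == 0 else (vmax - vmin) / denom

def joint_value(color_cell, depth_cell):
    # Joint value = colour contrast of the cell in one channel times the
    # depth contrast of the corresponding depth-map cell.
    return michelson_contrast(color_cell) * michelson_contrast(depth_cell)
```

For instance, channel values spanning 50 to 150 give a contrast of 100/200 = 0.5; paired with a depth cell of contrast 0.5, the joint value is 0.25.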
In the embodiment of the invention there are three color channels, so there are three corresponding joint value vectors.
And thirdly, a view descriptor is constructed from the index value vector and the joint value vectors under the color channels; that is, combining the index value vector with the joint value vector of each color channel yields the corresponding view descriptor. For example, if the index value vector is denoted Q1 and the three joint value vectors for the three color channels are denoted Q2, Q3 and Q4, the corresponding view descriptor is (Q1, Q2, Q3, Q4).
Step S200, dividing the thumbnail image to obtain at least two division categories, and constructing semantic descriptors according to the area ratio of each division category; and constructing a depth descriptor according to the depth map, and constructing a differential index by combining the semantic descriptor and the depth descriptor.
And analyzing and processing the thumbnail images of the building structural columns, and carrying out semantic segmentation on the thumbnail images by utilizing a semantic segmentation network to obtain at least two segmentation categories.
Because most latent diffusion models are trained mainly on common objects, the invention uses a semantic segmentation network trained on ADE20K to perform semantic segmentation on the thumbnail image of each view of the load-bearing column, thereby obtaining at least two segmentation categories.
On the ADE20K dataset, the labels of a typically trained semantic segmentation network cover the categories ranked highest by total pixel proportion. The semantic segmentation network used is Swin Transformer V2. In order to obtain stable semantic descriptors and avoid excessive segmentation ambiguity, the embodiment of the invention selects the top 50 segmentation categories for analysis; an implementer can further restrict or expand the segmentation categories according to the content generated by the LoRA model. In other words, a segmentation result over 50 categories is obtained for each thumbnail image corresponding to the load-bearing column.
The area ratio of each segmentation category is obtained and min-max (range) normalised to give the ratio feature value of each segmentation category, and the ratio feature values of all segmentation categories form the corresponding semantic descriptor. That is, the normalised ratio of the area of each segmentation category's region to the total area of the thumbnail image serves as one value of the semantic descriptor. Each segmentation category has one area ratio, namely the ratio of the area of that category to the area of the thumbnail image.
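The semantic descriptor construction can be sketched as follows. This is an illustrative sketch under the embodiment's 50-category assumption; classes absent from the image simply get an area ratio of 0.

```python
import numpy as np

def semantic_descriptor(seg_map, num_classes=50):
    """Area ratio of each segmentation class, min-max normalised."""
    seg = np.asarray(seg_map)
    # ratio of each class's pixel count to the total pixel count
    ratios = np.array([np.count_nonzero(seg == c) / seg.size
                       for c in range(num_classes)])
    rng = ratios.max() - ratios.min()
    return ratios if rng == 0 else (ratios - ratios.min()) / rng
```

For a 2×2 segmentation map with class 0 on half the pixels and classes 1 and 2 on a quarter each, the raw ratios (0.5, 0.25, 0.25) normalise to (1, 0, 0).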
Since the largest share of semantic content presents the subject of the thumbnail image, and the generated content is determined by the LoRA model, the embodiment of the invention uses a Stable Diffusion 1.5 model and applies a LoRA model of the XSarchiture series, particularly 7Modern interer; an implementer can substitute the LoRA model of any indoor scene, and can use any diffusion model and any LoRA model to generate a thumbnail image G related to indoor design.
After semantic segmentation, indoor elements such as sofas, cupboards, walls and windows are present. Therefore the category occupying the largest area of the picture is selected to determine the subject: the thumbnail image is semantically segmented by the semantic segmentation network, the segmentation category with the largest area ratio among the obtained categories is taken as the subject, and the subject is mapped into the depth map corresponding to the thumbnail image.
The depth map of the region where the subject is located is then analysed. Since the subject is a large-area element such as a window, wall or cabinet, the depth map information of the region where the subject is located can represent the structure and depth relationships of the subject in the thumbnail image.
Since the human eye specifically understands three-dimensional depth relationships when viewing load-bearing columns, describing depth map features based on the subject can effectively represent the most significant depth information a person perceives when viewing thumbnail images.
Therefore, the depth descriptor is constructed from the depth map corresponding to the thumbnail image as follows: the pixel points in the segmentation category with the largest area are taken as subject pixel points and mapped into the depth map to obtain the depth subject pixel points in the depth map; the depth descriptor is then constructed from the frequency of occurrence of the depth value of each depth subject pixel point in the depth map. In other words, the depth value of each subject pixel point's corresponding pixel in the depth map is obtained as the depth value of that subject pixel point, and the depth descriptor is constructed from the frequencies of occurrence of these depth values in the depth map. The frequency of occurrence lies in the range [0, 1]. The frequency of occurrence of the depth values in the depth map also reflects the proportion of each distance in the relative depth relationship, i.e. the proportion of each depth value in the depth relationship.
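The depth descriptor construction can be sketched as follows. The subject is the largest segmentation class, and the descriptor is the frequency of each depth value over the subject pixels; the 8-bit depth range (num_bins = 256) is an assumption for illustration, not stated in the text.

```python
import numpy as np

def depth_descriptor(depth_map, seg_map, num_bins=256):
    """Frequency in [0, 1] of each depth value over the subject pixels."""
    depth = np.asarray(depth_map)
    seg = np.asarray(seg_map)
    # the subject is the segmentation class covering the most pixels
    classes, counts = np.unique(seg, return_counts=True)
    subject_class = classes[np.argmax(counts)]
    subject_depths = depth[seg == subject_class]
    # histogram of integer depth values, normalised to frequencies
    hist = np.bincount(subject_depths.ravel(), minlength=num_bins)
    return hist / subject_depths.size
```

For a subject covering three pixels with depths (10, 10, 20), the descriptor has frequency 2/3 at depth 10 and 1/3 at depth 20.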
After the semantic descriptor of the thumbnail image and the depth descriptor of the depth map are obtained, the discrimination index is constructed by combining them. Specifically, the semantic descriptor and the depth descriptor corresponding to a thumbnail image and its depth map are combined, i.e. spliced, into a discrimination index. Each thumbnail image and its corresponding depth map have one corresponding discrimination index. For example, if the semantic descriptor of a thumbnail image is z1 and the depth descriptor of its depth map is z2, the discrimination index of that thumbnail image and depth map is (z1, z2).
Step S300, screening out part of thumbnail images according to the difference of the distinguishing degree indexes among the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors to generate a training set; the thumbnail generation network is trained according to the view descriptor, the depth descriptor, the semantic descriptor and the training set.
Screening out part of thumbnail images according to the difference of distinguishing degree indexes among the thumbnail images, the difference of semantic descriptors and the difference of view descriptors to generate a training set, and specifically: determining local abnormality factors of the thumbnail images according to the difference of the discrimination indexes between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors; and screening out part of the thumbnail images according to the local abnormal factors to generate a training set.
Determining local anomaly factors of the thumbnail images according to the difference of the discrimination indexes between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors, and specifically: the reachable distance between the thumbnail images is determined according to the difference of the discrimination index between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors.
The discrimination index is constructed from the semantic descriptor and the depth descriptor. For any thumbnail image, the larger the difference between its discrimination index and those of the other thumbnail images, the larger its local outlier factor (Local Outlier Factor, LOF), reflecting that its content is more distinctive and more specific among the large number of load-bearing-column thumbnail images. Among many thumbnail images of load-bearing columns, if all the images are very similar with no obvious differences, their scores under the LOF index will be very close and indistinguishable. Conversely, thumbnail images with specificity have higher LOF values because they differ more from the other thumbnail images and are thus more easily distinguished. In other words, when a thumbnail image differs greatly from the others, its local outlier factor is higher, which better represents its specificity.
Determining the reachable distance between the thumbnail images according to the difference of the distinguishing degree indexes between the thumbnail images, the difference of the semantic descriptors and the difference of the view descriptors, and particularly: for any two thumbnail images, calculating the similarity of the distinguishing index between the two thumbnail images, taking the similarity as initial similarity, and carrying out negative correlation mapping on the initial similarity to obtain corresponding distinguishing degree difference; calculating the difference of semantic descriptors between two thumbnail images to be used as a semantic distance; calculating the difference of view descriptors between two thumbnail images as a view distance; taking the product of the difference in discrimination, the semantic distance and the view distance as the reachable distance between two thumbnail images.
The reachable distance is calculated as:

$$D(a,b) = \bigl(1 - \mathrm{Similarity}(S_a, S_b)\bigr) \cdot \lVert H_a - H_b \rVert_2 \cdot \lVert V_a - V_b \rVert_2$$

where $D(a,b)$ is the reachable distance between any thumbnail image a and any other thumbnail image b; $S_a$ and $S_b$ are the discrimination indexes of thumbnail images a and b; $\mathrm{Similarity}$ is the cosine similarity function, and $\mathrm{Similarity}(S_a, S_b)$ is the initial similarity of a and b; $1 - \mathrm{Similarity}(S_a, S_b)$ is the corresponding discrimination difference, obtained by negative correlation mapping of the initial similarity; $\lVert \cdot \rVert_2$ is the L2 norm; $\lVert H_a - H_b \rVert_2$ is the semantic distance of a and b, with $H_a$ and $H_b$ their semantic descriptors; and $\lVert V_a - V_b \rVert_2$ is the view distance of a and b, with $V_a$ and $V_b$ their view descriptors.
The discrimination difference reflects the degree of difference between the two descriptors, namely the semantic descriptor H and the depth descriptor F. If the thumbnail image and depth map from which a semantic descriptor and a depth descriptor are computed come from different load-bearing columns, yet the descriptors are similar, the thumbnail image is unsuitable as a best thumbnail because it easily causes ambiguity. The more similar the discrimination indexes of two thumbnail images, the smaller the corresponding discrimination difference and the smaller the reachable distance; the larger the difference between the semantic descriptors of two thumbnail images, the larger the semantic distance and the larger the reachable distance; and the larger the difference between the view descriptors of two thumbnail images, the larger the view distance and the larger the reachable distance.
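The reachable-distance computation can be sketched as follows. Treating the negative correlation mapping as "1 minus the cosine similarity" is an assumption consistent with the description above, not a formula stated verbatim in the text.

```python
import numpy as np

def cosine_similarity(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0 or nv == 0 else float(u @ v / (nu * nv))

def reachable_distance(s_a, s_b, h_a, h_b, v_a, v_b):
    """Discrimination difference x semantic distance x view distance,
    for discrimination indexes s, semantic descriptors h, view descriptors v."""
    d_disc = 1.0 - cosine_similarity(s_a, s_b)
    d_sem = float(np.linalg.norm(np.asarray(h_a, float) - np.asarray(h_b, float)))
    d_view = float(np.linalg.norm(np.asarray(v_a, float) - np.asarray(v_b, float)))
    return d_disc * d_sem * d_view
```

Two thumbnails with identical discrimination indexes have a discrimination difference of 0 and hence a reachable distance of 0, regardless of their other distances.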
The larger the cosine similarity, the more similar the geometric features of the corresponding two thumbnail images. With a large number of thumbnail images of load-bearing columns, if all the thumbnail images are very similar, their scores will be very close and indistinguishable when analysed with the LOF index. Therefore, when analysing the features of the thumbnail images G of the load-bearing columns with the LOF algorithm, the view descriptors, depth descriptors and semantic descriptors corresponding to all thumbnail images of the load-bearing columns must be analysed to obtain the features of the thumbnail images.
After the reachable distance between the thumbnail images is obtained, the local abnormality factor corresponding to each thumbnail image is determined according to the reachable distance between the thumbnail images. Taking each thumbnail image as a sample, and calculating the k-neighborhood of each sample according to the reachable distance. For each sample, its local reachable density (Local Reachability Density, LRD) was calculated.
The local reachable density is calculated as:

$$\mathrm{LRD}(p) = \frac{|Q|}{\sum_{q \in Q} D(p, q)}$$

where $\mathrm{LRD}(p)$ is the local reachable density corresponding to thumbnail image p, i.e. to sample point p; q is any other sample in the k-neighborhood of sample point p, i.e. another thumbnail image in the k-neighborhood of thumbnail image p; $D(p, q)$ is the distance between sample points p and q; k is a preset neighborhood value; and Q is the set of sample points in the k-neighborhood of sample p. In the embodiment of the present invention, the preset neighborhood value k is half the number of thumbnail images surrounding the load-bearing column; since one load-bearing column in the embodiment has 12 corresponding thumbnail images, k is 6. In other embodiments this value can be adjusted by an implementer according to the actual situation.
It should be noted that, the calculation formula for calculating the local reachable density corresponding to each sample point according to the reachable distance is a well-known technique for those skilled in the art, and will not be described herein.
After the local reachable density corresponding to each sample point, i.e. each thumbnail image, is obtained, the local outlier factor (Local Outlier Factor, LOF) corresponding to each thumbnail image is calculated.
The local outlier factor is calculated as:

$$\mathrm{LOF}(p) = \frac{1}{N}\sum_{q \in Q} \frac{\mathrm{LRD}(q)}{\mathrm{LRD}(p)}$$

where $\mathrm{LOF}(p)$ is the local outlier factor corresponding to thumbnail image p, i.e. to sample point p; $\mathrm{LRD}(p)$ is the local reachable density of p; q is any other sample in the k-neighborhood of p, i.e. another thumbnail image in the k-neighborhood of thumbnail image p; k is a preset neighborhood value; Q is the set of sample points in the k-neighborhood of p; and N is the total number of sample points in the k-neighborhood of p. The calculated LOF values can be used to distinguish the more specific thumbnail images. The calculation of the local outlier factor requires a value of k to be specified in advance, and the choice of k has some influence on the result; in the embodiment of the invention k is 6, and in other embodiments the value can be adjusted by an implementer according to the actual situation. Moreover, the local outlier factor is sensitive to density changes in the sample distribution, so the value of k should be tuned for the specific scene. It should be noted that the formula for calculating the local outlier factor of each sample point from the local reachable densities is well known to those skilled in the art and is not described here.
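The LRD and LOF steps can be sketched together as follows. This is a simplified illustration: the reachable distance defined above is used directly as the neighbour distance, omitting the classic LOF reachability-distance refinement (the max of the k-distance and the direct distance).

```python
import numpy as np

def lof_scores(dist, k):
    """Local outlier factors from a symmetric pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # k nearest neighbours of each sample, excluding the sample itself
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]
    # local reachable density: neighbourhood size over summed distances
    lrd = np.array([k / dist[i, nbrs[i]].sum() for i in range(n)])
    # LOF: mean ratio of the neighbours' densities to the sample's own
    return np.array([lrd[nbrs[i]].mean() / lrd[i] for i in range(n)])
```

With three mutually close samples and one far-away sample, the far sample's LOF is well above 1 while the clustered samples score about 1, which is the behaviour the screening step relies on.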
After the local outlier factor of each thumbnail image is obtained, part of the thumbnail images are screened out according to the local outlier factors to generate the training set of the thumbnail generation network. Specifically, the thumbnail images are sorted from large to small by local outlier factor to construct a local outlier factor sequence, and the thumbnail images in the first half of the sequence are taken as the images in the training set of the thumbnail generation network. That is, all generated thumbnails are ranked by local outlier factor and a proportion is determined by the implementer to select the more specific thumbnails; in the embodiment of the invention the Top-50%, i.e. the top 50%, of the thumbnails are selected as the easily distinguishable load-bearing-column thumbnails. The selected thumbnail images are used for the subsequent training set of the thumbnail generation network, whose format is: the sample is the semantic descriptor and depth descriptor corresponding to the thumbnail image G, and the label is the thumbnail image G.
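The Top-50% screening step can be sketched as follows; the function and its names are illustrative, with the 50% ratio taken from the embodiment.

```python
def select_training_thumbnails(thumbs, lof_values, top_ratio=0.5):
    """Keep the top_ratio share of thumbnails with the largest local
    outlier factors, i.e. the most distinctive ones."""
    order = sorted(range(len(thumbs)), key=lambda i: lof_values[i],
                   reverse=True)
    keep = order[:max(1, int(len(thumbs) * top_ratio))]
    return [thumbs[i] for i in keep]
```

For example, with four thumbnails scoring (0.9, 2.0, 1.5, 0.1), the two with the highest LOF values are kept.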
A new LoRA model is trained with the view descriptor as a new word label, directly generating the optimal thumbnail image. LoRA is a technique widely used in the Transformer architecture and can be applied to self-attention modules and MLP modules. In the Stable Diffusion model, LoRA is used in the Cross-Attention layers, where the conditions and the image representation are associated. Using LoRA enables the Stable Diffusion model to achieve performance comparable to full model fine-tuning with less memory and less computational overhead. When an image and trigger keywords are used as inputs to the CLIP encoder of the diffusion model, if the keywords are present in the supervising prompt, an image can be generated that is similar in style, texture and information to the thumbnail content of the training set.
In order for the latent diffusion model to generate thumbnail images that conform to the load-bearing columns and are easily distinguishable, the LoRA weights of the diffusion model need to be trained so that, from the content of the view descriptor, the image generation model can obtain more easily distinguishable load-bearing columns through CLIP. Specifically:
firstly, word embedding is carried out on numerical values of the view descriptors, and the specific method comprises the following steps:
Since CLIP has already been trained and its word embedding relationships are fixed, the words of the view descriptor, the semantic descriptor and the depth descriptor can be obtained from the indexes of its vocabulary. Because the value ranges of the semantic descriptor and the depth descriptor are not fixed, the implementer needs to process them as follows to obtain the mapped words: first, the semantic descriptor and the depth descriptor are normalised and scaled to [0, 1000] respectively, which improves the diversity of the words they map to; then the two are concatenated along the dimension axis to obtain a higher-dimensional vector X.
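The scaling-and-concatenation step can be sketched as follows. Rounding the scaled values to integers so they can serve as vocabulary indexes is an assumption added for illustration (an embedding index must be an integer); the [0, 1000] range follows the text.

```python
import numpy as np

def descriptors_to_token_ids(semantic, depth, scale=1000):
    """Min-max scale each descriptor to [0, scale], round to integers
    and concatenate into the vector X used as vocabulary indexes."""
    def rescale(v):
        v = np.asarray(v, dtype=float)
        rng = v.max() - v.min()
        norm = np.zeros_like(v) if rng == 0 else (v - v.min()) / rng
        return np.round(norm * scale).astype(int)
    return np.concatenate([rescale(semantic), rescale(depth)])
```

Each integer in the result indexes a word in CLIP's vocabulary, giving the trigger keywords used to label the LoRA training samples.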
Words corresponding to each dimension of the higher-dimensional vector X, such as "a", "of", "model", "1girl", "extra", etc., are meaningless in themselves but serve as features that drive and constrain the generation of the LoRA model, determined from the vector X. The keywords labelled for LoRA training correspond to the individual values of X, so that after the Contrastive Language-Image Pretraining (CLIP) model processes the related text tokens, the content generated by the latent diffusion model is extremely close to the content contained in the thumbnail. The processed thumbnail images and the corresponding labels are placed in the same folder, and the path of the dataset is specified in the training code.
It should be noted that the LoRA fine-tuning procedure for Stable Diffusion 1.5 is well known, and only the key steps are described in the embodiment of the present invention. Specifically, the parameters of the LoRA model, such as the learning rate, batch size, number of epochs and model save path, are set in the training code. This embodiment uses a learning rate of 0.0011, a batch size of 12 and 30000 epochs. If an implementer has a higher-performance GPU, such as a V100, a larger batch size and more epochs may be used. The batch size is the number of samples selected before each parameter update. Furthermore, the implementer can obtain the best thumbnail conforming to the characteristics of the load-bearing column in the text-to-image generation task simply by providing the words mapped from X to the CLIP model; that is, in the subsequent thumbnail generation for a load-bearing column, only the thumbnail image, the corresponding normal map and the corresponding depth map need to be obtained and the corresponding depth descriptor and semantic descriptor constructed in order to drive the thumbnail generation network to generate the best thumbnail.
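The training parameters named above can be collected into a configuration such as the following. This is a hypothetical sketch mirroring the embodiment's stated values; the key names and paths are illustrative and not the API of any specific LoRA trainer.

```python
# Hypothetical LoRA training configuration for Stable Diffusion 1.5,
# using the learning rate, batch size and epoch count from the embodiment.
train_config = {
    "base_model": "stable-diffusion-1.5",
    "learning_rate": 0.0011,
    "batch_size": 12,             # samples per parameter update
    "epochs": 30000,
    "dataset_path": "./dataset",  # thumbnails and labels in one folder
    "output_path": "./lora_out",  # where the trained weights are saved
}
```

With a higher-performance GPU, the batch size and epoch count could be raised as the text suggests.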
In summary, the present invention relates to the field of image processing technology. The method inputs the obtained thumbnail image of the building structure column and the corresponding depth map and normal map into a thumbnail generation network to output the corresponding optimal thumbnail image. The training process of the thumbnail generation network is as follows: determining the view descriptor by combining the normal map, the depth map and the thumbnail image under each color channel; constructing the semantic descriptor from the thumbnail image; constructing the depth descriptor from the depth map; screening out part of the thumbnail images by combining the semantic descriptor, the depth descriptor and the view descriptor to generate a training set; and training the thumbnail generation network according to the view descriptor, the depth descriptor, the semantic descriptor and the training set. According to the invention, the obtained thumbnail images and the corresponding depth maps and normal maps are input into the trained thumbnail generation network to output the corresponding optimal thumbnail images, improving the accuracy of thumbnail image generation for building structure columns.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.

Claims (10)

1. An intelligent generation method for a building structure column image thumbnail for intelligent construction is characterized by comprising the following steps:
acquiring a thumbnail image, a corresponding depth map and a normal map of a building structure column; inputting the thumbnail images, the corresponding depth maps and the normal maps into a thumbnail generation network to output corresponding optimal thumbnail images;
the training process of the thumbnail generation network is as follows:
dividing the thumbnail image, the corresponding depth map and the corresponding normal map into a plurality of grids in the same proportion; determining a view descriptor by combining the change characteristics of the normal values corresponding to the pixel points in each grid of the normal map, the depth values corresponding to the pixel points of the corresponding grid in the depth map, and the contrast of the corresponding grid in the thumbnail image under each color channel; segmenting the thumbnail image to obtain at least two segmentation categories, and constructing a semantic descriptor according to the area ratio of each segmentation category; constructing a depth descriptor according to the depth map, and constructing a discrimination index by combining the semantic descriptor and the depth descriptor; screening out part of the thumbnail images according to the differences of the discrimination indexes, the semantic descriptors and the view descriptors between the thumbnail images to generate a training set; and training the thumbnail generation network according to the view descriptor, the depth descriptor, the semantic descriptor and the training set.
2. The intelligent generation method of a building structure column image thumbnail for intelligent construction according to claim 1, wherein the determining a view descriptor by combining the variation characteristics of the normal values of the pixel points in each grid of the normal map, the depth values of the pixel points in the corresponding grid of the depth map, and the contrast of the corresponding grid of the thumbnail image under each color channel comprises:
combining the variation characteristics of the normal values of the pixel points in each grid of the normal map with the depth values of the pixel points in the corresponding grid of the depth map to obtain an index value for each grid; constructing an index value vector from the index values of all grids in the normal map;
obtaining a joint value for each grid under each color channel according to the contrast of each grid in the thumbnail image under that color channel and the contrast of the corresponding grid in the depth map; constructing, for each color channel, a joint value vector from the joint values of all grids under that channel;
and constructing the view descriptor from the index value vector and the joint value vector of each color channel.
3. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 2, wherein the combining the variation characteristics of the normal values of the pixel points in each grid of the normal map with the depth values of the pixel points in the corresponding grid of the depth map to obtain an index value for each grid comprises:
obtaining the kurtosis of each grid from the variation characteristics of the normal values of its pixel points in the normal map; and taking the product of the kurtosis of each grid and the mean of the depth values of the pixel points in the corresponding grid of the depth map as the index value of that grid.
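A minimal sketch of the per-grid index value of claim 3, assuming Pearson kurtosis (the fourth standardized moment) as the "kurtosis"; the function names are hypothetical and for illustration only.

```python
import numpy as np

def kurtosis(values):
    # Pearson kurtosis: the fourth standardized moment of the values.
    x = np.asarray(values, dtype=float).ravel()
    s = x.std()
    if s == 0:
        return 0.0  # a constant grid has no variation characteristic
    return float(np.mean(((x - x.mean()) / s) ** 4))

def grid_index_value(normal_grid, depth_grid):
    # Claim 3: index value = kurtosis of the grid's normal values
    # times the mean depth of the corresponding depth-map grid.
    return kurtosis(normal_grid) * float(np.mean(depth_grid))
```

A grid whose normals vary sharply (high kurtosis) at a large mean depth thus receives a large index value.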
4. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 2, wherein the obtaining a joint value for each grid under each color channel according to the contrast of each grid in the thumbnail image under that color channel and the contrast of the corresponding grid in the depth map comprises:
selecting any color channel as a target color channel, selecting any grid in the thumbnail image as a target structure grid, and taking the grid at the corresponding position in the depth map as a target depth grid;
calculating the Michelson contrast of the color channel values of the pixel points of the target structure grid under the target color channel, as the color contrast; calculating the Michelson contrast of the depth values of the pixel points of the target depth grid in the depth map, as the depth contrast;
and taking the product of the color contrast of the target structure grid and the depth contrast of the target depth grid as the joint value of the target structure grid under the target color channel.
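The joint value of claim 4 can be sketched with the standard Michelson contrast formula, (Imax - Imin) / (Imax + Imin); the helper names are illustrative assumptions, and the zero-denominator guard is a defensive choice not specified in the claim.

```python
import numpy as np

def michelson_contrast(values):
    # Michelson contrast: (max - min) / (max + min), defined as 0
    # for an all-zero (uniformly dark) grid to avoid division by zero.
    vmax, vmin = float(np.max(values)), float(np.min(values))
    denom = vmax + vmin
    return (vmax - vmin) / denom if denom > 0 else 0.0

def joint_value(color_grid, depth_grid):
    # Claim 4: the joint value of a grid under one color channel is the
    # product of its color contrast and the matching depth contrast.
    return michelson_contrast(color_grid) * michelson_contrast(depth_grid)
```

For example, channel values {50, 100, 150, 200} give a color contrast of 150/250 = 0.6, and depths {1, 3} give a depth contrast of 0.5, so the joint value is 0.3.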
5. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 1, wherein the segmenting the thumbnail image to obtain at least two segmentation categories and constructing a semantic descriptor from the area ratio of each segmentation category comprises:
performing semantic segmentation on the thumbnail image with a semantic segmentation network to obtain at least two segmentation categories; obtaining the area ratio of each segmentation category, applying min-max (range) normalization to the area ratios to obtain a ratio feature value for each segmentation category, and constructing the corresponding semantic descriptor from the ratio feature values of all segmentation categories.
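Given a per-pixel label map from the segmentation network, the semantic descriptor of claim 5 reduces to area ratios followed by min-max normalization. A hypothetical sketch (the `semantic_descriptor` name and the all-zero fallback for a degenerate single-range case are assumptions):

```python
import numpy as np

def semantic_descriptor(label_map):
    # Area ratio of each segmentation category, then min-max (range)
    # normalization across categories, per claim 5.
    _, counts = np.unique(label_map, return_counts=True)
    ratios = counts / counts.sum()
    lo, hi = ratios.min(), ratios.max()
    if hi == lo:
        return np.zeros_like(ratios)  # all categories equally large
    return (ratios - lo) / (hi - lo)
```

With categories covering 50%, 25% and 25% of the image, the descriptor becomes [1, 0, 0]: the dominant category maps to 1 and the smallest to 0.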
6. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 5, wherein the constructing a depth descriptor from the depth map comprises:
taking the pixel points of the segmentation category with the largest area as main-body pixel points, and mapping the main-body pixel points into the depth map to obtain depth main-body pixel points in the depth map;
and constructing the depth descriptor according to the frequency of occurrence of the depth value of each depth main-body pixel point in the depth map.
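One way to realise the depth descriptor of claim 6 is a normalized histogram over the depths of the main-body pixels. The bin count, the [0, 1] depth range and the function name are assumptions for illustration; the claim itself only requires depth-value occurrence frequencies.

```python
import numpy as np

def depth_descriptor(depth_map, label_map, n_bins=4):
    # The largest segmentation category supplies the main-body pixels;
    # the descriptor is the normalized frequency histogram of their
    # depth values (assumed to lie in [0, 1]).
    labels, counts = np.unique(label_map, return_counts=True)
    main = labels[np.argmax(counts)]
    body_depths = depth_map[label_map == main]
    hist, _ = np.histogram(body_depths, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)
```

The histogram sums to 1, so thumbnails with different main-body sizes remain directly comparable.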
7. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 1, wherein the screening out part of the thumbnail images according to the differences in the discrimination indexes, the semantic descriptors and the view descriptors between thumbnail images to generate a training set comprises:
determining a local outlier factor for each thumbnail image according to the differences in the discrimination indexes, the semantic descriptors and the view descriptors between thumbnail images; and screening out part of the thumbnail images according to the local outlier factors to generate the training set.
8. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 7, wherein the determining a local outlier factor for each thumbnail image according to the differences in the discrimination indexes, the semantic descriptors and the view descriptors between thumbnail images comprises:
determining the reachable distance between thumbnail images according to the differences in the discrimination indexes, the semantic descriptors and the view descriptors between thumbnail images; and determining the local outlier factor corresponding to each thumbnail image according to the reachable distances between thumbnail images.
9. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 8, wherein the determining the reachable distance between thumbnail images according to the differences in the discrimination indexes, the semantic descriptors and the view descriptors between thumbnail images comprises:
for any two thumbnail images, calculating the similarity of their discrimination indexes as an initial similarity, and applying a negative correlation mapping to the initial similarity to obtain the corresponding discrimination difference; calculating the difference of the semantic descriptors between the two thumbnail images as a semantic distance; calculating the difference of the view descriptors between the two thumbnail images as a view distance; and taking the product of the discrimination difference, the semantic distance and the view distance as the reachable distance between the two thumbnail images.
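A sketch of the reachable distance of claim 9. The claim does not fix the similarity or the negative correlation mapping, so cosine similarity, `exp(-s)` as the negative mapping, and Euclidean distances for the semantic and view differences are all assumptions made here for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity, 0 when either vector is all zeros.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def reachable_distance(disc1, disc2, sem1, sem2, view1, view2):
    # Claim 9: product of the (negatively mapped) discrimination-index
    # similarity, the semantic distance and the view distance.
    disc_diff = np.exp(-cosine_sim(disc1, disc2))   # negative mapping
    sem_dist = float(np.linalg.norm(sem1 - sem2))   # semantic distance
    view_dist = float(np.linalg.norm(view1 - view2))  # view distance
    return disc_diff * sem_dist * view_dist
```

Two thumbnails with identical semantic descriptors (or identical view descriptors) get a reachable distance of zero, since the product vanishes with either factor.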
10. The intelligent generation method of the building structure column image thumbnail for intelligent construction according to claim 7, wherein the screening out part of the thumbnail images according to the local outlier factors to generate the training set comprises:
sorting the thumbnail images by their local outlier factors from largest to smallest to construct a local outlier factor sequence, and taking the thumbnail images corresponding to the first half of the local outlier factor sequence as the images in the training set of the thumbnail generation network.
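The selection step of claim 10 is a simple sort-and-truncate; a sketch under the assumption that exactly the first half (integer division) of the sorted sequence is kept:

```python
import numpy as np

def select_training_images(images, lof_scores):
    # Claim 10: sort thumbnails by local outlier factor, largest first,
    # and keep the first half of the sequence as the training set.
    order = np.argsort(lof_scores)[::-1]  # indices, descending by LOF
    half = len(order) // 2
    return [images[i] for i in order[:half]]
```

Keeping the high-LOF half biases the training set toward atypical thumbnails, which carry more distinguishing information for the generation network.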
CN202310825541.7A 2023-07-06 2023-07-06 Intelligent generation method of building structure column image thumbnail for intelligent construction Active CN116542859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310825541.7A CN116542859B (en) 2023-07-06 2023-07-06 Intelligent generation method of building structure column image thumbnail for intelligent construction


Publications (2)

Publication Number Publication Date
CN116542859A CN116542859A (en) 2023-08-04
CN116542859B true CN116542859B (en) 2023-09-01

Family

ID=87444010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310825541.7A Active CN116542859B (en) 2023-07-06 2023-07-06 Intelligent generation method of building structure column image thumbnail for intelligent construction

Country Status (1)

Country Link
CN (1) CN116542859B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173748B (en) * 2023-11-03 2024-01-26 杭州登虹科技有限公司 Video humanoid event extraction system based on humanoid recognition and humanoid detection
CN117351325B (en) * 2023-12-06 2024-03-01 浙江省建筑设计研究院 Model training method, building effect graph generation method, equipment and medium
CN118051796B (en) * 2024-04-16 2024-06-18 自贡市第一人民医院 Intelligent analysis method for monitoring data of disinfection supply center

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211341A (en) * 2006-12-29 2008-07-02 上海芯盛电子科技有限公司 Image intelligent mode recognition and searching method
CN103703437A (en) * 2012-03-06 2014-04-02 苹果公司 Application for viewing images
CN106294798A (en) * 2016-08-15 2017-01-04 华为技术有限公司 A kind of images share method based on thumbnail and terminal
CN106651765A (en) * 2016-12-30 2017-05-10 深圳市唯特视科技有限公司 Method for automatically generating thumbnail by use of deep neutral network
CN108537519A (en) * 2018-05-03 2018-09-14 陕西理工大学 A kind of architectural design method based on BIM
CN111738318A (en) * 2020-06-11 2020-10-02 大连理工大学 Super-large image classification method based on graph neural network
CN112509051A (en) * 2020-12-21 2021-03-16 华南理工大学 Bionic-based autonomous mobile platform environment sensing and mapping method
CN112596727A (en) * 2020-10-30 2021-04-02 南京北冥鲲科技有限公司 Web end lightweight display method suitable for Revit BIM model
FR3116925A1 (en) * 2020-11-27 2022-06-03 Reuniwatt Method for detecting photovoltaic installations on an image by deep learning
CN115018249A (en) * 2022-04-26 2022-09-06 南通大学 Subway station construction quality evaluation method based on laser scanning technology
CN115727854A (en) * 2022-11-28 2023-03-03 同济大学 VSLAM positioning method based on BIM structure information
CN115900710A (en) * 2022-11-01 2023-04-04 西安电子科技大学 Dynamic environment navigation method based on visual information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10482741B2 (en) * 2016-04-01 2019-11-19 Daniel J. Horon Multi-frame display for a fire protection and security monitoring system
CN107071383A (en) * 2017-02-28 2017-08-18 北京大学深圳研究生院 The virtual visual point synthesizing method split based on image local


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shining Chen et al., "Application of Membrane Separation Technology in Wastewater Treatment of Iron and Steel Enterprise", Characterization of Minerals, Metals, and Materials 2017, 2017, pp. 545-551. *

Also Published As

Publication number Publication date
CN116542859A (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN116542859B (en) Intelligent generation method of building structure column image thumbnail for intelligent construction
US11798174B2 (en) Method, device, equipment and storage medium for locating tracked targets
US10297070B1 (en) 3D scene synthesis techniques using neural network architectures
EP0526881A2 (en) Three-dimensional model processing method, and apparatus therefor
CN110929696A (en) Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
US20230014934A1 (en) Segmenting a 3d modeled object representing a mechanical assembly
CN117475170B (en) FPP-based high-precision point cloud registration method guided by local-global structure
Qu et al. Visual cross-image fusion using deep neural networks for image edge detection
US6542639B2 (en) Region extraction apparatus, region extraction method and computer readable recording medium
CN115457195A (en) Two-dimensional and three-dimensional conversion method, system, equipment and medium for distribution network engineering drawings
Murtiyoso et al. Semantic segmentation for building façade 3D point cloud from 2D orthophoto images using transfer learning
Wu et al. Multi-view 3D reconstruction based on deep learning: A survey and comparison of methods
CN116645608A (en) Remote sensing target detection based on Yolox-Tiny biased feature fusion network
US20230019751A1 (en) Method and apparatus for light estimation
EP4293623A1 (en) Image depth prediction method and electronic device
Loesch et al. Localization of 3D objects using model-constrained SLAM
CN113689437A (en) Interactive image segmentation method based on iterative selection-correction network
CN114758123A (en) Remote sensing image target sample enhancement method
CN112419310A (en) Target detection method based on intersection and fusion frame optimization
Li et al. Real-time 3D reconstruction system using multi-task feature extraction network and surfel
CN111368932B (en) Image comparison method and system
Kontinen Drift Correction for SLAM Point Clouds: Smooth Natural Neighbor Interpolation
Chen Web-based Deep Segmentation of Building Structure
He Research on outdoor garden scene reconstruction based on PMVS Algorithm
US20230401670A1 (en) Multi-scale autoencoder generation method, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant