CN113902712A - Image processing method, device, equipment and medium based on artificial intelligence - Google Patents


Info

Publication number
CN113902712A
Authority
CN
China
Prior art keywords
building structure
column
features
context
facade
Prior art date
Legal status
Pending
Application number
CN202111186574.9A
Other languages
Chinese (zh)
Inventor
张翼腾
陈雪锦
王鑫
张润泽
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111186574.9A
Publication of CN113902712A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Abstract

The application provides an image processing method, an image processing device, an electronic device, a computer-readable storage medium, and a computer program product based on artificial intelligence, relating to artificial intelligence technology. The method comprises the following steps: performing feature extraction processing on an image comprising a building structure to obtain initial features of the building structure; performing column-space-based aggregation processing on the initial features of the building structure to obtain column context features of the building structure; performing row-space-based aggregation processing on the initial features of the building structure to obtain row context features of the building structure; performing fusion processing based on the column context features and the row context features of the building structure to obtain enhanced features of the building structure; and performing facade element detection processing based on the enhanced features of the building structure to obtain the position information of the facade elements in the building structure. Through the method and the device, the accuracy of detection of the building structure in the image can be improved.

Description

Image processing method, device, equipment and medium based on artificial intelligence
Technical Field
The present application relates to artificial intelligence technology, and in particular, to an image processing method and apparatus based on artificial intelligence, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial Intelligence (AI) is a comprehensive technique in computer science; by studying the design principles and implementation methods of various intelligent machines, it gives machines the functions of perception, reasoning, and decision making. Artificial intelligence technology is a comprehensive subject involving a wide range of fields, for example, natural language processing technology and machine learning/deep learning. With the development of the technology, artificial intelligence will be applied in more fields and play an increasingly important role.
Image processing is one of the important applications in the field of artificial intelligence, and is capable of determining position information of facade elements in a building structure in an image including the building structure, so that subsequent post-processing of the building structure is performed based on the position information of the facade elements.
The related art lacks an effective solution for such image processing and relies primarily on detecting the regions of facade elements in the building structure from dense building classification results. However, this approach locates the regions of the facade elements inaccurately and wastes a large amount of computing resources.
Disclosure of Invention
The embodiment of the application provides an image processing method and device based on artificial intelligence, an electronic device, a computer readable storage medium and a computer program product, which can improve the accuracy of detection of building structures in images.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an image processing method based on artificial intelligence, which comprises the following steps:
carrying out feature extraction processing on an image comprising a building structure to obtain initial features of the building structure;
performing column space-based aggregation processing on the initial features of the building structure to obtain column context features of the building structure;
performing row-space-based aggregation processing on the initial features of the building structure to obtain row context features of the building structure;
performing fusion processing based on the column context features and the row context features of the building structure to obtain enhanced features of the building structure;
and carrying out facade element detection processing based on the enhanced features of the building structure to obtain the position information of the facade elements in the building structure.
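The two aggregation steps above can be sketched with a minimal NumPy implementation. It assumes a plain single-head self-attention formulation within each column (and, transposed, within each row); the shapes, scaling factor, and function names are illustrative assumptions, not the patent's exact network:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def column_context(feat):
    """Column-space aggregation: each position attends to all positions in
    its own column, so vertically repeated facade elements (e.g. stacked
    windows) reinforce each other."""
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for w in range(W):
        col = feat[:, w, :]                               # (H, C): one column of features
        attn = softmax(col @ col.T / np.sqrt(C), axis=-1)  # (H, H) column attention
        out[:, w, :] = attn @ col                          # attention-weighted aggregation
    return out

def row_context(feat):
    """Row-space aggregation: the same operation applied along each row."""
    t = np.transpose(feat, (1, 0, 2))                      # swap rows and columns
    return np.transpose(column_context(t), (1, 0, 2))

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 16))  # toy "initial features" of a facade image
print(column_context(feat).shape, row_context(feat).shape)  # (8, 6, 16) (8, 6, 16)
```

Both outputs keep the spatial layout of the initial features, so they can later be fused position-by-position.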
In the above technical solution, the performing row-space-based aggregation processing on the initial features of the building structure to obtain the row context features of the building structure includes:
performing row attention processing on the initial features of the building structure to obtain a row attention map of the building structure;
and performing context aggregation processing based on the row attention map of the building structure to obtain the row context features of the building structure.
The embodiment of the application provides an image processing device based on artificial intelligence, comprising:
the feature extraction module is used for performing feature extraction processing on an image comprising a building structure to obtain initial features of the building structure;
the first aggregation module is used for carrying out aggregation processing based on a column space on the initial characteristics of the building structure to obtain column context characteristics of the building structure;
the second aggregation module is used for performing row-space-based aggregation processing on the initial features of the building structure to obtain row context features of the building structure;
the fusion module is used for performing fusion processing on the column context features and the row context features of the building structure to obtain the enhanced features of the building structure;
and the detection module is used for carrying out facade element detection processing based on the enhanced features of the building structure to obtain the position information of the facade elements in the building structure.
In the above technical solution, the first aggregation module is further configured to perform a column attention process on the initial features of the building structure to obtain a column attention map of the building structure;
and perform context aggregation processing based on the column attention map of the building structure to obtain the column context features of the building structure.
In the above technical solution, the first aggregation module is further configured to perform value-feature-based mapping processing on the initial feature of the building structure to obtain a value feature map of the building structure;
performing column feature extraction processing on the value feature map of the building structure to obtain column features in the value feature map;
and weighting the column features in the value feature map based on the column attention map of the building structure to obtain the column context features of the building structure.
In the above technical solution, the first aggregation module is further configured to perform mapping processing based on query features on the initial features of the building structure to obtain a query feature map of the building structure;
mapping processing based on key features is carried out on the initial features of the building structure to obtain a key feature map of the building structure;
and performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure.
In the above technical solution, the first aggregation module is further configured to perform column feature extraction processing on the key feature map of the building structure to obtain column features of the key feature map;
and carrying out correlation processing on the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure.
In the above technical solution, the first aggregation module is further configured to execute the following processing for any one of a plurality of positions in the query feature map:
determining a query feature vector for the location based on a query feature map of the building structure;
determining a column feature vector for the location based on column features of the key feature map;
determining an attention weight for the location based on the query feature vector for the location and the column feature vector for the location;
and combining the attention weights corresponding to the plurality of positions to obtain the column attention map of the building structure.
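The per-position scoring described above can be sketched with NumPy. This is an illustrative assumption of the mechanism, not the patent's exact network: in particular, average-pooling each column of the key feature map to obtain its "column features" is a hypothetical choice, since this passage does not fix the pooling operation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def column_attention_map(query, key):
    """For each position, score its query vector against every column's
    pooled key vector; softmax over columns yields the attention map."""
    H, W, C = query.shape
    col_keys = key.mean(axis=0)                    # (W, C): pooled feature per column (assumed pooling)
    scores = query.reshape(H * W, C) @ col_keys.T  # (H*W, W): position-vs-column scores
    return softmax(scores, axis=-1).reshape(H, W, W)

rng = np.random.default_rng(1)
q = rng.standard_normal((6, 5, 8))   # toy query feature map
k = rng.standard_normal((6, 5, 8))   # toy key feature map
attn = column_attention_map(q, k)
print(attn.shape)                    # (6, 5, 5): each position's weights over the 5 columns
```

Each position's weights sum to 1, so the subsequent aggregation is a convex combination of column features.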
In the above technical solution, the second aggregation module is further configured to perform row attention processing on the initial features of the building structure to obtain a row attention map of the building structure;
and perform context aggregation processing based on the row attention map of the building structure to obtain the row context features of the building structure.
In the above technical solution, the fusion module is further configured to perform splicing processing on the context features of the columns and the context features of the rows of the building structure to obtain the context features of the building structure;
mapping the context characteristics of the building structure to obtain the mapping characteristics of the building structure;
and adding the mapping characteristics of the building structure and the initial characteristics of the building structure to obtain the enhanced characteristics of the building structure.
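A minimal sketch of this splice-map-add fusion, with a plain matrix standing in for the learned mapping (assumed here to play the role of a 1x1 convolution; the patent does not specify the mapping's form in this passage):

```python
import numpy as np

def fuse(col_ctx, row_ctx, feat, w_proj):
    """Splice (concatenate) the two context maps along channels, map them
    back to the original channel count, and add the initial features."""
    ctx = np.concatenate([col_ctx, row_ctx], axis=-1)  # (H, W, 2C) spliced context
    return ctx @ w_proj + feat                         # (H, W, C) enhanced features

rng = np.random.default_rng(2)
H, W, C = 4, 4, 8
feat = rng.standard_normal((H, W, C))       # initial features
col_ctx = rng.standard_normal((H, W, C))    # column context features
row_ctx = rng.standard_normal((H, W, C))    # row context features
w_proj = rng.standard_normal((2 * C, C)) * 0.1  # stands in for a learned 1x1 convolution
enhanced = fuse(col_ctx, row_ctx, feat, w_proj)
print(enhanced.shape)                       # (4, 4, 8)
```

The residual addition of the initial features means the context branches only need to learn a correction, which is a common design for such enhancement modules.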
In the above technical solution, the detection module is further configured to perform central point prediction processing based on a facade element on the enhanced feature of the building structure to obtain central point information of the facade element in the building structure;
carrying out size prediction processing based on a facade element on the enhanced feature of the building structure to obtain size information of the facade element in the building structure;
and determining the position information of the facade elements in the building structure based on the central point information of the facade elements in the building structure and the size information.
In the above technical solution, the detection module is further configured to perform offset prediction processing based on a facade element on the enhanced feature of the building structure to obtain offset information of the facade element in the building structure;
adding the offset information of the facade elements in the building structure and the central point information to obtain standard central point information of the facade elements in the building structure;
and determining the position information of the facade elements in the building structure based on the standard central point information of the facade elements in the building structure and the size information.
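The decoding described above — refine the predicted center with the offset, then expand by the size — can be sketched as follows; the concrete values are hypothetical:

```python
def facade_element_box(center, offset, size):
    """Add the offset to the predicted center point to get the standard
    center point, then derive bounding-box corners from center and size."""
    cx, cy = center[0] + offset[0], center[1] + offset[1]  # standard center point
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# e.g. a window predicted at grid cell (12, 30) with a sub-cell offset:
box = facade_element_box(center=(12, 30), offset=(0.5, 0.5), size=(8, 10))
print(box)  # (8.5, 25.5, 16.5, 35.5)
```

The offset compensates for the quantization introduced when the center is predicted on a downsampled feature map.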
In the above technical solution, the image processing method is implemented by calling a neural network model; the device further comprises:
the training module is used for performing facade element prediction processing on an image sample comprising a building structure through the initialized neural network model to obtain the predicted position information of the facade elements in the image sample;
constructing a position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample;
and updating parameters of the neural network model based on the position loss function, and taking the updated parameters of the neural network model as the parameters of the trained neural network model.
In the above technical solution, the predicted position information is represented by predicted central point information, predicted offset information, and predicted size information;
the training module is further used for constructing a central point loss function of the neural network model based on the predicted central point information and the central point label of the facade element in the image sample;
constructing an offset loss function of the neural network model based on the predicted offset information and offset labels of the facade elements in the image sample;
constructing a size loss function of the neural network model based on the predicted size information and the size labels of the facade elements in the image sample;
and carrying out weighted summation processing on the central point loss function, the offset loss function and the size loss function to obtain a position loss function of the neural network model.
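The weighted summation can be sketched as follows. The weight values are illustrative assumptions, since this passage does not specify them:

```python
def position_loss(l_center, l_offset, l_size,
                  w_center=1.0, w_offset=1.0, w_size=0.1):
    """Weighted sum of the center point, offset, and size loss functions,
    yielding the position loss used to update the network parameters."""
    return w_center * l_center + w_offset * l_offset + w_size * l_size

# e.g. with component losses already computed for a batch:
loss = position_loss(l_center=0.8, l_offset=0.3, l_size=2.0)
print(loss)  # ≈ 1.3
```

Down-weighting the size term (here 0.1) is a common choice when its raw magnitude, measured in pixels, dominates the other terms.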
An embodiment of the present application provides an electronic device for image processing, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for realizing the image processing method based on artificial intelligence provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute, so as to implement the artificial intelligence based image processing method provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of integrating the column context characteristics and the column context characteristics of the building structure in the image to obtain the enhancement characteristics of the building structure in the image, and carrying out facade element detection processing based on the enhancement characteristics of the building structure, so that the spatial arrangement regularity of the building structure in the image in rows and columns is effectively utilized to obtain the position information of the facade elements in the building structure, the accuracy of detection of the building structure in the image is improved, and compared with a scheme for detecting the building structure based on a dense building classification result, the method saves related computing resources.
Drawings
FIG. 1 is a schematic diagram of an application scenario of an image processing system provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIGS. 3A-3C are schematic flowcharts of artificial intelligence based image processing methods provided by embodiments of the present application;
FIG. 4 is a schematic diagram of a structure of a column context branch provided in an embodiment of the present application;
FIG. 5 is a block diagram of a row context branch provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating a parsing result provided by the related art;
FIG. 7 is a schematic diagram of a parsing result provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a building facade analysis result loading process provided in an embodiment of the present application;
FIG. 9 is a rendering schematic diagram of a building facade model provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a programmed building facade model according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a building facade editor provided by an embodiment of the application;
FIG. 12 is a schematic diagram of a city street scene model construction provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of the regularity of the layout of the facade elements provided by the embodiment of the present application;
FIG. 14 is a schematic diagram of a facade resolution network architecture based on element placement context according to an embodiment of the present application;
FIG. 15 is a schematic diagram illustrating a principle of a self-attention mechanism provided by an embodiment of the present application;
FIG. 16 is a schematic diagram illustrating a context aggregation principle of a column branch and a row branch according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a detector head structure based on center point prediction according to an embodiment of the present application;
FIG. 18 is a schematic diagram of a facade element bounding box prediction provided by an embodiment of the present application;
FIG. 19 is a qualitative visualization on an ECP data set provided by an embodiment of the present application;
fig. 20 is a graph of accuracy versus recall on a CMP data set provided by an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, the terms "first", "second", and the like are merely used to distinguish similar objects and do not denote a particular order or importance. It is to be understood that "first", "second", and the like may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) A client: applications running in the terminal for providing various services, such as a video client, a game client, and the like.
2) In response to: used to indicate the condition or state on which a performed operation depends; when the condition or state is satisfied, the one or more performed operations may be in real time or may have a set delay. Unless otherwise specified, there is no restriction on the order in which the operations are performed.
3) Building facade elements: the components presented on the interface where the building is in direct contact with the exterior space, such as windows, doors, balconies, mouldings, and the like.
The embodiment of the application provides an image processing method and device based on artificial intelligence, electronic equipment and a computer readable storage medium, and the accuracy of detection of an architectural structure in an image can be improved.
The artificial intelligence based image processing method provided by the embodiment of the application may be implemented by a terminal or a server alone, or by a terminal and a server in cooperation. For example, the terminal alone performs the artificial intelligence based image processing method described below; alternatively, the terminal sends a detection request for an image to be detected (an image including a building structure) to the server, and the server executes the method according to the received request: it performs column-space-based aggregation processing on the initial features of the building structure in the image to obtain column context features of the building structure, performs row-space-based aggregation processing on the initial features to obtain row context features, and performs facade element detection processing based on the column context features and the row context features to obtain the position information of the facade elements in the building structure. In this way, the spatial arrangement regularity of the building structure along rows and columns in the image is effectively utilized, and the accuracy of detection of the building structure in the image is improved.
The electronic device for image processing provided by the embodiment of the application can be various types of terminals or servers, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart television, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Taking a server as an example, a server cluster may be deployed in the cloud to open an artificial intelligence cloud service (AI as a Service, AIaaS) to users. The AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all users may access one or more of the artificial intelligence services provided by the AIaaS platform through an application programming interface.
For example, one of the artificial intelligence cloud services may be an image processing service, that is, the cloud server is packaged with the image processing program provided in the embodiment of the present application. A user calls the image processing service in the cloud service through a terminal (running a client, such as a detection client), so that the server deployed in the cloud calls the packaged image processing program to perform column-space-based aggregation processing on the initial features of the building structure in the image to obtain column context features of the building structure, perform row-space-based aggregation processing on the initial features to obtain row context features, and perform facade element detection processing based on the column context features and the row context features to obtain the position information of the facade elements in the building structure.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of an image processing system 10 provided in an embodiment of the present application, a terminal 200 is connected to a server 100 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
The terminal (running a client, such as a building detection client) may be used to obtain a detection request for an image to be detected (an image including a building structure); for example, when a user opens the detection client running on the terminal and selects an image including a building structure, the terminal automatically obtains a detection request for the image to be detected.
In some embodiments, an image processing plug-in may be embedded in the client running in the terminal 200, so as to implement the artificial intelligence based image processing method locally on the client. For example, the terminal 200 calls the image processing plug-in to perform column-space-based aggregation processing on the initial features of the building structure in the image to obtain column context features of the building structure, performs row-space-based aggregation processing on the initial features to obtain row context features, and performs facade element detection processing based on the column context features and the row context features to obtain the position information of the facade elements in the building structure. In this way, the spatial arrangement regularity of the building structure along rows and columns in the image is effectively utilized, the detection accuracy of the building structure in the image is improved, and subsequent post-processing of the building structure based on the position information of the facade elements, such as virtual building modeling in games and urbanization simulation modeling, is facilitated.
In some embodiments, after the terminal 200 obtains a detection request for an image to be detected (an image including a building structure), it calls the image processing interface of the server 100 (which may be provided in the form of a cloud service, that is, an image processing service). Based on the detection request, the server 100 performs column-space-based aggregation processing on the initial features of the building structure in the image to obtain column context features, performs row-space-based aggregation processing on the initial features to obtain row context features, and performs facade element detection processing based on the column context features and the row context features to obtain the position information of the facade elements in the building structure, thereby effectively utilizing the spatial arrangement regularity of the building structure along rows and columns in the image. The server then sends the position information of the facade elements to the terminal 200, so that the positions of the facade elements in the image to be detected are presented on the terminal 200. This improves the detection accuracy of the building structure in the image and supports subsequent post-processing of the building structure based on the position information of the facade elements, such as virtual building modeling in games and urbanization simulation modeling.
In some embodiments, the terminal or the server may implement the artificial intelligence based image processing method provided by the embodiments of the present application by running a computer program, which is a client running in the terminal 200 as shown in fig. 1, for example, the computer program may be a native program or a software module in an operating system; can be a local (Native) Application program (APP), i.e. a program that needs to be installed in an operating system to run; or may be an applet, i.e. a program that can be run only by downloading it to the browser environment; but also an applet that can be embedded into any APP. In general, the computer programs described above may be any form of application, module or plug-in.
In some embodiments, multiple servers may be grouped into a blockchain, and the server 100 is a node on the blockchain, and there may be an information connection between each node in the blockchain, and information transmission between the nodes may be performed through the information connection. Data (for example, logic of image processing, and location information of facade elements) related to the artificial intelligence based image processing method provided in the embodiment of the present application may be stored in the blockchain.
The following describes the structure of an electronic device provided in an embodiment of the present application. Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present application. The electronic device 500 may be a terminal or a server; the electronic device 500 shown in fig. 2 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable connection and communication among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 540 in fig. 2.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other electronic devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
in some embodiments, the artificial intelligence based image processing apparatus provided by the embodiments of the present application can be implemented in software. Fig. 2 shows an artificial intelligence based image processing apparatus 555 stored in the memory 550, which can be software in the form of a program, a plug-in, or the like, and includes the following software modules: a feature extraction module 5551, a first aggregation module 5552, a second aggregation module 5553, a fusion module 5554, a detection module 5555, and a training module 5556. These modules are logical, and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
As described above, the artificial intelligence based image processing method provided by the embodiment of the present application can be implemented by various types of electronic devices. Referring to fig. 3A, fig. 3A is a schematic flowchart of an artificial intelligence based image processing method provided in an embodiment of the present application, and is described with reference to the steps shown in fig. 3A.
In the following steps, the building structure may be a real building, a virtual building model, or the like.
In step 101, feature extraction processing is performed on an image including a building structure to obtain an initial feature of the building structure.
As an example of obtaining the image including the building structure: when a user selects an image including a building structure (that is, an image to be detected) through a terminal, the terminal automatically generates a detection request for the image to be detected and sends it to a server. The server receives and parses the detection request to obtain the image including the building structure, and performs feature extraction processing on the image through a feature extraction network to obtain the initial features of the building structure, so that subsequent facade parsing can be performed based on the initial features to detect the facade elements. The initial features of the building structure are low-order features obtained by preliminary feature extraction; they represent overall features of the building structure included in the image (for example, the position information, attribute information, and pixel value of each pixel), and may also include features of the image other than the building structure.
It should be noted that the embodiment of the present application does not limit the specific structure of the feature extraction network; for example, the feature extraction network may be an Hourglass network or a convolutional neural network. For example, the image including the building structure is subjected to feature extraction processing by an Hourglass network to obtain deep features (that is, the initial features) of the building structure.
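Purely to illustrate the shape contract of this step (an image mapped to a lower-resolution initial feature map), the following pure-Python sketch substitutes a stride-4 average pooling and a made-up channel projection for the Hourglass/convolutional backbone named above; it is not the network described in this application.

```python
def extract_features(img, c_out, stride=4):
    """Placeholder for the feature extraction network: stride-s average
    pooling followed by a hypothetical channel projection (here: the mean
    of the input channels, repeated for every output channel)."""
    c_in, h, w = len(img), len(img[0]), len(img[0][0])
    hs, ws = h // stride, w // stride
    pooled = [[[sum(img[c][i * stride + a][j * stride + b]
                    for a in range(stride) for b in range(stride)) / stride ** 2
                for j in range(ws)] for i in range(hs)] for c in range(c_in)]
    return [[[sum(pooled[c][i][j] for c in range(c_in)) / c_in
              for j in range(ws)] for i in range(hs)] for _ in range(c_out)]

img = [[[1.0] * 8 for _ in range(8)] for _ in range(3)]  # constant 3-channel 8x8 "image"
F = extract_features(img, c_out=2)  # initial feature map: 2 x 2 x 2
```

A real backbone would of course learn its weights; the sketch only fixes the input/output shapes for the steps that follow.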
In step 102, a column space-based aggregation process is performed on the initial features of the building structure to obtain column context features of the building structure.
For example, based on the spatial arrangement regularity of the building structure in rows and columns in the image, a column-space-based aggregation process can be performed on the initial features of the building structure through the column context branch in the element arrangement context module to obtain the column context features of the building structure, and the facade element detection process is then performed by integrating the row context features and the column context features of the building structure. This effectively utilizes the spatial arrangement regularity of the building structure in rows and columns in the image, improves the detection accuracy of the building structure, and saves computing resources compared with schemes that detect the building structure based on dense building classification results. The column context features of the building structure represent the correlation between each position in the image and all positions on the corresponding column; for example, for a position p = (i, j) in the image, the column context feature represents the correlation between the position p and all positions on the j-th column.
Referring to fig. 3B, fig. 3B is an alternative flowchart of the artificial intelligence based image processing method according to the embodiment of the present application, and fig. 3B shows that step 102 in fig. 3A can be implemented by steps 1021 to 1022: in step 1021, column attention processing is performed on the initial features of the building structure to obtain a column attention map of the building structure; in step 1022, context aggregation processing is performed based on the column attention map of the building structure to obtain the column context features of the building structure.
For example, column attention processing is performed on the initial features of the building structure (that is, the feature map F) through a self-attention mechanism to obtain the column attention map of the building structure (that is, the column attention map A_col), and context aggregation processing is performed based on the column attention map to obtain the column context features S_col of the building structure, so that the correlation between each position p = (i, j) in the building structure and all positions on the j-th column is calculated through the column context branch.
In some embodiments, performing the context aggregation process based on the column attention map of the building structure results in column context features of the building structure, including: carrying out mapping processing based on value characteristics on the initial characteristics of the building structure to obtain a value characteristic diagram of the building structure; performing column characteristic extraction processing on the value characteristic diagram of the building structure to obtain column characteristics in the value characteristic diagram; and weighting the column features in the value feature map based on the column attention map of the building structure to obtain the column context features of the building structure.
As shown in fig. 4, a value-feature-based mapping process is performed on the initial features of the building structure (that is, the feature map F in fig. 4) through the self-attention mechanism to obtain the value feature map of the building structure (that is, the value feature map V in fig. 4), a column feature extraction process is performed on the value feature map V to obtain the column features Ω_p in the value feature map V, and the column features Ω_p are weighted based on the column attention map of the building structure to obtain the column context features S_col of the building structure.
For example, the value feature map V has a size of C × H × W. For a position p = (i, j), the column feature extraction process obtains a column feature set Ω_p consisting of C vectors, where the c-th member of Ω_p is defined as

$$\Omega_p^c = \left( V_{c,1,j},\, V_{c,2,j},\, \dots,\, V_{c,H,j} \right)$$

where V_{c,i,j} represents the value of the value feature map V at (i, j) on the c-th channel. The values of the column attention map of the building structure at the position p are used as weights for the column feature Ω_p to perform context aggregation at the position p, and the aggregation process yields the column context feature vector at the position p:

$$S_{col}^{c,p} = \sum_{u=1}^{H} A_{col}^{u,p} \cdot V_{c,u,j}$$

where A_{col}^{u,p} represents the weight that the column attention map of the building structure assigns at position p to the u-th position of the j-th column. Combining the vectors S_{col}^{c,p} for all positions p in the building structure yields the column context features S_col of the building structure.
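The column aggregation above can be sketched in plain Python. Here V is a C × H × W value feature map and A_col[i][j][u] is the weight that position p = (i, j) assigns to row u of column j; all sizes and values are toy numbers for illustration, not the implementation of this application.

```python
def column_context(v, a_col):
    """S_col[c][i][j] = sum_u a_col[i][j][u] * v[c][u][j]:
    a weighted sum over the rows of column j, one weight vector per position."""
    c_n, h, w = len(v), len(v[0]), len(v[0][0])
    return [[[sum(a_col[i][j][u] * v[c][u][j] for u in range(h))
              for j in range(w)] for i in range(h)] for c in range(c_n)]

C, H, W = 2, 3, 2  # toy sizes
V = [[[float(c + i + j) for j in range(W)] for i in range(H)] for c in range(C)]
A_col = [[[1.0 / H] * H for _ in range(W)] for _ in range(H)]  # uniform attention
S_col = column_context(V, A_col)  # with uniform weights, each entry is a column mean
```

With uniform weights the aggregation reduces to a per-column average, which makes the toy output easy to check by hand.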
In some embodiments, performing a column attention process on an initial feature of a building structure to obtain a column attention map of the building structure comprises: carrying out mapping processing based on query features on the initial features of the building structure to obtain a query feature map of the building structure; mapping processing based on key features is carried out on the initial features of the building structure to obtain a key feature map of the building structure; and performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure.
As shown in fig. 4, a query-feature-based mapping process is performed on the initial features of the building structure (that is, the feature map F in fig. 4) through the self-attention mechanism to obtain the query feature map of the building structure (that is, the query feature map Q in fig. 4), a key-feature-based mapping process is performed on the initial features to obtain the key feature map of the building structure (that is, the key feature map K in fig. 4), and a column correlation process is performed through the column context branch by combining the query feature map and the key feature map to obtain the column attention map A_col of the building structure.
In some embodiments, performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure includes: performing column characteristic extraction processing on the key characteristic diagram of the building structure to obtain column characteristics of the key characteristic diagram; and carrying out correlation processing on the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure.
As shown in fig. 4, the key feature map K has a size of C × H × W. For a position p = (i, j), the column feature extraction process obtains the column feature Y_p = {K_{(1,j)}, K_{(2,j)}, …, K_{(u,j)}, …, K_{(H,j)}}, where the size of Y_p is H. For the query feature vector $Q_p \in \mathbb{R}^C$ at p on the query feature map Q, a correlation process is performed based on the query feature vector Q_p and the column feature Y_p to obtain the column attention weights at the position p, and the column attention weights of all positions in the building structure are combined to obtain the column attention map of the building structure.
In some embodiments, performing correlation processing on the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure, includes: performing the following for any one of a plurality of locations in the query feature map: determining a query feature vector of a location based on a query feature map of the building structure; determining a column feature vector of the location based on the column features of the key feature map; determining an attention weight of the location based on the query feature vector of the location and the column feature vector of the location; and combining the attention weights corresponding to the positions to obtain the column attention diagram of the building structure.
For example, for the column feature Y_p = {K_{(1,j)}, K_{(2,j)}, …, K_{(i,j)}, …, K_{(H,j)}} at p in the key feature map and the query feature vector Q_p at p on the query feature map Q, the attention weights at p are determined based on Q_p and Y_p:

$$A_{col}^{k,p} = \frac{\exp\left( Q_p^{\top} Y_p^{k} \right)}{\sum_{u=1}^{H} \exp\left( Q_p^{\top} Y_p^{u} \right)}$$

where A_{col}^{k,p} is the k-th element of the attention weight vector at p, and Y_p^k is the k-th feature vector in the set Y_p. The attention weights corresponding to all positions form the column attention map of the building structure.
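A minimal sketch of this column correlation computation with softmax normalization (toy sizes and made-up values; illustrative only, not the implementation of this application):

```python
import math

def column_attention(q, k):
    """a[i][j][u] = softmax over u of ( Q_p . K[:, u, j] ) for p = (i, j):
    each position attends to all rows of its own column."""
    c_n, h, w = len(q), len(q[0]), len(q[0][0])
    a = [[[0.0] * h for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            scores = [sum(q[c][i][j] * k[c][u][j] for c in range(c_n)) for u in range(h)]
            m = max(scores)                      # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            a[i][j] = [e / z for e in exps]
    return a

C, H, W = 2, 3, 2  # toy sizes
Q = [[[0.1 * (c + i + j) for j in range(W)] for i in range(H)] for c in range(C)]
K = [[[0.2 * (c + u) for _ in range(W)] for u in range(H)] for c in range(C)]
A_col = column_attention(Q, K)  # each A_col[i][j] is a weight vector over the H rows
```

Each weight vector is non-negative and sums to 1, so the subsequent aggregation is a convex combination of the value vectors along the column.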
In step 103, a row-space-based aggregation process is performed on the initial features of the building structure to obtain the row context features of the building structure.
For example, based on the spatial arrangement regularity of the building structure in rows and columns, a row-space-based aggregation process may be performed on the initial features of the building structure through the row context branch in the element arrangement context module to obtain the row context features of the building structure, and the facade element detection process is then performed by integrating the row context features and the column context features of the building structure. This effectively utilizes the spatial arrangement regularity of the building structure in rows and columns in the image, improves the detection accuracy of the building structure, and saves computing resources compared with schemes that detect the building structure based on dense building classification results. In the embodiment of the present application, steps 102 and 103 have no fixed execution order. The row context features of the building structure represent the correlation between each position in the image and all positions on the corresponding row; for example, for a position p = (i, j) in the image, the row context feature represents the correlation between the position p and all positions on the i-th row.
In some embodiments, performing a row-space-based aggregation process on the initial features of the building structure to obtain the row context features of the building structure includes: performing row attention processing on the initial features of the building structure to obtain a row attention map of the building structure; and performing context aggregation processing based on the row attention map of the building structure to obtain the row context features of the building structure.
For example, row attention processing is performed on the initial features of the building structure (that is, the feature map F) through the self-attention mechanism to obtain the row attention map of the building structure (that is, the row attention map A_row), and context aggregation processing is performed based on the row attention map to obtain the row context features S_row of the building structure, so that the correlation between each position p = (i, j) in the building structure and all positions on the i-th row is calculated through the row context branch.
In some embodiments, performing context aggregation processing based on the row attention map of the building structure to obtain the row context features of the building structure includes: performing a value-feature-based mapping process on the initial features of the building structure to obtain a value feature map of the building structure; performing row feature extraction processing on the value feature map to obtain the row features in the value feature map; and weighting the row features in the value feature map based on the row attention map of the building structure to obtain the row context features of the building structure.
As shown in fig. 5, a value-feature-based mapping process is performed on the initial features of the building structure (that is, the feature map F in fig. 5) through the self-attention mechanism to obtain the value feature map of the building structure (that is, the value feature map V in fig. 5), a row feature extraction process is performed on the value feature map V to obtain the row features Λ_p in the value feature map V, and the row features Λ_p are weighted based on the row attention map of the building structure to obtain the row context features S_row of the building structure.
For example, the value feature map V has a size of C × H × W. For a position p = (i, j), the row feature extraction process obtains a row feature set Λ_p consisting of C vectors, where the c-th member of Λ_p is defined as

$$\Lambda_p^c = \left( V_{c,i,1},\, V_{c,i,2},\, \dots,\, V_{c,i,W} \right)$$

where V_{c,i,j} represents the value of the value feature map V at (i, j) on the c-th channel. The values of the row attention map of the building structure at the position p are used as weights for the row feature Λ_p to perform context aggregation at the position p, and the aggregation process yields the row context feature vector at the position p:

$$S_{row}^{c,p} = \sum_{v=1}^{W} A_{row}^{v,p} \cdot V_{c,i,v}$$

where A_{row}^{v,p} represents the weight that the row attention map of the building structure assigns at position p to the v-th position of the i-th row. Combining the vectors S_{row}^{c,p} for all positions p in the building structure yields the row context features S_row of the building structure.
In some embodiments, performing row attention processing on the initial features of the building structure to obtain a row attention map of the building structure includes: performing a query-feature-based mapping process on the initial features of the building structure to obtain a query feature map of the building structure; performing a key-feature-based mapping process on the initial features to obtain a key feature map of the building structure; and performing row correlation processing based on the query feature map and the key feature map to obtain the row attention map of the building structure.
As shown in fig. 5, a query-feature-based mapping process is performed on the initial features of the building structure (that is, the feature map F in fig. 5) through the self-attention mechanism to obtain the query feature map of the building structure (that is, the query feature map Q in fig. 5), a key-feature-based mapping process is performed on the initial features to obtain the key feature map of the building structure (that is, the key feature map K in fig. 5), and a row correlation process is performed through the row context branch by combining the query feature map and the key feature map to obtain the row attention map A_row of the building structure.
In some embodiments, performing row correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain the row attention map of the building structure includes: performing row feature extraction processing on the key feature map of the building structure to obtain the row features of the key feature map; and performing correlation processing on the query feature map of the building structure based on the row features of the key feature map to obtain the row attention map of the building structure.
As shown in fig. 5, the key feature map K has a size of C × H × W. For a position p = (i, j), the row feature extraction process obtains the row feature X_p = {K_{(i,1)}, K_{(i,2)}, …, K_{(i,v)}, …, K_{(i,W)}}, where the size of X_p is W. For the query feature vector $Q_p \in \mathbb{R}^C$ at p on the query feature map Q, a correlation process is performed based on the query feature vector Q_p and the row feature X_p to obtain the row attention weights at the position p, and the row attention weights of all positions in the building structure are combined to obtain the row attention map of the building structure.
In some embodiments, performing correlation processing on the query feature map of the building structure based on the row features of the key feature map to obtain the row attention map of the building structure includes: performing the following for any one of a plurality of positions in the query feature map: determining the query feature vector of the position based on the query feature map of the building structure; determining the row feature vector of the position based on the row features of the key feature map; and determining the attention weight of the position based on the query feature vector of the position and the row feature vector of the position; and combining the attention weights corresponding to the positions to obtain the row attention map of the building structure.
For example, for the row feature X_p = {K_{(i,1)}, K_{(i,2)}, …, K_{(i,j)}, …, K_{(i,W)}} at p in the key feature map and the query feature vector Q_p at p on the query feature map Q, the attention weights at p are determined based on Q_p and X_p:

$$A_{row}^{k,p} = \frac{\exp\left( Q_p^{\top} X_p^{k} \right)}{\sum_{v=1}^{W} \exp\left( Q_p^{\top} X_p^{v} \right)}$$

where A_{row}^{k,p} is the k-th element of the attention weight vector at p, and X_p^k is the k-th feature vector in the set X_p. The attention weights corresponding to all positions form the row attention map of the building structure.
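The row branch mirrors the column branch, so the row attention weights and the row context aggregation can be sketched together in one pass (again with toy sizes and made-up values; illustrative only):

```python
import math

def row_branch(q, k, v):
    """For each p = (i, j): softmax weights over X_p = K[:, i, :] (all columns
    of row i), then S_row[c][i][j] = sum_t w[t] * v[c][i][t]."""
    c_n, h, w = len(q), len(q[0]), len(q[0][0])
    s = [[[0.0] * w for _ in range(h)] for _ in range(c_n)]
    for i in range(h):
        for j in range(w):
            scores = [sum(q[c][i][j] * k[c][i][t] for c in range(c_n)) for t in range(w)]
            m = max(scores)
            e = [math.exp(x - m) for x in scores]
            z = sum(e)
            weights = [x / z for x in e]
            for c in range(c_n):
                s[c][i][j] = sum(weights[t] * v[c][i][t] for t in range(w))
    return s

C, H, W = 2, 2, 3  # toy sizes
Q = [[[0.1 * (c + i + j) for j in range(W)] for i in range(H)] for c in range(C)]
K = [[[0.2 * (c + j) for j in range(W)] for i in range(H)] for c in range(C)]
V = [[[float(c + i + j) for j in range(W)] for i in range(H)] for c in range(C)]
S_row = row_branch(Q, K, V)
```

Because the weights form a convex combination, every aggregated value stays within the range of the value vectors along its own row.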
In step 104, a fusion process is performed based on the column context features and the row context features of the building structure to obtain enhanced features of the building structure.
For example, after the column context features and the row context features of the building structure are obtained, they need to be fused to obtain the enhanced features of the building structure; for example, the column context features and the row context features are concatenated to obtain the enhanced features. The enhanced features of the building structure represent the correlation between each position in the building structure and all positions on the corresponding row and column; they subsequently allow the spatial arrangement regularity of the building structure in rows and columns in the image to be effectively utilized, improve the detection accuracy of the building structure, and save computing resources compared with schemes that detect the building structure based on dense building classification results.
In some embodiments, performing the fusion process based on the column context features and the row context features of the building structure to obtain the enhanced features of the building structure includes: concatenating the column context features and the row context features of the building structure to obtain the context features of the building structure; mapping the context features of the building structure to obtain the mapping features of the building structure; and adding the mapping features of the building structure and the initial features of the building structure to obtain the enhanced features of the building structure.
For example, the column context features S_col and the row context features S_row of the building structure are concatenated to obtain the context features S of the building structure, which are then processed by a convolutional layer to generate a feature map M = ω(S) rich in context information, that is, the mapping features of the building structure. The element arrangement context module then adds the feature map M and the initial features F of the building structure element by element to generate the enhanced feature map F' = F + M, that is, the enhanced features of the building structure.
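A sketch of the fusion step, with the 1×1 convolution ω modeled as a plain channel-mixing matrix with made-up weights (toy sizes and values; not the implementation of this application):

```python
# Toy context features: C x H x W each (values are illustrative).
C, H, W = 2, 2, 2
F     = [[[1.0] * W for _ in range(H)] for _ in range(C)]
S_col = [[[0.5] * W for _ in range(H)] for _ in range(C)]
S_row = [[[0.25] * W for _ in range(H)] for _ in range(C)]

S = S_col + S_row                              # channel concatenation: 2C x H x W
omega = [[0.5] * (2 * C) for _ in range(C)]    # 1x1 conv as a C x 2C weight matrix

# M = omega(S): mix the 2C concatenated channels back down to C channels
M = [[[sum(omega[c][d] * S[d][i][j] for d in range(2 * C))
       for j in range(W)] for i in range(H)] for c in range(C)]
# F' = F + M: element-wise addition with the initial features
F_enh = [[[F[c][i][j] + M[c][i][j] for j in range(W)]
          for i in range(H)] for c in range(C)]
```

A 1×1 convolution acts independently at every spatial position, which is why it can be written as a single channel-mixing matrix here.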
In step 105, a facade element detection process is performed based on the enhanced features of the building structure to obtain position information of the facade element in the building structure.
For example, after the enhanced features of the building structure are obtained, the spatial arrangement regularity of the building structure in rows and columns can be effectively utilized: the detector head performs facade element detection on the enhanced features to obtain the position information of the facade elements in the building structure, which improves the accuracy of building structure detection and, compared with schemes that detect the building structure based on dense building classification results, saves computing resources.
Referring to fig. 3C, fig. 3C is an alternative flowchart of the artificial intelligence based image processing method according to the embodiment of the present application, and fig. 3C shows that step 105 in fig. 3A can be implemented by steps 1051 to 1053: in step 1051, performing central point prediction processing based on the facade element on the enhanced feature of the building structure to obtain central point information of the facade element in the building structure; in step 1052, performing size prediction processing based on the facade element on the enhanced feature of the building structure to obtain size information of the facade element in the building structure; in step 1053, position information for the facade element in the building structure is determined based on the center point information and the size information for the facade element in the building structure.
For example, center point prediction processing based on the facade elements is performed on the enhanced features of the building structure to obtain center point coordinates (i.e., center point information) of each facade element in the building structure, size prediction processing based on the facade elements is performed on the enhanced features of the building structure to obtain width and height dimensions (i.e., size information) of each facade element in the building structure, and vertex coordinates (i.e., position information) of a bounding box of each facade element in the building structure is determined based on the center point information and the size information of each facade element in the building structure.
In some embodiments, determining the position information of the facade elements in the building structure based on the center point information and the size information of the facade elements includes: performing offset prediction processing based on the facade elements on the enhanced features of the building structure to obtain offset information of the facade elements in the building structure; adding the offset information and the center point information of the facade elements to obtain standard center point information of the facade elements in the building structure; and determining the position information of the facade elements in the building structure based on the standard center point information and the size information of the facade elements.
For example, the input image to be detected is downsampled for subsequent processing such as aggregation, fusion, and detection, while the position coordinates of the facade elements are handled at the resolution of the original image, so predicting the center point positions directly from the enhanced feature map F' causes a certain loss of precision. Therefore, offset prediction processing based on the facade elements is performed on the enhanced features of the building structure through a local offset prediction branch to obtain the offset information of the facade elements in the building structure, which consists of two-dimensional offset vectors corresponding to different positions and is used to adjust the center point positions to restore accuracy.
For example, for any center point (x_k, y_k) in the set of center points of the facade elements, the corresponding offset (that is, the offset information) is (Δx_k, Δy_k), and the corresponding adjusted center point coordinate is ν = (x_k + Δx_k, y_k + Δy_k). After the center point position ν is obtained, the geometric expression of the bounding box of the object can be obtained by combining the width and the height of the object, so as to determine the position information of the facade element corresponding to the center point position ν.
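A sketch of this decoding step, under two assumptions the text leaves open: a downsampling stride of 4 between the input image and the feature map, and predicted widths/heights already expressed in input-image pixels.

```python
def decode_box(center, offset, size, stride=4):
    """Apply the local offset to a predicted center, rescale by the assumed
    downsampling stride, and form the bounding box from width/height."""
    (xk, yk), (dx, dy), (w, h) = center, offset, size
    vx = (xk + dx) * stride   # adjusted center, back at input resolution
    vy = (yk + dy) * stride
    return (vx - w / 2, vy - h / 2, vx + w / 2, vy + h / 2)

# toy prediction: feature-map center (10, 8), offset (0.25, -0.5), size 20 x 12
box = decode_box((10, 8), (0.25, -0.5), (20.0, 12.0))
```

The returned tuple is the (x_min, y_min, x_max, y_max) geometric expression of the bounding box around the adjusted center ν.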
In some embodiments, the image processing method is implemented by calling a neural network model. The training process of the neural network model includes: performing facade element prediction processing on an image sample including a building structure through the initialized neural network model to obtain predicted position information of the facade elements in the image sample; constructing a position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample; and updating the parameters of the neural network model based on the position loss function, and taking the updated parameters as the parameters of the trained neural network model.
For example, feature extraction processing is performed on an image sample including a building structure through the initialized neural network model to obtain the initial features of the building structure. A column-space-based aggregation process is performed on the initial features to obtain the column context features of the building structure, a row-space-based aggregation process is performed on the initial features to obtain the row context features of the building structure, a fusion process is performed based on the column context features and the row context features to obtain the enhanced features of the building structure, and facade element detection processing is performed based on the enhanced features to obtain the predicted position information of the facade elements in the image sample. The value of the position loss function (for example, a cross-entropy loss function) of the neural network model is then constructed based on the predicted position information and the position labels of the facade elements in the image sample. It can be determined whether the value of the position loss function exceeds a preset threshold; when it does, an error signal of the neural network model is determined based on the position loss function, the error information is propagated backward through the neural network model, and the model parameters of each layer are updated during the propagation.
Here, back propagation is described. Training sample data is input into the input layer of a neural network model, passes through the hidden layers, and finally reaches the output layer, where a result is output; this is the forward propagation process of the neural network model. Because the output of the model differs from the actual result, the error between the output and the actual value is calculated and propagated backward from the output layer through the hidden layers toward the input layer. During back propagation, the values of the model parameters are adjusted according to the error: a loss function is constructed from the error between the output and the actual value, and its partial derivatives with respect to the model parameters are computed layer by layer to obtain the gradient of the loss function for each layer's parameters. Because the gradient points in the direction in which the error grows, the gradient is negated and summed with each layer's original parameters; the summation result is taken as the updated parameters of that layer, reducing the error attributable to the model parameters. This process is iterated until convergence.
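The update rule described above, stepping against the gradient until convergence, can be illustrated with a minimal sketch. This is not the patent's network; it is a single scalar parameter trained by gradient descent on a squared-error loss, with all values illustrative.

```python
def sgd_step(w, x, y_true, lr=0.1):
    """Forward pass, squared-error loss, and one parameter update."""
    y_pred = w * x                 # forward propagation
    error = y_pred - y_true       # error between output and actual value
    grad = error * x              # dL/dw for L = 0.5 * error**2
    return w - lr * grad          # negate the gradient and sum with the parameter

w = 1.0
for _ in range(100):              # iterate until (near) convergence
    w = sgd_step(w, x=2.0, y_true=6.0)
# w approaches 3.0, where the prediction matches the target
```

Each iteration shrinks the parameter error by a constant factor, mirroring the "continuously iterating the process until convergence" described above.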
In some embodiments, the predicted position information is characterized by predicted center point information, predicted offset information, and predicted size information. Constructing the position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample comprises: constructing a center point loss function of the neural network model based on the predicted center point information and the center point labels of the facade elements in the image sample; constructing an offset loss function based on the predicted offset information and the offset labels of the facade elements in the image sample; constructing a size loss function based on the predicted size information and the size labels of the facade elements in the image sample; and performing weighted summation processing on the center point loss function, the offset loss function, and the size loss function to obtain the position loss function of the neural network model.
For example, center point prediction processing based on the facade elements is performed on the enhanced features of the building structure to obtain the predicted center point information of the facade elements in the building structure; size prediction processing based on the facade elements is performed on the enhanced features to obtain the predicted size information; offset prediction processing based on the facade elements is performed on the enhanced features to obtain the predicted offset information; and the predicted position information of the facade elements in the building structure is determined based on the predicted center point information, the predicted offset information, and the predicted size information.
Wherein the center point loss function of the neural network model is

L_E = −(1/N) ∑_k { (1−Ê_k)^α·log(Ê_k), if E_k = 1; (1−E_k)^β·(Ê_k)^α·log(1−Ê_k), otherwise }

wherein E_k represents the center point information label (i.e., the real center point information), Ê_k represents the predicted center point information, and N represents the total number of samples. The offset loss function for the facade elements in the building structure is

L_O = (1/N) ∑_k |O_k − Ô_k|

wherein O_k represents the offset information label (i.e., the real offset information) and Ô_k represents the predicted offset information. The size loss function for the facade elements in the building structure is

L_U = (1/N) ∑_k |U_k − Û_k|

wherein U_k represents the size information label (i.e., the real size information) and Û_k represents the predicted size information. Weighted summation processing is performed on the center point loss function, the offset loss function, and the size loss function to obtain the position loss function of the neural network model, so that the center point information, the offset information, and the size information of the facade elements are fully learned, improving the accuracy of detecting the position information of the facade elements.
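A small sketch of the weighted summation described above, assuming L1 losses for the offset and size terms and a precomputed center point loss; the weights lam and mu are illustrative (the later formula (16) uses 1 and 0.1).

```python
import numpy as np

def l1_loss(truth, pred):
    """Mean absolute difference between labels and predictions."""
    return np.abs(truth - pred).mean()

def position_loss(l_center, o_true, o_pred, u_true, u_pred, lam=1.0, mu=0.1):
    """Weighted sum of center point, offset, and size losses."""
    l_offset = l1_loss(o_true, o_pred)   # offset loss L_O
    l_size = l1_loss(u_true, u_pred)     # size loss L_U
    return l_center + lam * l_offset + mu * l_size

o_true = np.array([0.25, 0.75]); o_pred = np.array([0.25, 0.75])
u_true = np.array([10.0, 20.0]); u_pred = np.array([10.0, 21.0])
total = position_loss(0.5, o_true, o_pred, u_true, u_pred)
# total = 0.5 (center) + 1.0 * 0.0 (offset) + 0.1 * 0.5 (size) = 0.55
```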
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application can be applied to application scenarios involving image detection of building structures. For example, in an application scenario of virtual building modeling in a game, an image containing a building structure is subjected to aggregation processing to obtain the column context features and row context features of the building structure, and facade element detection processing is performed based on these features, so that the spatial arrangement regularity of the building structure along rows and columns in the image is effectively utilized to obtain accurate position information of the facade elements; virtual building modeling in the game is then performed based on this position information to construct a virtual building structure similar to the one in the image, improving the user's immersion in the game. In an application scenario of urban simulation modeling, an image containing a building structure is likewise aggregated to obtain column context features and row context features, facade element detection processing based on these features yields accurate facade element positions, and simulation modeling based on these positions reproduces the building structure in the image to construct an urban simulation virtual model. In an application scenario of safety performance analysis, facade element detection processing is performed based on the column context features and row context features of the building structure in the image to obtain accurate facade element positions, and safety performance analysis based on these positions enables accurate, real-time safety early warning. The following description takes the application scenario of virtual building modeling in a game as an example:
in the related art, as shown in fig. 6, in the process of detecting building structures in an image, a semantic segmentation network is used to generate dense pixel-level classification results, such as dense balcony regions 601, such an image processing method cannot well express mutually overlapped or nested facade element regions, a dense pixel set cannot directly express independent facade elements and does not directly contain clear geometric descriptions of the facade element regions, and facade segmentation also often generates irregular segmentation regions, which brings additional complexity to the programming modeling process of the building facade.
To solve the above problems, the embodiment of the present application provides a spatial context aggregation method for analyzing a building facade based on a target detection framework. Context information in both the row and column directions is aggregated, so that the spatial arrangement regularity and appearance similarity of building facade elements along rows and columns in an image are effectively utilized, and the layout rules of the building facade are embedded into a deep convolutional neural network to guide the analysis process.
As shown in fig. 7, the embodiment of the present application analyzes various typical facade elements based on a target detection framework and generates, as the element analysis result, regular element regions expressed in a compact geometric form, for example the balcony bounding box 701, which better supports the procedural modeling of a building facade. In addition, a context aggregation method is designed in consideration of the arrangement rules of facade elements: the spatial contexts along rows and columns are aggregated to guide the building facade analysis process, improving the robustness and accuracy of facade analysis.
It should be noted that, in the embodiment of the present application, facade element object regions with semantics are obtained by analyzing a building facade image, which is applicable to rapid modeling and editing of building facades in urban street scenes. To illustrate the application process, a facade model modeling and editing plug-in was developed for three-dimensional creation software based on the facade analysis method of the embodiment of the present application. The plug-in provides the functions related to building a facade model, with the following workflow: building facade analysis, storage and loading of analysis results, construction of the building facade polygon model, and editing of the building facade model. The different stages of this application flow are described below.
The facade analysis result obtained in the embodiment of the application is a set of semantically labeled parameterized rectangular regions. Each facade element object corresponds to a structure containing 5 parameters: the first parameter is an unsigned integer representing the facade element category, and the remaining 4 parameters form a group representing the floating-point coordinates of the upper-left and lower-right corners of the rectangle. All facade analysis results can be stored as formatted data such as JSON (JavaScript Object Notation), facilitating subsequent loading and parsing.
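The 5-parameter structure described above can be sketched as follows, serializing each facade element object to JSON for later loading. The field names and category codes are illustrative assumptions, not part of the original plug-in format.

```python
import json

# Each element: unsigned-integer category + top-left / bottom-right corners.
elements = [
    {"category": 2, "bbox": [12.5, 30.0, 48.5, 90.0]},   # e.g. a window
    {"category": 5, "bbox": [10.0, 95.0, 60.0, 140.0]},  # e.g. a balcony
]
payload = json.dumps(elements)       # storage as formatted data
restored = json.loads(payload)       # subsequent loading and parsing
```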
As shown in fig. 8, after the facade analysis result is obtained, the load button 801 is clicked to load it; as shown in fig. 9, the facade analysis result is rendered onto a two-dimensional surface 901; as shown in fig. 10, a building facade model 1001 expressed in polygons can be obtained by extruding the regions of the different facade element objects; as shown in fig. 11, the geometric parameters 1101 of a facade element object can be adjusted through the plug-in's interface controls to edit the model and adjust its representation; after modeling is completed, a modeling sample as shown in fig. 12 is presented, such as the partial building model 1201. Applying the building model 1201 in a game makes the virtual building models in the game increasingly realistic and improves the user's immersion in the game.
It should be noted that building facade elements exhibit strong regularity in layout. As shown in fig. 13, facade element objects of the same kind are highly aligned in both the horizontal and vertical directions in terms of spatial position; for example, windows are highly aligned both horizontally and vertically. A facade element object located in a certain row or column has a strong correlation with the other objects in the same row or column, and the correlations in these two directions provide extremely valuable context for facade element detection.
In summary, the embodiment of the present application utilizes the regularity of the building facade layout to assist the detection of building facade elements, and provides an Element-Arrangement Context Module (EACM) to capture the element arrangement spatial context in the horizontal and vertical directions, embedding the arrangement regularity of facade elements into a deep convolutional neural network to form a building facade analysis network.
As shown in fig. 14, the building facade analysis network analyzes a building facade image and outputs a facade element analysis result expressed by parameterized bounding boxes. The building facade analysis network comprises three parts: a feature extraction network, an element arrangement context module, and a detector head. The feature extraction network extracts depth features from the building facade image, the element arrangement context module captures and aggregates the spatial row context and column context, and the detector head predicts the bounding boxes of the facade elements using the extracted features. In the inference stage, the input image passes through these three parts in sequence to obtain the analysis result. The input and output of each part of the network, from the input of the image to the output of the analysis result, are described below.
It should be noted that the input image first passes through a stacked Hourglass network (i.e., the feature extraction network) to extract depth features. The Hourglass network downsamples the input image by a factor of four and outputs a feature map F (i.e., the depth features) of size H×W. The feature map F is then fed into the element arrangement context module, which can calculate the correlation between a certain position on the facade and all other positions in the same row or column: the column context branch of the module aggregates context information along columns, and the row context branch aggregates context information along rows. The feature map F is fed into the two branches, which output the feature maps S_col and S_row, respectively; S_col and S_row have the same dimensions. To assist the detection process with the element arrangement context along rows and columns, these two feature maps need to be effectively fused. During fusion, S_col and S_row are first concatenated to obtain a feature map S. Then S is processed by a convolutional layer with a 1×1 convolution kernel for feature adaptation, yielding a feature map M matching the size of the feature map F. The feature map M contains rich row-column spatial context information and is used to enhance the depth features; the enhancement is implemented as element-wise addition. The enhanced features F′ are fed into the detector head to predict the parameterized bounding boxes and obtain the final analysis result. The detector head in fig. 14 may be any detection method capable of predicting bounding box parameters from depth features, which shows that the element arrangement context module in the embodiment of the present application is plug-and-play and can be used in newer detection frameworks as detection methods evolve. The sizes of the output feature maps of the various parts of the network involved in the above process are shown in table 1.
TABLE 1 Sizes of the output feature maps of the various parts of the facade analysis network

Feature map    Feature size (channels × height × width)
F              C×H×W
S_col          C×H×W
S_row          C×H×W
S              2C×H×W
M              C×H×W
F′             C×H×W
The element arrangement context module comprises two parallel branches, a column context branch and a row context branch, which aggregate spatial context and guide the network to attend to facade elements arranged in alignment along columns and rows. It should be noted that modeling long-range dependencies in an image with a self-attention mechanism can effectively capture non-local information. The embodiment of the application implements the two context branches in the element arrangement context module based on the self-attention mechanism, and optimizes the aggregation of non-local context using the facade element arrangement rules. The self-attention mechanism is described first, followed by the technical details of the two context branches.
The self-attention mechanism calculates a correlation matrix over the whole feature map and aggregates non-local context using its values as weights. As shown in fig. 15, X_p denotes the vector at position p; taking it as input, a correlation matrix W is obtained, and the calculation process is shown in formula (1):

W_{p,i}=f_q(X_p)^T f_k(X_i)  (1)

wherein f_q and f_k respectively represent the query transformation function and the key transformation function.
After the correlation matrix W is obtained, non-local context aggregation is performed with its values as weights; the aggregation process is shown in formula (2):

Z_p=∑_i W_{p,i} f_v(X_i)  (2)

wherein f_v represents the value transformation function and Z_p represents the output vector corresponding to position p.
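The computation of formulas (1) and (2) can be sketched with numpy. This is a hedged illustration, not the patent's implementation: f_q, f_k, f_v are stand-in linear transforms, and a softmax normalization of the correlation matrix is assumed (the patent text does not specify the normalization).

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 6, 4                               # positions, channels
X = rng.standard_normal((N, C))           # one feature vector X_p per position
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))

scores = (X @ Wq) @ (X @ Wk).T            # W[p, i] = f_q(X_p)^T f_k(X_i), eq (1)
W = np.exp(scores - scores.max(axis=1, keepdims=True))
W /= W.sum(axis=1, keepdims=True)         # assumed softmax: each row sums to 1
Z = W @ (X @ Wv)                          # Z_p = sum_i W[p, i] f_v(X_i), eq (2)
```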
As shown in fig. 16, the element arrangement context module, following the principle of self-attention, applies three convolutional layers with a 1×1 convolution kernel to the feature map F in parallel, obtaining a query feature map Q, a key feature map K, and a value feature map V, each of size C×H×W. Both branches of the element arrangement context module use Q, K, and V to generate context features, i.e., the convolutional layers of the two branches share weights. For a position p = (i, j), the column context branch computes the correlation between position p and all positions in the j-th column, while the row context branch computes the correlation between position p and all positions in the i-th row.
For the query feature vector Q_p ∈ R^C at position p on the query feature map Q, the two branches of the element arrangement context module extract feature vectors from the feature map K along the i-th row and the j-th column respectively, forming the two vector sets shown in formula (3):

X_p={K_{i,1},K_{i,2},…,K_{i,W}}, Y_p={K_{1,j},K_{2,j},…,K_{H,j}}  (3)

wherein the cardinalities of the sets X_p and Y_p are W and H, respectively.
In the column context branch, the correlation between position p and all the positions in the same column is first calculated to form the vector ^cA_p of the column attention map A_col, as shown in formula (4):

^cA_{p,k}=exp(Q_p^T Y_{p,k}) / ∑_{t=1}^{H} exp(Q_p^T Y_{p,t})  (4)

wherein ^cA_{p,k} is the k-th element of the vector ^cA_p, and Y_{p,k} is the k-th feature vector of the set Y_p.
For the row context branch, similarly to the calculation of the column attention map A_col, the element arrangement context module calculates the vector of the row attention map A_row at position p, as shown in formula (5):

^rA_{p,k}=exp(Q_p^T X_{p,k}) / ∑_{t=1}^{W} exp(Q_p^T X_{p,t})  (5)

wherein ^rA_{p,k} is the k-th element of the vector ^rA_p, and X_{p,k} is the k-th feature vector of the set X_p.
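A sketch of formulas (4) and (5) for one position p = (i, j): the query vector at p is correlated with the key vectors of column j and of row i, then softmax-normalized. All tensor contents are random placeholders; only the shapes and the aggregation pattern follow the text.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, Wd = 4, 5, 7                        # channels, height, width
Q = rng.standard_normal((C, H, Wd))       # query feature map
K = rng.standard_normal((C, H, Wd))       # key feature map

i, j = 2, 3                               # position p = (i, j)
q_p = Q[:, i, j]                          # query vector Q_p

col_scores = K[:, :, j].T @ q_p           # Q_p^T Y_{p,k} for k = 1..H
col_attn = np.exp(col_scores - col_scores.max())
col_attn /= col_attn.sum()                # column attention vector, eq (4)

row_scores = K[:, i, :].T @ q_p           # Q_p^T X_{p,k} for k = 1..W
row_attn = np.exp(row_scores - row_scores.max())
row_attn /= row_attn.sum()                # row attention vector, eq (5)
```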
The above process yields the row attention map A_row and the column attention map A_col, whose values measure the correlations within rows and columns. After the attention calculation is completed, the element arrangement context module extracts values from the value feature map V along the spatial dimension in the corresponding rows and columns for the subsequent context aggregation process. For a position p = (i, j), the value extraction process yields two sets: the row features Λ_p and the column features Ω_p of the value feature map V. Both sets consist of C vectors, and are expressed as shown in formula (6):

Λ_p={(V_{c,i,1},…,V_{c,i,W}) | c=1,…,C}, Ω_p={(V_{c,1,j},…,V_{c,H,j}) | c=1,…,C}  (6)

wherein V_{c,i,j} represents the value at (i, j) on the c-th channel of the value feature map V.
The correlation vectors ^cA_p and ^rA_p are respectively used as the weights of the vectors in Ω_p and Λ_p for context aggregation at position p, and the aggregation yields the vectors S^col_p ∈ R^C and S^row_p ∈ R^C, as shown in formula (7):

S^col_{c,p}=∑_{k=1}^{H} ^cA_{p,k} V_{c,k,j}, S^row_{c,p}=∑_{k=1}^{W} ^rA_{p,k} V_{c,i,k}  (7)

Performing the spatial context aggregation process given by formula (7) for every position generates the row context features S_row and the column context features S_col, which are fused together to enhance the depth features F. S_col and S_row are first concatenated to obtain S, which is processed by a convolutional layer to generate a feature map M rich in context information. Subsequently, the element arrangement context module combines the feature maps M and F by element-wise addition, generating the enhanced feature map F′, as shown in formula (8):
F′=ω(S)+F (8)
where ω denotes a transform function, implemented by a convolution layer with a 1 × 1 convolution kernel.
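The aggregation and fusion of formulas (6)-(8) at one position can be sketched as follows. The attention weights are uniform placeholders, and the 1×1 convolution ω is replaced by an illustrative channel-mixing matrix; only the shapes and the weighted-sum structure follow the text.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, Wd = 3, 4, 5
V = rng.standard_normal((C, H, Wd))       # value feature map
F = rng.standard_normal((C, H, Wd))       # depth features
i, j = 1, 2                               # position p = (i, j)
col_attn = np.full(H, 1.0 / H)            # placeholder for ^cA_p
row_attn = np.full(Wd, 1.0 / Wd)          # placeholder for ^rA_p

s_col_p = V[:, :, j] @ col_attn           # sum_k ^cA_{p,k} V_{c,k,j}, eq (7)
s_row_p = V[:, i, :] @ row_attn           # sum_k ^rA_{p,k} V_{c,i,k}, eq (7)
s_p = np.concatenate([s_col_p, s_row_p])  # concatenated feature at p, size 2C
omega = rng.standard_normal((C, 2 * C))   # stand-in for the 1x1 conv omega
f_enhanced_p = omega @ s_p + F[:, i, j]   # F' = omega(S) + F at p, eq (8)
```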
After the local features F are enhanced using the element arrangement context module, the resulting enhanced feature map F′ is fed into the detector head to predict the parameterized bounding boxes of the facade element objects. Facade elements exhibit a highly regular appearance, so a symmetric facade object region in a facade image can be implicitly encoded as a center point position together with corresponding width and height parameters. Based on this, the embodiment of the present application uses a single-stage detection method based on center point prediction as the detector head.
As shown in the detector head of fig. 17, to predict the geometric parameters required to form the bounding box, the detector head uses three prediction branches: a center point heat map prediction branch, a local offset prediction branch, and a bounding box size prediction branch. Each branch is implemented with convolutional layers, first processing the input enhanced feature map F′ with a 3×3 convolutional layer and then applying one 1×1 convolutional layer to obtain the branch's prediction result.
For the center point heat map prediction branch, the center point heat map is a multi-channel prediction result: the branch produces Ê ∈ R^{C′×H×W}, which locates the center points of the facade elements, wherein C′ represents the number of facade element classes, and the value Ê_{c,i,j} at (i, j) on the c-th channel of the heat map represents the probability that an object of facade element type c is at that location. Because the feature extraction network downsamples the input image by a factor of four while the coordinate annotation is performed at the input image resolution, directly using F′ to predict the center point positions incurs a certain loss of precision; to address this, the local offset prediction branch generates a two-channel result Ô ∈ R^{2×H×W}, representing the two-dimensional offset vectors corresponding to different positions, used to adjust the center point positions and restore precision. The bounding box size prediction branch produces a two-channel result Û ∈ R^{2×H×W}, whose two channels correspond respectively to the width and height of the facade element region.
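The shapes of the three branch outputs can be sketched at a purely illustrative level: each branch maps the channels of F′ to its own output depth. The 1×1 convolution is emulated by a channel matmul, the 3×3 convolution is omitted for brevity, and all weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W, C_cls = 8, 16, 16, 4             # C_cls: number of facade element classes
F_enh = rng.standard_normal((C, H, W))    # enhanced feature map F'

def conv1x1(x, out_ch):
    """1x1-convolution-style channel mixing: (C,H,W) -> (out_ch,H,W)."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.tensordot(w, x, axes=1)

heatmap = 1.0 / (1.0 + np.exp(-conv1x1(F_enh, C_cls)))  # E-hat: per-class center probabilities
offset = conv1x1(F_enh, 2)                # O-hat: two-channel local offsets
size = conv1x1(F_enh, 2)                  # U-hat: two-channel width/height
```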
After Ê, Ô, and Û are obtained from the three branches, they need to be combined to produce the bounding boxes of the facade elements as the final facade parsing result. As shown in fig. 18, the coordinates of the center points of the facade elements can be obtained by finding the maxima of the 8-neighborhoods on the heat map Ê. In implementation, peak extraction can be achieved with a 3×3 max pooling layer. This step produces a set of center points P={(x_k, y_k)}, i.e., the set of center points of all facade elements. For a point (x_k, y_k) in the set, its corresponding local offset is (Δx_k, Δy_k).
Then, the calculation process of the corresponding center point coordinates is as shown in equation (9):
ν=(xk+Δxk,yk+Δyk) (9)
After the center point position ν is obtained, the geometric expression of the facade element's bounding box can be obtained by combining it with the width and height of the facade element. The bounding box of a facade element is represented by its two endpoints, top-left and bottom-right, as shown in formula (10):

bb_k=(ν_x−w_k/2, ν_y−h_k/2, ν_x+w_k/2, ν_y+h_k/2)  (10)

wherein (w_k, h_k)=Û_{x_k,y_k} represents the vector of Û located at (x_k, y_k).
During network training, supervision is applied to the Ê, Ô, and Û generated by the detector head. Since the backbone network of the feature extraction part downsamples the input image, the original annotation data must undergo some preprocessing to obtain the ground-truth data used for loss calculation, so that supervision is performed at the same resolution as the network output.
For example, let the downsampling factor of the backbone network be r. For a position p = (x, y) on the input image, its corresponding position at the network output resolution is shown in formula (11):

p̃=⌊p/r⌋  (11)

wherein ⌊·⌋ represents the floor function.
For supervising the center point prediction, the supervision data should be a heat map expressing the center point locations. To generate this heat map, the values near each annotated center point location are set according to a two-dimensional Gaussian distribution, yielding the heat map E. The supervision data O (i.e., ground truth) of the local offset used for precision recovery is calculated by formula (12):

O=p/r−⌊p/r⌋  (12)
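Formulas (11) and (12) amount to a floor division and its fractional remainder; a minimal sketch with an illustrative center point and the downsampling factor r = 4 from the text:

```python
import numpy as np

def make_targets(p, r=4):
    """Map an input-resolution point to output resolution (eq 11) and
    keep the fractional remainder as the offset target (eq 12)."""
    p = np.asarray(p, dtype=float)
    p_low = np.floor(p / r)               # equation (11)
    offset = p / r - p_low                # equation (12)
    return p_low, offset

p_low, offset = make_targets((130, 57), r=4)
# 130/4 = 32.5 -> position 32, offset 0.5; 57/4 = 14.25 -> position 14, offset 0.25
```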
for the midpoint prediction branch, the penalty function is used as shown in equation (13):
Figure BDA0003299484260000294
where N represents the number of facade elements and α and β represent hyper-parameters for controlling the contribution of different terms to the loss, which may be set to 2 and 4, respectively, as required.
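A hedged numpy sketch of the focal-style center point loss in formula (13), with a tiny 2×2 single-class heat map as illustrative data (an eps term is added for numerical safety, which the formula itself does not include):

```python
import numpy as np

def center_loss(E, E_hat, n, alpha=2, beta=4, eps=1e-12):
    """Formula (13): focal-style loss over a Gaussian-smoothed heat map."""
    pos = E == 1
    loss_pos = ((1 - E_hat) ** alpha * np.log(E_hat + eps))[pos].sum()
    loss_neg = ((1 - E) ** beta * E_hat ** alpha *
                np.log(1 - E_hat + eps))[~pos].sum()
    return -(loss_pos + loss_neg) / n

E = np.array([[1.0, 0.6], [0.0, 0.0]])      # ground truth, one peak at (0, 0)
E_hat = np.array([[0.9, 0.5], [0.1, 0.2]])  # predicted heat map
loss = center_loss(E, E_hat, n=1)
```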
The standard L1 distance is used as the loss function for both the local offset prediction and the width-height prediction. The loss function corresponding to the local offset is shown in formula (14):

L_o=(1/N) ∑_k |Ô_k−O_k|  (14)

wherein O_k represents the ground-truth local offset corresponding to the k-th center point, and Ô_k represents the predicted local offset corresponding to the k-th center point.
The loss function corresponding to the width-height prediction is shown in formula (15):

L_s=(1/N) ∑_k |Û_k−U_k|  (15)

wherein U_k represents the ground-truth width-height dimensions corresponding to the k-th center point, and Û_k represents the predicted width-height dimensions corresponding to the k-th center point.
The final loss function is shown in equation (16):
L=Lp+λLo+μLs (16)
wherein λ and μ represent scale factors controlling the weights of the respective terms, and may be set to 1 and 0.1, respectively, as required in implementation.
To train the facade analysis network provided by the embodiment of the application, building facade images with bounding box labels are used. The building facade images may be obtained by collecting building images on the internet or from public building facade datasets, or by photographing actual buildings. The bounding box parameter labels of the facade element objects can be obtained with an annotation tool (such as labelme). The data enhancement methods used in the training phase include image-level random flipping, random image scaling with a scale factor in a certain interval (for example [0.6, 1.3]), and image color jittering. During training, the original image needs to be randomly cropped or padded to a certain size before being input into the network to match the network's input resolution; the target size can be adjusted according to the training data.
In summary, the element arrangement context module (EACM) in the embodiment of the present application, together with the necessary detector parts, forms the facade analysis network of the embodiment. As shown in table 2, quantitative and qualitative comparative analyses were performed on the public ECP dataset against facade parsing methods (two versions of DeepFacade). In addition, to verify the effectiveness of the EACM of the present embodiment, performance improvements were also demonstrated on the challenging CMP building facade dataset, with a quantitative comparison against the context aggregation method RCCA.
Table 2 results of quantitative evaluation on ECP data set
[The contents of Table 2 are provided as an image in the original publication and are not reproduced here.]
As shown in table 2, the quantitative results of the method of the present embodiment comprehensively surpass the first version of the DeepFacade method across the different metrics. Compared with the second version of DeepFacade, the method of the embodiment achieves comparable average pixel accuracy and substantially exceeds the related method on the intersection-over-union (IoU) evaluation metric.
As shown in fig. 19, in the method provided by the embodiment of the present application, compared with the qualitative analysis result of the DeepFacade of the two versions visually, the facade analysis network provided by the embodiment of the present application generates a more regular and accurate building facade analysis result, and the generated parameterized analysis result reasonably expresses mutually overlapped or nested facade element regions. Although the area where the window and the balcony are overlapped has complex textures, the facade analysis network provided by the embodiment of the application still predicts the complete window and balcony objects, which is very important for building a facade model.
As shown in table 3, for the quantitative comparison of the EACM context aggregation module and the RCCA method on the CMP dataset, in the experiment the element arrangement context module in fig. 14 is replaced with RCCA modules with recurrence numbers of 1 and 2, yielding the results corresponding to R = 1 and R = 2 in table 3, respectively.
Table 3 quantitative evaluation of different contextual aggregation methods on CMP datasets
Method          AP(%)   AP50(%)   AP75(%)
Baseline        39.7    67.9      41.0
+RCCA (R=1)     39.7    68.4      40.7
+RCCA (R=2)     39.8    68.3      41.2
+EACM           40.2    68.4      42.3
As can be seen from table 3, compared with the context aggregation method in the related art, the method provided in the embodiment of the present application more effectively improves the overall performance of building facade parsing.
As shown by the precision-recall curves under different IoU thresholds in fig. 20, with the threshold set from 0.5 to 0.9 in steps of 0.1, the method provided by the embodiment of the present application exhibits a significant performance improvement at every threshold.
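The IoU matching underlying these curves can be made concrete with a small sketch; the corner-coordinate box format and the helper name are illustrative choices, not taken from the document:

```python
def iou(a, b):
    # Intersection over union of two axis-aligned boxes given as
    # (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Thresholds from 0.5 to 0.9 in steps of 0.1, as in the evaluation above.
thresholds = [round(0.5 + 0.1 * i, 1) for i in range(5)]

# A prediction counts as a match at a given threshold when its IoU with a
# ground-truth box reaches that threshold.
match = iou((0, 0, 10, 10), (2, 0, 12, 10)) >= 0.5
```

At stricter thresholds fewer predictions match, which is why the precision-recall curves separate most clearly toward 0.9.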
As shown in table 4, the parsing results of the method provided in the embodiment of the present application on different facade element types further demonstrate the performance improvement that the EACM provides for each element type. The flip test is a test-time data augmentation method in which the parsing results of the original image and of the horizontally flipped image are merged to form the final output.
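The flip test just described can be sketched as follows; the detector interface and the simple pooling of the two box sets are placeholders (a real pipeline would typically merge them with non-maximum suppression or score averaging):

```python
import numpy as np

def flip_boxes(boxes, image_width):
    # Map boxes (x1, y1, x2, y2) detected on a horizontally flipped image
    # back into the coordinate frame of the original image.
    return [(image_width - x2, y1, image_width - x1, y2)
            for (x1, y1, x2, y2) in boxes]

def flip_test(detect, image):
    # Run the detector on the original and on the horizontally flipped
    # image, then pool both box sets as the final output.
    h, w = image.shape[:2]
    original = detect(image)
    flipped = detect(image[:, ::-1])
    return original + flip_boxes(flipped, w)

# Toy detector that always reports one fixed box, for illustration only.
boxes = flip_test(lambda img: [(10.0, 5.0, 30.0, 15.0)], np.zeros((50, 100)))
```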
Table 4 parsing results of the method provided in the embodiment of the present application on different types of facade elements
(The body of Table 4 is rendered as an image in the original publication.)
As can be seen from table 4, after the EACM provided in the embodiment of the present application is used, both the overall performance of the parser and the parsing results for the different types of facade elements are significantly improved; in particular, for facade element types with a regular spatial arrangement, the average precision improves substantially.
In summary, the feature extraction network in the embodiment of the present application is mainly used for extracting image features from building facade images, and the element arrangement context module is used for capturing and aggregating spatial row context and column context from those image features. The image feature extraction part may therefore use any suitable convolutional network structure, such as AlexNet, ResNet101, ResNet152, or UNet, chosen according to the hardware configuration and computation speed requirements. The enhanced feature map is fed into a detector head for predicting the parameterized bounding boxes and obtaining the final parsing result; the detector head can be any detection method capable of predicting bounding box parameters from deep features. In other words, the element arrangement context module provided by the embodiment of the present application is plug-and-play: it can be used in new detection frameworks as detection methods evolve, and thus has good applicability.
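The plug-and-play composition described here can be sketched as a three-stage pipeline; every function body below is a stand-in (a real system would use a trained backbone such as ResNet101, the actual EACM, and a trained detector head):

```python
import numpy as np

def backbone(image):
    # Stand-in for a convolutional feature extractor (e.g. ResNet101):
    # a fixed random projection to a (C, H/4, W/4) feature map.
    rng = np.random.default_rng(0)
    c, h, w = 8, image.shape[0] // 4, image.shape[1] // 4
    return rng.standard_normal((c, h, w))

def eacm(features):
    # Stand-in for the element arrangement context module; the real module
    # aggregates row and column context but preserves the feature shape,
    # which is what makes it insertable between any backbone and head.
    return features

def detector_head(features):
    # Stand-in for any head that predicts parameterized boxes from deep
    # features; here it emits one dummy (cx, cy, w, h) box.
    _, h, w = features.shape
    return [(w / 2.0, h / 2.0, 1.0, 1.0)]

def parse_facade(image):
    feats = backbone(image)
    enhanced = eacm(feats)
    assert enhanced.shape == feats.shape  # EACM keeps the shape unchanged
    return detector_head(enhanced)

boxes = parse_facade(np.zeros((64, 64)))
```

Because the EACM maps a feature tensor to a same-shaped enhanced tensor, swapping the backbone or the head leaves the module untouched.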
The artificial intelligence based image processing method provided by the embodiment of the present application has been described in conjunction with exemplary applications and implementations of the electronic device provided by the embodiment of the present application. In practical applications, each functional module in the artificial intelligence based image processing apparatus may be cooperatively implemented by hardware resources of an electronic device (such as a terminal, a server, or a server cluster), such as computing resources of a processor, communication resources (for example, used to support communication via optical cables, cellular networks, and other modes), and a memory. Fig. 2 shows an artificial intelligence based image processing apparatus 555 stored in a memory 550, which may be software in the form of programs and plug-ins, for example, software modules designed in a programming language such as C/C++ or Java, application software designed in such a language, dedicated software modules in a large software system, application program interfaces, plug-ins, cloud services, and other implementations, which are exemplified below.
The artificial intelligence based image processing apparatus 555 includes a series of modules, including a feature extraction module 5551, a first aggregation module 5552, a second aggregation module 5553, a fusion module 5554, a detection module 5555, and a training module 5556. The following continues to describe the scheme for implementing image processing by cooperation of the modules in the artificial intelligence based image processing apparatus 555 according to the embodiment of the present application.
The feature extraction module 5551 is configured to perform feature extraction processing on an image including a building structure to obtain an initial feature of the building structure; a first aggregation module 5552, configured to perform a column space-based aggregation process on the initial features of the building structure to obtain column context features of the building structure; the second aggregation module 5553 is configured to perform aggregation processing based on a row space on the initial feature of the building structure to obtain row context features of the building structure; a fusion module 5554, configured to perform fusion processing based on the column context features and the row context features of the building structure, to obtain enhanced features of the building structure; a detection module 5555, configured to perform facade element detection processing based on the enhanced features of the building structure, so as to obtain position information of a facade element in the building structure.
In some embodiments, the first aggregation module 5552 is further configured to perform column attention processing on the initial features of the building structure to obtain a column attention map of the building structure; and perform context aggregation processing based on the column attention map of the building structure to obtain column context features of the building structure.
In some embodiments, the first aggregation module 5552 is further configured to perform a value-based feature mapping process on the initial feature of the building structure to obtain a value feature map of the building structure; performing column feature extraction processing on the value feature map of the building structure to obtain column features in the value feature map; and weighting the column features in the value feature map based on the column attention map of the building structure to obtain the column context features of the building structure.
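The value mapping, column extraction, and weighting steps can be sketched as follows; the learned 1x1 convolution is stood in for by fixed random channel-mixing weights, and the attention map is a uniform placeholder (both are illustrative assumptions, not the patent's trained parameters):

```python
import numpy as np

C, H, W = 4, 5, 6
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, H, W))  # initial feature map

# Value feature map: a learned 1x1 convolution in the real module,
# approximated here by a fixed random channel mixing.
Wv = rng.standard_normal((C, C))
value = np.einsum('dc,chw->dhw', Wv, feat)

# Column attention map: for each position (h, w), a weight over the H
# entries of its own column (uniform here, standing in for learned attention).
attn = np.full((H, W, H), 1.0 / H)

# Column context: at each position, the weighted sum of the value vectors
# taken down that position's column.
col_context = np.einsum('hwk,ckw->chw', attn, value)
```

With uniform weights each position simply receives its column's mean value vector; a learned attention map would instead emphasize the rows most relevant to that position.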
In some embodiments, the first aggregation module 5552 is further configured to perform a mapping process based on a query feature on the initial feature of the building structure, so as to obtain a query feature map of the building structure; mapping processing based on key features is carried out on the initial features of the building structure to obtain a key feature map of the building structure; and performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure.
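The query/key mapping and column correlation described here can be sketched as below; the 1x1 convolutions are again stood in for by fixed random channel mixings, and the softmax normalization is an assumption in the spirit of standard attention mechanisms:

```python
import numpy as np

C, H, W = 4, 5, 6
rng = np.random.default_rng(1)
feat = rng.standard_normal((C, H, W))  # initial feature map

# Query and key feature maps: learned 1x1 convolutions in the real module,
# approximated here by fixed random channel mixings.
Wq = rng.standard_normal((C, C))
Wk = rng.standard_normal((C, C))
query = np.einsum('dc,chw->dhw', Wq, feat)
key = np.einsum('dc,chw->dhw', Wk, feat)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Column correlation: each position's query vector is compared with every
# key vector in its own column, then normalized into attention weights.
scores = np.einsum('chw,ckw->hwk', query, key)  # shape (H, W, H)
col_attn = softmax(scores, axis=-1)
```

Restricting the correlation to a position's own column is what makes the cost linear in H rather than quadratic in H*W, matching the column-wise aggregation the module performs.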
In some embodiments, the first aggregation module 5552 is further configured to perform a column feature extraction process on the key feature map of the building structure, so as to obtain column features of the key feature map; and carrying out correlation processing on the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure.
In some embodiments, the first aggregation module 5552 is further configured to perform the following for any one of a plurality of locations in the query feature map: determining a query feature vector for the location based on a query feature map of the building structure; determining a column feature vector for the location based on column features of the key feature map; determining an attention weight for the location based on the query feature vector for the location and the column feature vector for the location; and combining the attention weights corresponding to the plurality of locations to obtain the column attention map of the building structure.
In some embodiments, the second aggregation module 5553 is further configured to perform row attention processing on the initial features of the building structure to obtain a row attention map of the building structure; and perform context aggregation processing based on the row attention map of the building structure to obtain the row context features of the building structure.
In some embodiments, the fusion module 5554 is further configured to perform concatenation processing on the column context features and the row context features of the building structure, to obtain context features of the building structure; perform mapping processing on the context features of the building structure to obtain mapping features of the building structure; and add the mapping features of the building structure and the initial features of the building structure to obtain the enhanced features of the building structure.
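The concatenate-map-add fusion can be sketched as follows; the mapping back to C channels is a learned 1x1 convolution in the real module, stood in for here by fixed random weights (an illustrative assumption):

```python
import numpy as np

C, H, W = 4, 5, 6
rng = np.random.default_rng(2)
initial = rng.standard_normal((C, H, W))   # initial features
col_ctx = rng.standard_normal((C, H, W))   # column context features
row_ctx = rng.standard_normal((C, H, W))   # row context features

# Concatenate the two context maps along the channel axis.
context = np.concatenate([col_ctx, row_ctx], axis=0)   # (2C, H, W)

# Map back to C channels (learned 1x1 convolution in the real module).
Wm = rng.standard_normal((C, 2 * C))
mapped = np.einsum('dc,chw->dhw', Wm, context)         # (C, H, W)

# Residual addition yields the enhanced feature map.
enhanced = mapped + initial
```

The residual form means the module can never lose the initial features: at worst the mapping contributes nothing, and at best it injects row/column context on top of them.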
In some embodiments, the detecting module 5555 is further configured to perform center point prediction processing based on a facade element on the enhanced feature of the building structure, so as to obtain center point information of the facade element in the building structure; carrying out size prediction processing based on a facade element on the enhanced feature of the building structure to obtain size information of the facade element in the building structure; and determining the position information of the facade elements in the building structure based on the central point information of the facade elements in the building structure and the size information.
In some embodiments, the detection module 5555 is further configured to perform an offset prediction process based on a facade element on the enhanced feature of the building structure, so as to obtain offset information of the facade element in the building structure; adding the offset information of the facade elements in the building structure and the central point information to obtain standard central point information of the facade elements in the building structure; and determining the position information of the facade elements in the building structure based on the standard central point information of the facade elements in the building structure and the size information.
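Decoding a box from the predicted center point, offset, and size can be sketched as below; the stride value and the exact decoding convention are hypothetical, in the spirit of CenterNet-style detectors, and may differ from the patent's parameterization:

```python
def decode_box(center, offset, size, stride=4):
    # Refine the coarse center cell with the sub-cell offset, scale back to
    # image coordinates, then expand by half the predicted width/height.
    cx = (center[0] + offset[0]) * stride
    cy = (center[1] + offset[1]) * stride
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Hypothetical predictions for one facade element.
box = decode_box(center=(10, 8), offset=(0.3, 0.7), size=(24.0, 16.0))
```

Here (10, 8) is the feature-map cell where the center heatmap peaked, and the offset recovers the sub-cell precision lost when the image was downsampled by the stride.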
In some embodiments, the image processing method is implemented by calling a neural network model; the device further comprises: the training module 5556 is configured to perform, through the initialized neural network model, a facade element prediction processing on an image sample including a building structure, to obtain predicted position information of a facade element in the image sample; construct a position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample; and update parameters of the neural network model based on the position loss function, and take the updated parameters of the neural network model as the parameters of the trained neural network model.
In some embodiments, the predicted position information is characterized by predicted center point information, predicted offset information, and predicted size information; the training module 5556 is further configured to construct a center point loss function of the neural network model based on the predicted center point information and the center point labels of the facade elements in the image sample; constructing an offset loss function of the neural network model based on the predicted offset information and offset labels of the facade elements in the image sample; constructing a size loss function of the neural network model based on the predicted size information and the size labels of the facade elements in the image sample; and carrying out weighted summation processing on the central point loss function, the offset loss function and the size loss function to obtain a position loss function of the neural network model.
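The weighted summation of the three component losses can be sketched as follows; the weight values are illustrative (the 0.1 size weight follows common CenterNet practice and is not a value stated in this document):

```python
def position_loss(l_center, l_offset, l_size,
                  w_center=1.0, w_offset=1.0, w_size=0.1):
    # Weighted sum of the center point, offset, and size losses; the
    # weights balance the differing scales of the three terms.
    return w_center * l_center + w_offset * l_offset + w_size * l_size

# Hypothetical per-batch component losses.
total = position_loss(l_center=0.8, l_offset=0.2, l_size=5.0)
```

In practice the center loss is typically a focal-style classification loss on the heatmap while the offset and size losses are L1 regressions, which is why the size term, measured in pixels, usually gets a smaller weight.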
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence based image processing method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform an artificial intelligence based image processing method provided by embodiments of the present application, for example, the artificial intelligence based image processing method as shown in fig. 3A-3C.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one of or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An artificial intelligence based image processing method, characterized in that the method comprises:
carrying out feature extraction processing on an image comprising a building structure to obtain initial features of the building structure;
performing column space-based aggregation processing on the initial features of the building structure to obtain column context features of the building structure;
performing aggregation processing based on a line space on the initial characteristics of the building structure to obtain line context characteristics of the building structure;
performing fusion processing based on the column context features and the row context features of the building structure to obtain enhanced features of the building structure;
and carrying out facade element detection processing based on the enhanced features of the building structure to obtain the position information of the facade elements in the building structure.
2. The method of claim 1, wherein said performing a column space based aggregation of initial features of the building structure to obtain column context features of the building structure comprises:
performing column attention processing on the initial features of the building structure to obtain a column attention map of the building structure;
and performing context aggregation processing based on the column attention map of the building structure to obtain column context features of the building structure.
3. The method of claim 2, wherein the performing a context aggregation process based on the column attention map of the building structure to obtain column context features of the building structure comprises:
carrying out mapping processing based on value characteristics on the initial characteristics of the building structure to obtain a value characteristic diagram of the building structure;
performing column feature extraction processing on the value feature map of the building structure to obtain column features in the value feature map;
and weighting the column features in the value feature map based on the column attention map of the building structure to obtain the column context features of the building structure.
4. The method of claim 2, wherein the performing a column attention process on the initial features of the building structure to obtain a column attention map of the building structure comprises:
mapping processing based on query features is carried out on the initial features of the building structure to obtain a query feature map of the building structure;
mapping processing based on key features is carried out on the initial features of the building structure to obtain a key feature map of the building structure;
and performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure.
5. The method of claim 4, wherein the performing column correlation processing based on the query feature map of the building structure and the key feature map of the building structure to obtain a column attention map of the building structure comprises:
performing column characteristic extraction processing on the key characteristic diagram of the building structure to obtain column characteristics of the key characteristic diagram;
and carrying out correlation processing on the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure.
6. The method of claim 5, wherein the correlating the query feature map of the building structure based on the column features of the key feature map to obtain a column attention map of the building structure comprises:
performing the following for any one of a plurality of locations in the query feature map:
determining a query feature vector for the location based on a query feature map of the building structure;
determining a column feature vector for the location based on column features of the key feature map;
determining an attention weight for the location based on the query feature vector for the location and the column feature vector for the location;
and combining the attention weights corresponding to the plurality of positions to obtain the column attention map of the building structure.
7. The method of claim 1, wherein the fusing based on the column context features and the row context features of the building structure to obtain the enhanced features of the building structure comprises:
concatenating the column context features and the row context features of the building structure to obtain the context features of the building structure;
mapping the context characteristics of the building structure to obtain the mapping characteristics of the building structure;
and adding the mapping characteristics of the building structure and the initial characteristics of the building structure to obtain the enhanced characteristics of the building structure.
8. The method according to claim 1, wherein the performing facade element detection processing based on the enhanced features of the building structure to obtain position information of the facade elements in the building structure comprises:
performing central point prediction processing based on a facade element on the enhanced feature of the building structure to obtain central point information of the facade element in the building structure;
carrying out size prediction processing based on a facade element on the enhanced feature of the building structure to obtain size information of the facade element in the building structure;
and determining the position information of the facade elements in the building structure based on the central point information of the facade elements in the building structure and the size information.
9. The method of claim 8, wherein determining the location information of the facade element in the building structure based on the center point information of the facade element in the building structure and the size information comprises:
performing offset prediction processing based on the facade elements on the enhanced features of the building structure to obtain offset information of the facade elements in the building structure;
adding the offset information of the facade elements in the building structure and the central point information to obtain standard central point information of the facade elements in the building structure;
and determining the position information of the facade elements in the building structure based on the standard central point information of the facade elements in the building structure and the size information.
10. The method of claim 1, wherein the image processing method is implemented by calling a neural network model; the training process of the neural network model comprises the following steps:
performing elevation element prediction processing on an image sample comprising a building structure through the initialized neural network model to obtain predicted position information of elevation elements in the image sample;
constructing a position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample;
and updating parameters of the neural network model based on the position loss function, and taking the updated parameters of the neural network model as the parameters of the trained neural network model.
11. The method of claim 10,
the predicted position information is characterized by predicted center point information, predicted offset information and predicted size information;
constructing a position loss function of the neural network model based on the predicted position information and the position labels of the facade elements in the image sample, including:
constructing a central point loss function of the neural network model based on the predicted central point information and a central point label of a facade element in the image sample;
constructing an offset loss function of the neural network model based on the predicted offset information and offset labels of the facade elements in the image sample;
constructing a size loss function of the neural network model based on the predicted size information and the size labels of the facade elements in the image sample;
and carrying out weighted summation processing on the central point loss function, the offset loss function and the size loss function to obtain a position loss function of the neural network model.
12. An artificial intelligence-based image processing apparatus, characterized in that the apparatus comprises:
the system comprises a characteristic extraction module, a feature extraction module and a feature extraction module, wherein the characteristic extraction module is used for carrying out characteristic extraction processing on an image comprising a building structure to obtain initial characteristics of the building structure;
the first aggregation module is used for carrying out aggregation processing based on a column space on the initial characteristics of the building structure to obtain column context characteristics of the building structure;
the second aggregation module is used for carrying out aggregation processing based on a line space on the initial characteristics of the building structure to obtain line context characteristics of the building structure;
the fusion module is used for carrying out fusion processing on the initial features, the column context features and the row context features of the building structure to obtain the enhanced features of the building structure;
and the detection module is used for carrying out facade element detection processing based on the enhanced features of the building structure to obtain the position information of the facade elements in the building structure.
13. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based image processing method of any one of claims 1 to 11 when executing executable instructions stored in the memory.
14. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based image processing method of any one of claims 1 to 11 when executed by a processor.
15. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the artificial intelligence based image processing method of any one of claims 1 to 11.
CN202111186574.9A 2021-10-12 2021-10-12 Image processing method, device, equipment and medium based on artificial intelligence Pending CN113902712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111186574.9A CN113902712A (en) 2021-10-12 2021-10-12 Image processing method, device, equipment and medium based on artificial intelligence


Publications (1)

Publication Number Publication Date
CN113902712A true CN113902712A (en) 2022-01-07

Family

ID=79191537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111186574.9A Pending CN113902712A (en) 2021-10-12 2021-10-12 Image processing method, device, equipment and medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN113902712A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369539A (en) * 2020-03-06 2020-07-03 浙江大学 Building facade window detecting system based on multi-feature map fusion
CN112200045A (en) * 2020-09-30 2021-01-08 华中科技大学 Remote sensing image target detection model establishing method based on context enhancement and application
CN113469086A (en) * 2021-07-09 2021-10-01 上海智臻智能网络科技股份有限公司 Method, device, equipment and medium for dividing areas in building plan


Non-Patent Citations (1)

Title
ZHANG YITENG: "Research on Image-Based 3D Reconstruction Methods for Urban Scenes", China Master's Theses Full-text Database, pages 4 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN114677604A (en) * 2022-04-20 2022-06-28 电子科技大学 Window state detection method based on machine vision
CN114677604B (en) * 2022-04-20 2023-04-07 电子科技大学 Window state detection method based on machine vision
CN116012626A (en) * 2023-03-21 2023-04-25 腾讯科技(深圳)有限公司 Material matching method, device, equipment and storage medium for building elevation image
CN117095300A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Building image processing method, device, computer equipment and storage medium
CN117095300B (en) * 2023-10-19 2024-02-06 腾讯科技(深圳)有限公司 Building image processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Guo et al. Data‐driven flood emulation: Speeding up urban flood predictions by deep convolutional neural networks
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
Arietta et al. City forensics: Using visual elements to predict non-visual city attributes
EP3757905A1 (en) Deep neural network training method and apparatus
US11854206B2 (en) Temporally distributed neural networks for video semantic segmentation
CN113902712A (en) Image processing method, device, equipment and medium based on artificial intelligence
CN109960742B (en) Local information searching method and device
CN111126258A (en) Image recognition method and related device
CN111241989A (en) Image recognition method and device and electronic equipment
CN113255915B (en) Knowledge distillation method, device, equipment and medium based on structured instance graph
CN115699088A (en) Generating three-dimensional object models from two-dimensional images
CN110516734B (en) Image matching method, device, equipment and storage medium
CN113343982A (en) Entity relationship extraction method, device and equipment for multi-modal feature fusion
CN112288831A (en) Scene image generation method and device based on generation countermeasure network
CN111950702A (en) Neural network structure determining method and device
JP2023073231A (en) Method and device for image processing
CN111311611A (en) Real-time three-dimensional large-scene multi-object instance segmentation method
CN115457492A (en) Target detection method and device, computer equipment and storage medium
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN113781519A (en) Target tracking method and target tracking device
CN115620122A (en) Training method of neural network model, image re-recognition method and related equipment
CN114565092A (en) Neural network structure determining method and device
CN114792401A (en) Training method, device and equipment of behavior recognition model and storage medium
CN115965736A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination