CN116704122A - Non-visual field imaging method, device, equipment and medium based on attention mechanism

Non-visual field imaging method, device, equipment and medium based on attention mechanism

Info

Publication number
CN116704122A
Authority
CN
China
Prior art keywords
feature
shallow
features
layer
deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310596556.0A
Other languages
Chinese (zh)
Inventor
熊志伟 (Zhiwei Xiong)
李越 (Yue Li)
张越一 (Yueyi Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202310596556.0A
Publication of CN116704122A
Legal status: Pending

Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/096: Transfer learning
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2200/08: Indexing scheme for image data processing or generation, in general, involving all processing steps from image acquisition to 3D model generation


Abstract

The present disclosure provides a non-visual field imaging method, apparatus, device and medium based on an attention mechanism. The method comprises: acquiring transient data related to a measured object; performing shallow feature extraction on the transient data with a shallow feature extraction network to obtain a first shallow feature and a second shallow feature; performing feature extraction on the first shallow feature with different feature extraction branches of a spatio-temporal self-attention network to obtain a middle-layer local feature and a middle-layer global feature; performing feature fusion on the middle-layer local feature and the middle-layer global feature with different feature fusion branches of a spatio-temporal cross attention network to obtain a deep local feature and a deep global feature; and fusing the second shallow feature, the deep local feature and the deep global feature with a deep shallow feature fusion network to obtain three-dimensional voxels, a depth image and a brightness image corresponding to the measured object.

Description

Non-visual field imaging method, device, equipment and medium based on attention mechanism
Technical Field
The present disclosure relates to the field of computer vision and the field of data processing technologies, and in particular, to a non-visual field imaging method, apparatus, device, and medium based on an attention mechanism.
Background
Traditional imaging methods focus mainly on recovering information within the line of sight: no obstacle lies on the path between the measured target and the camera of the acquisition device. In contrast, the hidden scenes recovered by non-visual field imaging techniques lie beyond the line of sight of the acquisition device's camera. Non-visual field imaging techniques can exploit light from the hidden scene that is scattered by a diffusely reflecting relay surface to image the hidden scene. In recent years, non-visual field imaging technology has brought major advances to fields such as autonomous driving, disaster relief and medical diagnosis.
In the related art, non-visual field imaging methods based on filtered back-projection or light-path transport generally impose restrictive conditions, such as an ideal diffuse reflection surface or no occlusion behind the relay wall, so the images they produce lose detail texture and suffer from severe noise. Methods based on wave propagation are sensitive to the depth range of hidden objects in the hidden scene and have difficulty accurately recovering distant regions of hidden objects at larger depths.
Recently, deep-learning-based methods have been introduced into non-visual field imaging. In the course of implementing the disclosed concept, the inventors found at least the following problem in the related art: in practical complex scenes, the quality with which deep-learning-based non-visual field imaging techniques image complex hidden scenes cannot meet the requirements of practical applications.
Disclosure of Invention
In view of the foregoing, the present disclosure provides attention-based non-field-of-view imaging methods, apparatus, devices, and media.
According to a first aspect of the present disclosure, there is provided a non-field of view imaging method based on an attention mechanism, comprising:
acquiring transient data related to an object to be measured, wherein the object to be measured is positioned in a non-visual field, and the transient data represents data composed of photon information of photons diffusely reflected by an intermediate wall and reflected by the object to be measured;
carrying out shallow feature extraction on the transient data by using a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, wherein the size of the first shallow feature is smaller than that of the second shallow feature;
performing feature extraction on the first shallow features by using different feature extraction branches of a space-time self-attention network to obtain middle-layer local features and middle-layer global features, wherein the different feature extraction branches of the space-time self-attention network respectively comprise at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local features is larger than that of the middle-layer global features;
performing feature fusion on the middle local feature and the middle global feature by utilizing different feature fusion branches of the space-time cross attention network to obtain a deep local feature and a deep global feature, wherein the different feature extraction branches of the space-time cross attention network respectively comprise at least one feature extraction layer based on an attention mechanism;
and fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels, depth images and brightness images corresponding to the measured object.
According to an embodiment of the present disclosure, the spatio-temporal self-attention network includes a spatio-temporal local feature encoder and a spatio-temporal global feature encoder, and the feature extracting the first shallow feature by using different feature extraction branches of the spatio-temporal self-attention network, to obtain a middle local feature and a middle global feature includes:
performing feature extraction on the first shallow feature by using the spatio-temporal local feature encoder to obtain the middle-layer local feature;
downsampling the first shallow feature to obtain a downsampled feature;
and performing feature extraction on the downsampled feature by using the spatio-temporal global feature encoder to obtain the middle-layer global feature.
According to an embodiment of the present disclosure, the space-time cross attention network includes a local cross attention network and a global cross attention network, the feature fusion of the middle local feature and the middle global feature by the different feature fusion branches of the space-time cross attention network to obtain deep local features and deep global features includes:
upsampling the middle-layer global feature to obtain an upsampled feature, wherein the upsampled feature has the same size as the middle-layer local feature;
performing feature fusion on the upsampled feature and the middle-layer local feature by using the local cross attention network to obtain the deep local feature, wherein the at least one attention-mechanism-based feature extraction layer included in the local cross attention network takes the upsampled feature as the query and the middle-layer local feature as the key and value;
and performing feature fusion on the upsampled feature and the middle-layer local feature by using the global cross attention network to obtain the deep global feature, wherein the at least one attention-mechanism-based feature extraction layer included in the global cross attention network takes the middle-layer local feature as the query and the upsampled feature as the key and value.
According to an embodiment of the present disclosure, the deep shallow feature fusion network includes $N_1$ upsampling layers, $N_2$ convolutional layers and $N_3$ nonlinear layers, where $N_1$, $N_2$ and $N_3$ are each integers greater than or equal to 1.
According to an embodiment of the present disclosure, the shallow feature extraction network includes a feature extraction layer, a feature conversion layer, and a feature enhancement layer.
According to an embodiment of the present disclosure, the feature extraction layer includes $N_4$ downsampling layers, $N_5$ convolutional layers and $N_6$ nonlinear layers, where $N_4$, $N_5$ and $N_6$ are each integers greater than or equal to 1;
the feature conversion layer includes at least one feature extraction layer based on a traditional non-visual field imaging algorithm;
and the feature enhancement layer includes $N_7$ downsampling layers, $N_8$ convolutional layers and $N_9$ nonlinear layers, where $N_7$, $N_8$ and $N_9$ are each integers greater than or equal to 1.
According to an embodiment of the present disclosure, the fusing the second shallow feature, the deep local feature, and the deep global feature by using a deep shallow feature fusion network to obtain a three-dimensional voxel, a depth image, and a luminance image corresponding to the measured object includes:
fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels corresponding to the measured object;
projecting the three-dimensional voxels along a depth axis to obtain the brightness image corresponding to the projection maximum value;
and obtaining the depth image according to the position information of the projection maximum value.
A second aspect of the present disclosure provides a non-field of view imaging apparatus based on an attention mechanism, comprising:
the acquisition module is used for acquiring transient data related to the measured object, wherein the measured object is positioned in a non-visual field, and the transient data represents data formed by photon information of photons diffusely reflected by an intermediate wall and reflected by the measured object;
the first obtaining module is used for carrying out shallow feature extraction on the transient data by utilizing a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, wherein the size of the first shallow feature is smaller than that of the second shallow feature;
the second obtaining module is used for carrying out feature extraction on the first shallow features by utilizing different feature extraction branches of the space-time self-attention network to obtain middle-layer local features and middle-layer global features, wherein the different feature extraction branches of the space-time self-attention network respectively comprise at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local features is larger than that of the middle-layer global features;
the third obtaining module is used for carrying out feature fusion on the middle-layer local features and the middle-layer global features by utilizing different feature fusion branches of the space-time cross attention network to obtain deep local features and deep global features, wherein the different feature extraction branches of the space-time cross attention network respectively comprise at least one feature extraction layer based on an attention mechanism;
and a fourth obtaining module, configured to fuse the second shallow feature, the deep local feature and the deep global feature by using a deep shallow feature fusion network, so as to obtain a three-dimensional voxel, a depth image and a brightness image corresponding to the measured object.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, shallow feature extraction is first performed on the transient data with a shallow feature extraction network to obtain a first shallow feature and a second shallow feature. A spatio-temporal self-attention network and a spatio-temporal cross attention network based on the attention mechanism are then applied in sequence to capture the local and global correlations of the original transient data related to the measured object in the non-visual field, yielding deep local features and deep global features that include those local and global correlations. Finally, a deep shallow feature fusion network fuses the second shallow feature, the deep local features and the deep global features to obtain the three-dimensional voxels, depth image and brightness image corresponding to the measured object. This improves the quality of the three-dimensional voxels, depth image and brightness image, so that they can meet the requirements of practical applications.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a non-field of view imaging method based on an attention mechanism in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a non-field of view acquisition device of a non-field of view imaging method based on an attention mechanism in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a non-field of view imaging method based on an attention mechanism in accordance with another embodiment of the present disclosure;
FIG. 5 schematically illustrates a luminance image and depth image schematic obtained by a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates an overall image schematic resulting from a non-field of view imaging method based on an attention mechanism in accordance with an embodiment of the present disclosure;
fig. 7 schematically illustrates a block diagram of a non-field of view imaging device based on an attention mechanism in accordance with an embodiment of the present disclosure; and
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement an attention-based non-field of view imaging method in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted in accordance with its commonly understood meaning (e.g., "a system having at least one of A, B and C" includes, but is not limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the data involved (including but not limited to users' personal information) all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
In a practical complex scene, the quality of imaging a complex hidden scene by a non-visual field imaging technology based on deep learning cannot meet the requirements of practical application. In order to at least partially solve the technical problems in the related art, embodiments of the present disclosure provide a non-visual field imaging method, apparatus, device and medium based on an attention mechanism, which may be applied to the technical field of computer vision and the technical field of data processing.
Embodiments of the present disclosure provide a non-field of view imaging method based on an attention mechanism, comprising: acquiring transient data related to the measured object, wherein the measured object is positioned in a non-visual field, and the transient data represents data composed of photon information of photons diffusely reflected by an intermediate wall and reflected by the measured object; carrying out shallow feature extraction on the transient data by using a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, wherein the size of the first shallow feature is smaller than that of the second shallow feature; performing feature extraction on the first shallow features by utilizing different feature extraction branches of the space-time self-attention network to obtain middle-layer local features and middle-layer global features, wherein the different feature extraction branches of the space-time self-attention network respectively comprise at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local features is larger than that of the middle-layer global features; performing feature fusion on the middle layer local features and the middle layer global features by utilizing different feature fusion branches of the space-time cross attention network to obtain deep local features and deep global features, wherein different feature extraction branches of the space-time cross attention network respectively comprise at least one feature extraction layer based on an attention mechanism; and fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels, depth images and brightness images corresponding to the measured object.
Fig. 1 schematically illustrates an application scenario diagram of a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the non-visual field imaging method based on the attention mechanism provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the attention-based non-field of view imaging devices provided by embodiments of the present disclosure may be generally disposed in the server 105. The attention-based non-view imaging method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the attention-based non-visual field imaging apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The attention-based non-field-of-view imaging method of the disclosed embodiments will be described in detail below with reference to the scenario described in fig. 1, by way of fig. 2-6.
Fig. 2 schematically illustrates a flow chart of a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure.
As shown in fig. 2, the non-visual field imaging method based on the attention mechanism of this embodiment includes operations S210 to S250.
In operation S210, transient data related to the measured object is acquired, where the measured object is located in the non-visual field, and the transient data characterizes data composed of photon information of photons diffusely reflected by the intermediate wall and reflected by the measured object.
According to an embodiment of the present disclosure, non-visual field simulation equipment may be used to simulate the hidden scene to obtain the transient data; alternatively, a single-photon receiver may be used to receive photons diffusely reflected by the intermediate wall and reflected by the measured object to obtain the transient data.
According to embodiments of the present disclosure, the transient data may include the number of photons reflected by the measured object, the propagation time of the photons, and measured position information.
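For illustration, the following minimal Python (PyTorch) sketch shows one common way such transient data can be organized, as a three-dimensional photon-count histogram indexed by time bin and scan position; the tensor name, shapes and sizes are assumptions for illustration, not the patent's actual data format.

```python
import torch

T, H, W = 512, 128, 128            # time bins and scan grid (assumed sizes)
tau = torch.zeros(T, H, W)         # photon-count histogram ("transient data")

# Recording one detected photon: a photon returning from scan position
# (x, y) after t_bin time bins increments the corresponding counter.
t_bin, x, y = 300, 64, 64
tau[t_bin, x, y] += 1
```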
In operation S220, shallow feature extraction is performed on the transient data by using the shallow feature extraction network, so as to obtain a first shallow feature and a second shallow feature, where the size of the first shallow feature is smaller than that of the second shallow feature.
According to an embodiment of the present disclosure, the shallow feature extraction network may be used to perform shallow feature extraction on the transient data to obtain the first shallow feature, which is then upsampled to obtain the second shallow feature. Alternatively, the shallow feature extraction network may be used to obtain the second shallow feature, which is then downsampled to obtain the first shallow feature. The manner of obtaining the first and second shallow features may be selected according to actual application requirements.
According to an embodiment of the present disclosure, performing shallow feature extraction on the transient data with the shallow feature extraction network yields the initial features related to the transient data, namely the first shallow feature and the second shallow feature, in preparation for subsequently extracting correlated features from these initial features with the attention-mechanism-based networks.
In operation S230, feature extraction is performed on the first shallow features by using different feature extraction branches of the spatio-temporal self-attentive network to obtain middle local features and middle global features, where the different feature extraction branches of the spatio-temporal self-attentive network respectively include at least one feature extraction layer based on an attentive mechanism, and a size of the middle local features is larger than a size of the middle global features.
According to an embodiment of the present disclosure, the number of feature extraction layers based on an attention mechanism included in different feature extraction branches of the spatio-temporal self-attention network may be, for example, 1, 2 or 3, etc., and the number of feature extraction layers based on an attention mechanism included in different feature extraction branches of the spatio-temporal self-attention network may be selected according to actual service requirements, and the embodiment of the present disclosure does not limit the number of feature extraction layers based on an attention mechanism.
According to an embodiment of the present disclosure, feature extraction is performed on the first shallow feature with the different feature extraction branches of the spatio-temporal self-attention network to obtain the middle-layer local feature and the middle-layer global feature, where each branch includes at least one attention-mechanism-based feature extraction layer. In computing the middle-layer local feature, the spatio-temporal self-attention network partitions the first shallow feature along the time dimension and the spatial dimensions and then computes self-attention jointly within the partitions, so that the middle-layer local feature preserves the continuity and consistency of the measured object's depth and brightness within local regions and perceives more local details of the measured object. In computing the middle-layer global feature, the network computes self-attention over the first shallow feature as a whole along the time and spatial dimensions, so as to perceive features of measured objects at greater depths.
In operation S240, feature fusion is performed on the middle layer local feature and the middle layer global feature by using different feature fusion branches of the spatio-temporal cross attention network to obtain a deep local feature and a deep global feature, where different feature extraction branches of the spatio-temporal cross attention network respectively include at least one feature extraction layer based on an attention mechanism.
According to an embodiment of the present disclosure, the spatio-temporal cross attention network has multiple input parameters. In performing feature fusion on the middle-layer local feature and the middle-layer global feature with the different feature fusion branches of the spatio-temporal cross attention network to obtain the deep local feature and the deep global feature, the middle-layer local feature and the middle-layer global feature serve as the input parameters of the spatio-temporal cross attention network, and the two branches assign them to those input parameters in different correspondences.
According to an embodiment of the present disclosure, the number of feature extraction layers based on an attention mechanism included in different feature extraction branches of the spatiotemporal cross-attention network may be, for example, 1, 2 or 3, etc., and the number of feature extraction layers based on an attention mechanism included in different feature extraction branches of the spatiotemporal cross-attention network may be selected according to actual service requirements, and the embodiment of the present disclosure does not limit the number of feature extraction layers based on an attention mechanism.
According to an embodiment of the present disclosure, feature fusion is performed on the middle-layer local feature and the middle-layer global feature with the different feature fusion branches of the spatio-temporal cross attention network, each of which includes at least one attention-mechanism-based feature extraction layer. This yields a deep local feature that fully fuses the local information contained in the middle-layer local and global features, and a deep global feature that fully fuses the global information contained in the middle-layer local and global features.
In operation S250, the second shallow feature, the deep local feature, and the deep global feature are fused by using the deep shallow feature fusion network, so as to obtain a three-dimensional voxel, a depth image, and a brightness image corresponding to the measured object.
According to an embodiment of the present disclosure, the deep shallow feature fusion network fuses the second shallow feature, the deep local feature and the deep global feature to obtain the three-dimensional voxels, depth image and brightness image corresponding to the measured object. Because these outputs are derived from fully fused deep and shallow features, their quality is improved, so that they can meet the requirements of practical applications.
According to the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, shallow feature extraction is first performed on the transient data with the shallow feature extraction network to obtain the first and second shallow features; the spatio-temporal self-attention network and the spatio-temporal cross attention network based on the attention mechanism are then applied in sequence to capture the local and global correlations of the original transient data related to the measured object in the non-visual field, yielding deep local and deep global features that include those correlations; and finally the deep shallow feature fusion network fuses the second shallow feature, the deep local features and the deep global features to obtain the three-dimensional voxels, depth image and brightness image corresponding to the measured object, improving their quality so that they can meet the requirements of practical applications.
Fig. 3 schematically illustrates a schematic diagram of a non-field of view acquisition device of a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure.
As shown in fig. 3, the non-visual field acquisition device includes a laser 301, a beam splitter 302, a galvanometer 303, a lens 304, an optical fiber, and a single-photon avalanche diode (SPAD) detector 305.
As shown in fig. 3, there is a barrier 307 between the measured object, rabbit 306, in the hidden scene and the single-photon detector 305. The process of acquiring transient data of the measured rabbit 306 in the hidden scene with the non-visual field acquisition device shown in fig. 3 is as follows: the laser 301 emits short-pulse light, which passes through the beam splitter 302 and the galvanometer 303 and reaches position $P_l$ on the intermediate wall 308; it is diffusely reflected by the intermediate wall 308 and reaches the measured rabbit 306; reflected by the rabbit 306, it reaches position $P_s$ on the intermediate wall 308, where it is diffusely reflected again and passes in turn through the galvanometer, the lens 304 and the optical fiber to reach the single-photon detector (SPAD) 305, yielding the transient data τ received by the SPAD 305.
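For illustration, the photon propagation time recorded in the transient data corresponds to the total optical path length of the bounce sequence described above (laser to $P_l$, to the hidden object, to $P_s$, to the detector); this time-of-flight relation is standard in non-visual field imaging rather than spelled out in the patent. The sketch below computes the time bin for one such path; all coordinates and the time-bin width are made-up example values.

```python
import math

c = 3e8                                  # speed of light (m/s)
laser = (0.0, 0.0, 0.0)                  # co-located laser/detector optics (assumed)
P_l = (0.0, 0.5, 1.0)                    # illuminated point on the relay wall
obj = (0.8, 0.4, 0.3)                    # point on the hidden object
P_s = (0.1, 0.6, 1.0)                    # observed point on the relay wall
spad = (0.0, 0.0, 0.0)

path = (math.dist(laser, P_l) + math.dist(P_l, obj)
        + math.dist(obj, P_s) + math.dist(P_s, spad))
t = path / c                             # photon propagation time (s)
bin_width = 16e-12                       # SPAD time-bin width (assumed)
print(int(t / bin_width))                # time-bin index into the histogram
```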
According to an embodiment of the present disclosure, a shallow feature extraction network includes a feature extraction layer, a feature conversion layer, and a feature enhancement layer.
According to an embodiment of the present disclosure, the feature extraction layer includes $N_4$ downsampling layers, $N_5$ convolutional layers and $N_6$ nonlinear layers, where $N_4$, $N_5$ and $N_6$ are each integers greater than or equal to 1;
the feature conversion layer includes at least one feature extraction layer based on a traditional non-visual field imaging algorithm;
and the feature enhancement layer includes $N_7$ downsampling layers, $N_8$ convolutional layers and $N_9$ nonlinear layers, where $N_7$, $N_8$ and $N_9$ are each integers greater than or equal to 1.
According to an embodiment of the present disclosure, $N_4$, $N_5$, $N_6$, $N_7$, $N_8$ and $N_9$ may be selected according to actual application requirements, and the embodiments of the present disclosure place no limit on them.
According to embodiments of the present disclosure, the feature extraction layer included in the shallow feature extraction network may include, for example, 1 downsampling layer, 3 convolutional layers and 3 nonlinear layers (ReLU).
According to the embodiment of the present disclosure, the conventional non-field of view imaging algorithm may be selected according to actual service conditions, and the embodiment of the present disclosure does not limit the conventional non-field of view imaging algorithm.
According to embodiments of the present disclosure, the feature conversion layer included in the shallow feature extraction network may be, for example, the traditional non-visual field imaging algorithm f-k migration (FK; wave-based non-line-of-sight imaging using fast f-k migration).
According to embodiments of the present disclosure, the feature enhancement layer included in the shallow feature extraction network may include, for example, 2 downsampling layers, 4 convolutional layers and 5 nonlinear layers (ReLU).
According to an embodiment of the present disclosure, because the feature conversion layer included in the shallow feature extraction network includes at least one feature extraction layer based on a traditional non-visual field imaging algorithm, the initial features obtained by the shallow feature extraction network, namely the first shallow feature and the second shallow feature, are more accurate.
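As a hedged illustration of the structure just described, the following PyTorch sketch stacks a feature extraction layer, a feature conversion layer and a feature enhancement layer with the example layer counts above; the channel counts, kernel sizes, the use of 3D convolutions and the identity stand-in for the f-k migration transform are all assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowFeatureExtraction(nn.Module):
    def __init__(self, c=8):
        super().__init__()
        # Feature extraction layer: 1 downsampling layer, 3 convolution
        # layers and 3 nonlinear layers (ReLU), per the example above.
        self.extract = nn.Sequential(
            nn.MaxPool3d(2),
            nn.Conv3d(1, c, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
        )
        # Feature conversion layer: the patent uses a traditional
        # non-visual field imaging transform (e.g. f-k migration);
        # an identity module stands in for it in this sketch.
        self.convert = nn.Identity()
        # Feature enhancement layer: 2 downsampling layers and 4
        # convolution layers with nonlinearities, per the example above.
        self.enhance = nn.Sequential(
            nn.MaxPool3d(2),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
        )

    def forward(self, tau):                      # tau: (B, 1, T, H, W)
        f_second = self.enhance(self.convert(self.extract(tau)))
        f_first = F.avg_pool3d(f_second, 2)      # smaller first shallow feature
        return f_first, f_second

tau = torch.randn(1, 1, 64, 64, 64)              # transient data (assumed size)
f_first, f_second = ShallowFeatureExtraction()(tau)
```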
According to an embodiment of the present disclosure, a spatio-temporal self-attention network includes a spatio-temporal local feature encoder and a spatio-temporal global feature encoder, and for operation S230 shown in fig. 2, performing feature extraction on a first shallow feature by using different feature extraction branches of the spatio-temporal self-attention network to obtain a middle local feature and a middle global feature, may include the following operations:
performing feature extraction on the first shallow feature by using the spatio-temporal local feature encoder to obtain the middle-layer local feature;
downsampling the first shallow feature to obtain a downsampled feature;
and performing feature extraction on the downsampled feature by using the spatio-temporal global feature encoder to obtain the middle-layer global feature.
According to embodiments of the present disclosure, the spatio-temporal local feature encoder may include, for example, an attention-mechanism-based (Transformer) feature extraction layer, and the spatio-temporal global feature encoder may likewise include, for example, a Transformer-based feature extraction layer.
According to an embodiment of the present disclosure, the method used to downsample the first shallow feature may be selected according to actual application requirements, and the embodiments of the present disclosure place no limit on the downsampling method. The downsampling method may be, for example, bilinear interpolation.
According to an embodiment of the present disclosure, because the different feature extraction branches of the spatio-temporal self-attention network each include at least one attention-mechanism-based feature extraction layer, the spatio-temporal local feature encoder, in extracting the middle-layer local feature from the first shallow feature, partitions the first shallow feature along the time dimension and the spatial dimensions and then computes self-attention jointly within the partitions. The resulting middle-layer local feature preserves the continuity and consistency of the measured object's depth and brightness within local regions and perceives more local details of the measured object.
According to an embodiment of the present disclosure, the first shallow feature is downsampled to obtain the downsampled feature, and the spatio-temporal global feature encoder extracts the middle-layer global feature from the downsampled feature by computing self-attention over the feature as a whole along the time and spatial dimensions, thereby perceiving more features of measured objects at larger depths.
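As a hedged illustration of the two branches just described, the following PyTorch sketch applies windowed self-attention for the local branch and full self-attention on a downsampled feature for the global branch; the window size, embedding dimension, head count and pooling choice are assumptions, and the patent's actual encoders are Transformer-style stacks rather than single attention layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBranch(nn.Module):
    """Partition the (t, x, y) feature into windows and attend within each."""
    def __init__(self, dim=8, window=4, heads=2):
        super().__init__()
        self.w = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f):                        # f: (B, C, T, X, Y)
        B, C, T, X, Y = f.shape
        w = self.w
        # split the time and spatial axes into w-sized windows
        f = f.view(B, C, T // w, w, X // w, w, Y // w, w)
        f = f.permute(0, 2, 4, 6, 3, 5, 7, 1)    # (B, nt, nx, ny, w, w, w, C)
        f = f.reshape(-1, w * w * w, C)          # tokens within each window
        f = self.attn(f, f, f)[0]                # self-attention per window
        f = f.view(B, T // w, X // w, Y // w, w, w, w, C)
        f = f.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, C, T, X, Y)
        return f

class GlobalBranch(nn.Module):
    """Downsample, then attend over all positions as a whole."""
    def __init__(self, dim=8, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f):                        # f: (B, C, T, X, Y)
        f = F.avg_pool3d(f, 2)                   # pooling stands in for the
        B, C, T, X, Y = f.shape                  # bilinear-style downsampling
        tokens = f.flatten(2).transpose(1, 2)    # (B, T*X*Y, C)
        tokens = self.attn(tokens, tokens, tokens)[0]
        return tokens.transpose(1, 2).reshape(B, C, T, X, Y)

f_s = torch.randn(1, 8, 16, 16, 16)              # first shallow feature (assumed)
f_l = LocalBranch()(f_s)                          # middle-layer local feature
f_g = GlobalBranch()(f_s)                         # middle-layer global feature
```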
According to an embodiment of the present disclosure, the spatiotemporal cross-attention network includes a local cross-attention network and a global cross-attention network, and for operation S240 shown in fig. 2, feature fusion is performed on the middle local feature and the middle global feature by using different feature fusion branches of the spatiotemporal cross-attention network to obtain a deep local feature and a deep global feature, which may include the following operations:
upsampling the middle-layer global feature to obtain an upsampled feature, wherein the size of the upsampled feature is the same as the size of the middle-layer local feature;
performing feature fusion on the upsampled feature and the middle-layer local feature by using the local cross attention network to obtain the deep local feature, wherein the at least one attention-mechanism-based feature extraction layer included in the local cross attention network takes the upsampled feature as the query and the middle-layer local feature as the key and value;
and performing feature fusion on the upsampled feature and the middle-layer local feature by using the global cross attention network to obtain the deep global feature, wherein the at least one attention-mechanism-based feature extraction layer included in the global cross attention network takes the middle-layer local feature as the query and the upsampled feature as the key and value.
According to an embodiment of the present disclosure, the upsampling method used to upsample the middle-layer global feature may be selected according to actual application requirements, and the embodiments of the present disclosure place no limit on the upsampling method. The upsampling method may be, for example, bilinear interpolation.
According to embodiments of the present disclosure, the local cross attention network may include, for example, an attention-mechanism-based (Transformer) feature extraction layer, and the global cross attention network may likewise include, for example, an attention-mechanism-based (Transformer) feature extraction layer.
According to an embodiment of the present disclosure, because the different feature fusion branches of the spatio-temporal cross attention network each include at least one attention-mechanism-based feature extraction layer, upsampling the middle-layer global feature to obtain the upsampled feature and then fusing the upsampled feature with the middle-layer local feature through the local cross attention network yields a deep local feature that fully fuses the local information contained in the middle-layer local and global features.
According to an embodiment of the present disclosure, for the same reason, fusing the upsampled feature with the middle-layer local feature through the global cross attention network yields a deep global feature that fully fuses the global information contained in the middle-layer local and global features.
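As a hedged illustration, the following sketch fuses the two features with standard cross-attention, following the query/key/value assignments described above; the token handling, shapes and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_tokens(f):                       # (B, C, T, X, Y) -> (B, N, C)
    return f.flatten(2).transpose(1, 2)

def to_volume(t, shape):                # (B, N, C) -> (B, C, T, X, Y)
    B, _, C = t.shape
    return t.transpose(1, 2).reshape(B, C, *shape)

dim, heads = 8, 2
local_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
global_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

f_l = torch.randn(1, dim, 16, 16, 16)   # middle-layer local feature (assumed)
f_g = torch.randn(1, dim, 8, 8, 8)      # middle-layer global feature (assumed)

# upsample the global feature to the size of the local feature
f_g_up = F.interpolate(f_g, size=f_l.shape[2:], mode="trilinear")

q, kv = to_tokens(f_g_up), to_tokens(f_l)
# local branch: query = upsampled feature, key/value = middle-layer local
f_l_deep = to_volume(local_cross(q, kv, kv)[0], f_l.shape[2:])
# global branch: query = middle-layer local, key/value = upsampled feature
f_g_deep = to_volume(global_cross(kv, q, q)[0], f_l.shape[2:])
```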
According to an embodiment of the present disclosure, the deep shallow feature fusion network includes $N_1$ upsampling layers, $N_2$ convolutional layers and $N_3$ nonlinear layers, where $N_1$, $N_2$ and $N_3$ are each integers greater than or equal to 1.
According to an embodiment of the present disclosure, $N_1$, $N_2$ and $N_3$ may be selected according to actual application requirements, and the embodiments of the present disclosure place no limit on them.
According to embodiments of the present disclosure, the deep shallow feature fusion network may include, for example, 2 upsampling layers, 2 convolutional layers and 2 nonlinear layers (ReLU).
According to an embodiment of the present disclosure, for operation S250 shown in fig. 2, fusing the second shallow feature, the deep local feature, and the deep global feature by using a deep shallow feature fusion network to obtain a three-dimensional voxel, a depth image, and a luminance image corresponding to the measured object may include the following operations:
fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels corresponding to the object to be measured;
projecting the three-dimensional voxels along a depth axis to obtain a brightness image corresponding to the maximum projection value;
and obtaining a depth image according to the position information of the projection maximum value.
According to an embodiment of the present disclosure, the deep shallow feature fusion network fuses the second shallow feature, the deep local feature and the deep global feature to obtain the three-dimensional voxels corresponding to the measured object from fully fused deep and shallow features, improving voxel quality. The higher-quality three-dimensional voxels are then projected along the depth axis: the brightness image is taken from the projection maxima, and the depth image is obtained from the position information of those maxima. This improves the quality of the brightness and depth images, so that the three-dimensional voxels, depth image and brightness image corresponding to the measured object can meet the requirements of practical applications.
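As a hedged illustration of this projection, the following sketch derives the brightness image from the maxima of the voxel volume along the depth axis and the depth image from the positions of those maxima; the voxel layout and depth normalization are assumptions.

```python
import torch

D, H, W = 64, 128, 128
voxels = torch.rand(D, H, W)            # three-dimensional voxels V (assumed layout)

values, indices = voxels.max(dim=0)     # project along the depth axis
luminance = values                      # brightness image: the projection maxima
depth = indices.float() / (D - 1)       # depth image from the maxima's positions
```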
According to embodiments of the present disclosure, in training the networks included in the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, the brightness loss $L_I$ may be calculated using equation (1):

$L_I = |I - I_{gt}|$ (1)

where $I$ denotes the predicted luminance image of the hidden scene and $I_{gt}$ denotes the true luminance image of the hidden scene.
According to embodiments of the present disclosure, in training the networks included in the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, the depth loss $L_D$ may be calculated using equation (2):

$L_D = |D - D_{gt}|$ (2)

where $D$ denotes the predicted depth image of the hidden scene and $D_{gt}$ denotes the true depth image of the hidden scene.
According to an embodiment of the present disclosure, in training the networks included in the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, the total network loss $L_{total}$ may be calculated using equation (3):

$L_{total} = L_I + \alpha_1 L_D$ (3)

where $\alpha_1$ is a hyperparameter that weighs the relative contributions of $L_I$ and $L_D$. According to an embodiment of the present disclosure, $\alpha_1$ may be selected according to actual application requirements; $\alpha_1$ may be, for example, 1.
According to embodiments of the present disclosure, in training the networks included in the attention-mechanism-based non-visual field imaging method provided by the embodiments of the present disclosure, with the aim of minimizing the total network loss $L_{total}$, the networks may be trained with the Adam optimization algorithm, updating the network parameters until the total loss function $L_{total}$ converges, thereby obtaining trained networks. Once trained, the networks can reconstruct the three-dimensional voxels $V$, the depth image $D$ and the brightness image $I$ of the hidden scene.
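As a hedged illustration of this training objective, the following sketch computes $L_{total}$ per equations (1)-(3) and performs one Adam update; the stand-in model, tensor shapes and learning rate are assumptions ($\alpha_1 = 1$ follows the example value above).

```python
import torch

def total_loss(I_pred, D_pred, I_gt, D_gt, alpha_1=1.0):
    L_I = (I_pred - I_gt).abs().mean()   # brightness loss, equation (1)
    L_D = (D_pred - D_gt).abs().mean()   # depth loss, equation (2)
    return L_I + alpha_1 * L_D           # total loss, equation (3)

# stand-in for the full imaging network: maps an input to a 2-channel
# output read here as (brightness image, depth image)
model = torch.nn.Conv2d(1, 2, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(4, 1, 128, 128)                  # training batch (assumed shape)
I_gt, D_gt = torch.rand(4, 128, 128), torch.rand(4, 128, 128)

out = model(x)
loss = total_loss(out[:, 0], out[:, 1], I_gt, D_gt)
optimizer.zero_grad()
loss.backward()                                   # iterate until L_total converges
optimizer.step()
```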
Fig. 4 schematically illustrates a flow chart of a non-field of view imaging method based on an attention mechanism according to another embodiment of the present disclosure.
As shown in fig. 4, the transient data τ 401 may be acquired with the acquisition device shown in fig. 3. The transient data τ 401 is input into the shallow feature extraction network and processed in turn by its feature extraction layer 411, feature conversion layer 412 and feature enhancement layer 413 to obtain the second shallow feature $F_S^*$; the second shallow feature $F_S^*$ is then downsampled to obtain the first shallow feature $F_S$.
As shown in fig. 4, the first shallow feature $F_S$ is input into the spatio-temporal local feature encoder 421 included in the spatio-temporal self-attention network, which computes the middle-layer local feature $F_L$. The first shallow feature $F_S$ is downsampled by bilinear interpolation and input into the spatio-temporal global feature encoder 422, which computes the middle-layer global feature $F_G$.
As shown in fig. 4, the middle-layer global feature $F_G$ is upsampled (e.g., by bilinear interpolation) to obtain the upsampled feature, written $F_G'$ here. The middle-layer local feature $F_L$ and the upsampled feature $F_G'$ are input into the local cross attention network 431, which takes the upsampled feature $F_G'$ as its query (Query) and the middle-layer local feature $F_L$ as its key (Key) and value (Value), and computes the deep local feature $F_L^*$.
As shown in fig. 4, the middle-layer local feature $F_L$ and the upsampled feature $F_G'$ are also input into the global cross attention network 432, which takes the middle-layer local feature $F_L$ as its query (Query) and the upsampled feature $F_G'$ as its key (Key) and value (Value), and computes the deep global feature $F_G^*$.
As shown in fig. 4, the second shallow feature $F_S^*$, the deep local feature $F_L^*$ and the deep global feature $F_G^*$ are input into the deep shallow feature fusion network 440, which computes the reconstructed three-dimensional voxels $V$, depth image $D$ and brightness image $I$ of the measured object 402 in the hidden scene.
According to embodiments of the present disclosure, in order to illustrate the effectiveness of the attention-based non-visual field imaging method provided by the embodiments of the present disclosure, the depth and brightness reconstructions of the measured object produced by the network were evaluated on synthetic data. The compared methods comprise conventional non-visual field algorithms, namely FBP (ellipsoidal back-projection reconstruction), LCT (light cone transform), FK (f-k migration method, "Wave-based non-line-of-sight imaging using fast f-k migration") and RSD (phasor-field method, "Non-line-of-sight imaging using phasor-field virtual wave optics"), and deep learning algorithms, namely UNet (U-Net method, "Deep non-line-of-sight reconstruction"), LFE (feature embedding method, "Learned feature embeddings for non-line-of-sight imaging and recognition") and NeTF (neural transient field method, "Non-line-of-sight imaging via neural transient fields").
Fig. 5 schematically illustrates a luminance image and a depth image schematic obtained by a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure.
Fig. 5 shows the reconstruction visualization results of the attention-based non-visual field imaging method provided in an embodiment of the present disclosure and the compared methods on the synthetic data, where odd rows show luminance reconstruction results and even rows show depth reconstruction results.

As shown in fig. 5, Ours denotes the reconstruction visualization results of the attention-based non-visual field imaging method provided by the embodiments of the present disclosure on the synthetic data.
As can be seen from fig. 5, for the luminance images, the structures of the measured objects restored by FBP, LCT and UNet are blurred. FK and RSD restore the main structure of the measured object but miss its details. LFE performs better than the traditional non-visual field algorithms but still loses details. In contrast, the non-visual field imaging method based on the attention mechanism provided by the embodiments of the present disclosure restores both the structure and the details of the measured object.

It can also be seen from fig. 5 that, for the depth maps, FBP, LCT, FK and LFE have difficulty reconstructing the details of the measured object, such as the wheels of the motorcycle in fig. 5, while RSD and UNet have difficulty restoring its main structure. In contrast, the non-visual field imaging method based on the attention mechanism provided by the embodiments of the present disclosure reconstructs more details, especially in textured areas and distant areas.
Fig. 6 schematically illustrates an overall image schematic resulting from a non-field of view imaging method based on an attention mechanism according to an embodiment of the present disclosure.
Fig. 6 shows the reconstruction visualization results of the attention-based non-visual field imaging method provided by embodiments of the present disclosure and the compared methods under a real imaging system.

As shown in fig. 6, Ours denotes the reconstruction visualization results of the attention-based non-visual field imaging method provided by the embodiments of the present disclosure under the real imaging system.

As can be seen from fig. 6, the non-visual field imaging method based on the attention mechanism provided by the embodiments of the present disclosure recovers clear details and boundaries in the hidden scene, especially for the beams of the bicycle, the bookshelf and the pedestrian in fig. 6. FBP and LCT produce blurry results. FK and RSD can reconstruct the main structure of the measured object, but with considerable noise. NeTF can only recover the rough shape of the measured object. LFE performs better than the traditional non-visual field algorithms but still loses details.

As can be seen from fig. 5 and fig. 6, the non-visual field imaging method based on the attention mechanism provided in the embodiments of the present disclosure obtains the best reconstruction results compared with the conventional non-visual field algorithms.
It should be noted that, unless the operations logically require a particular execution order or the technical implementation dictates one, the execution order of multiple operations may differ, and multiple operations may also be executed simultaneously in the embodiments of the present disclosure.
Based on the above non-visual field imaging method based on the attention mechanism, the present disclosure further provides a non-visual field imaging device based on the attention mechanism. The device will be described in detail below with reference to fig. 7.
Fig. 7 schematically illustrates a block diagram of a non-field of view imaging device based on an attention mechanism according to an embodiment of the present disclosure.
As shown in fig. 7, the non-visual field imaging apparatus 700 based on the attention mechanism of this embodiment includes an acquisition module 710, a first obtaining module 720, a second obtaining module 730, a third obtaining module 740, and a fourth obtaining module 750.
An acquisition module 710, configured to acquire transient data related to the measured object, where the measured object is located in a non-visual field and the transient data represents data composed of photon information of photons diffusely reflected by the intermediate wall and reflected by the measured object. In an embodiment, the acquisition module 710 may be configured to perform the operation S210 described above, which will not be repeated here.

The first obtaining module 720 is configured to perform shallow feature extraction on the transient data by using a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, where the size of the first shallow feature is smaller than that of the second shallow feature. In an embodiment, the first obtaining module 720 may be configured to perform the operation S220 described above, which will not be repeated here.

The second obtaining module 730 is configured to obtain a middle-layer local feature and a middle-layer global feature by performing feature extraction on the first shallow feature using different feature extraction branches of the spatio-temporal self-attention network, where the different feature extraction branches each include at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local feature is larger than that of the middle-layer global feature. In an embodiment, the second obtaining module 730 may be configured to perform the operation S230 described above, which will not be repeated here.

The third obtaining module 740 is configured to perform feature fusion on the middle-layer local feature and the middle-layer global feature by using different feature fusion branches of the spatio-temporal cross-attention network to obtain a deep local feature and a deep global feature, where the different feature fusion branches each include at least one feature extraction layer based on an attention mechanism. In an embodiment, the third obtaining module 740 may be configured to perform the operation S240 described above, which will not be repeated here.

The fourth obtaining module 750 is configured to fuse the second shallow feature, the deep local feature and the deep global feature by using a deep-shallow feature fusion network to obtain the three-dimensional voxels, the depth image and the brightness image corresponding to the measured object. In an embodiment, the fourth obtaining module 750 may be configured to perform the operation S250 described above, which will not be repeated here.
According to embodiments of the present disclosure, any plurality of the acquisition module 710, the first obtaining module 720, the second obtaining module 730, the third obtaining module 740 and the fourth obtaining module 750 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 710, the first obtaining module 720, the second obtaining module 730, the third obtaining module 740 and the fourth obtaining module 750 may be implemented at least in part as hardware circuitry, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system on package, an application specific integrated circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging circuitry, or as any one of, or a suitable combination of, software, hardware and firmware implementations. Alternatively, at least one of the acquisition module 710, the first obtaining module 720, the second obtaining module 730, the third obtaining module 740 and the fourth obtaining module 750 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement an attention-based non-field of view imaging method in accordance with an embodiment of the present disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to an input/output (I/O) interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to an input/output (I/O) interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium described above carries one or more programs which, when executed, implement a non-visual field imaging method based on an attention mechanism according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the attention-mechanism-based non-visual field imaging method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 809, and/or installed from the removable medium 811. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code of the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C" and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or incorporated without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A non-field of view imaging method based on an attention mechanism, comprising:
acquiring transient data related to a measured object, wherein the measured object is positioned in a non-visual field, and the transient data represents data composed of photon information of photons diffusely reflected by an intermediate wall and reflected by the measured object;
carrying out shallow feature extraction on the transient data by using a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, wherein the size of the first shallow feature is smaller than that of the second shallow feature;
performing feature extraction on the first shallow features by utilizing different feature extraction branches of a space-time self-attention network to obtain middle-layer local features and middle-layer global features, wherein the different feature extraction branches of the space-time self-attention network respectively comprise at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local features is larger than that of the middle-layer global features;
performing feature fusion on the middle local feature and the middle global feature by utilizing different feature fusion branches of the space-time cross attention network to obtain a deep local feature and a deep global feature, wherein different feature extraction branches of the space-time cross attention network respectively comprise at least one feature extraction layer based on an attention mechanism;
and fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels, depth images and brightness images corresponding to the measured object.
2. The method of claim 1, wherein the spatio-temporal self-attention network comprises a spatio-temporal local feature encoder and a spatio-temporal global feature encoder, the feature extraction of the first shallow features using different feature extraction branches of the spatio-temporal self-attention network, obtaining middle layer local features and middle layer global features comprising:
performing feature extraction on the first shallow feature by using the spatio-temporal local feature encoder to obtain the middle layer local feature;
downsampling the first shallow layer feature to obtain a downsampled feature;
and extracting the characteristics of the downsampled characteristics by using the space-time global characteristic encoder to obtain the middle-layer global characteristics.
3. The method of claim 1, wherein the spatiotemporal cross-attention network comprises a local cross-attention network and a global cross-attention network, the feature fusing the mid-level local features and the mid-level global features using different feature fusion branches of the spatiotemporal cross-attention network, the deriving deep local features and deep global features comprising:
upsampling the middle-layer global feature to obtain an upsampled feature, wherein the upsampled feature has the same size as the middle-layer local feature;
performing feature fusion on the upsampling feature and the middle layer local feature by using the local cross attention network to obtain the deep local feature, wherein at least one feature extraction layer based on an attention mechanism, which is included in the local cross attention network, takes the upsampling feature as a query value and takes the middle layer local feature as a key value and an index value;
and carrying out feature fusion on the up-sampling features and the middle-layer local features by using the global cross attention network to obtain the deep global features, wherein at least one feature extraction layer based on an attention mechanism, which is included in the global cross attention network, takes the middle-layer local features as query values and takes the up-sampling features as key values and index values.
4. A method according to any one of claims 1 to 3, wherein the deep shallow feature fusion network comprises N_1 upsampling layers, N_2 convolutional layers and N_3 nonlinear layers, wherein N_1, N_2 and N_3 are each integers greater than or equal to 1.
5. A method according to any one of claims 1 to 3, wherein the shallow feature extraction network comprises a feature extraction layer, a feature conversion layer and a feature enhancement layer.
6. The method of claim 5, wherein the feature extraction layer comprises N_4 downsampling layers, N_5 convolutional layers and N_6 nonlinear layers, wherein N_4, N_5 and N_6 are each integers greater than or equal to 1;

the feature conversion layer comprises at least one feature extraction layer based on a traditional non-visual field imaging algorithm;

the feature enhancement layer comprises N_7 downsampling layers, N_8 convolutional layers and N_9 nonlinear layers, wherein N_7, N_8 and N_9 are each integers greater than or equal to 1.
7. The method of claim 1, wherein the fusing the second shallow feature, the deep local feature, and the deep global feature with a deep shallow feature fusion network to obtain three-dimensional voxels, depth images, and luminance images corresponding to the object under test comprises:
fusing the second shallow features, the deep local features and the deep global features by using a deep shallow feature fusion network to obtain three-dimensional voxels corresponding to the measured object;
projecting the three-dimensional voxels along a depth axis to obtain the brightness image corresponding to the projection maximum value;
and obtaining the depth image according to the position information of the projection maximum value.
8. A non-field of view imaging apparatus based on an attention mechanism, comprising:
the system comprises an acquisition module, a detection module and a detection module, wherein the acquisition module is used for acquiring transient data related to a detected object, the detected object is positioned in a non-visual field, and the transient data represents data formed by photon information of photons diffusely reflected by an intermediate wall and reflected by the detected object;
the first obtaining module is used for carrying out shallow feature extraction on the transient data by utilizing a shallow feature extraction network to obtain a first shallow feature and a second shallow feature, wherein the size of the first shallow feature is smaller than that of the second shallow feature;
the second obtaining module is used for carrying out feature extraction on the first shallow features by utilizing different feature extraction branches of the space-time self-attention network to obtain middle-layer local features and middle-layer global features, wherein the different feature extraction branches of the space-time self-attention network respectively comprise at least one feature extraction layer based on an attention mechanism, and the size of the middle-layer local features is larger than that of the middle-layer global features;
the third obtaining module is used for carrying out feature fusion on the middle local feature and the middle global feature by utilizing different feature fusion branches of the space-time cross attention network to obtain a deep local feature and a deep global feature, wherein the different feature extraction branches of the space-time cross attention network respectively comprise at least one feature extraction layer based on an attention mechanism;
and a fourth obtaining module, configured to fuse the second shallow feature, the deep local feature and the deep global feature by using a deep shallow feature fusion network, so as to obtain a three-dimensional voxel, a depth image and a brightness image corresponding to the measured object.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
CN202310596556.0A 2023-05-23 2023-05-23 Non-visual field imaging method, device, equipment and medium based on attention mechanism Pending CN116704122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596556.0A CN116704122A (en) 2023-05-23 2023-05-23 Non-visual field imaging method, device, equipment and medium based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310596556.0A CN116704122A (en) 2023-05-23 2023-05-23 Non-visual field imaging method, device, equipment and medium based on attention mechanism

Publications (1)

Publication Number Publication Date
CN116704122A true CN116704122A (en) 2023-09-05

Family

ID=87838444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596556.0A Pending CN116704122A (en) 2023-05-23 2023-05-23 Non-visual field imaging method, device, equipment and medium based on attention mechanism

Country Status (1)

Country Link
CN (1) CN116704122A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912086A (en) * 2024-03-19 2024-04-19 中国科学技术大学 Face recognition method, system, equipment and medium based on broadcast-cut effect driving
CN117912086B (en) * 2024-03-19 2024-05-31 中国科学技术大学 Face recognition method, system, equipment and medium based on broadcast-cut effect driving



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination