CN114547364A - Cross-modal stereoscopic vision object retrieval method and device - Google Patents

Cross-modal stereoscopic vision object retrieval method and device

Info

Publication number
CN114547364A
CN114547364A (application CN202210145571.9A)
Authority
CN
China
Prior art keywords
instance, domain, cross, modal, dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210145571.9A
Other languages
Chinese (zh)
Inventor
高跃
戴岳
赵曦滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202210145571.9A
Publication of CN114547364A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a cross-modal stereoscopic vision object retrieval method and device, wherein the method comprises the following steps: extracting the depth features of each modality to obtain instances; constructing a dynamic graph structure within each modal domain, and convolutionally encoding the instance features and intra-domain relations with the dynamic graph to obtain instance intra-domain enhanced features; constructing a cross-modal dynamic bipartite graph structure, and convolutionally encoding the instance features and cross-domain relations with the dynamic bipartite graph to obtain instance cross-domain enhanced features; transform-coding the instance features to obtain instance self-transformation features; fusing these features to generate an instance fusion representation, generating a class prediction score from it, and optimizing the weights with a gradient descent algorithm; and computing similarity scores from the cosine distances between instance fusion representations to obtain the cross-modal retrieval data related to the instance object. The method thereby solves problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.

Description

Cross-modal stereoscopic vision object retrieval method and device
Technical Field
The application relates to the technical field of stereoscopic vision object retrieval, in particular to a cross-modal stereoscopic vision object retrieval method and device.
Background
Stereo vision is an important field in computer vision, and diverse data modalities are widely used in it, including point clouds, voxels, views, meshes, and the like. This diversity of data modalities stems from the diversity of usage scenarios and sensors. Many directions in the stereo vision field, such as 3D printing, stereo modeling, and robotics, involve stereo data of multiple modalities, and the data of a stereo object often needs to be retrieved across modalities.
However, in the related art, conversion between different modalities is expensive and usually accompanied by unrecoverable information loss. Moreover, in practical applications a stereo object may lack data for some modalities, so the task cannot be reduced to single-modality stereo object retrieval; an improvement is therefore urgently needed.
Summary of the application
The application provides a cross-modal stereoscopic visual object retrieval method and device, which are used for solving problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.
An embodiment of a first aspect of the present application provides a cross-modal stereoscopic visual object retrieval method, including the following steps: extracting depth features of each modality to obtain at least one instance; constructing a dynamic graph structure within the modal domain based on the at least one instance, and convolutionally encoding the instance features and the intra-domain relations using the dynamic graph to obtain instance intra-domain enhanced features; constructing a cross-modal dynamic bipartite graph structure based on the at least one instance, and convolutionally encoding the instance features and the instance cross-domain relations using the dynamic bipartite graph to obtain instance cross-domain enhanced features; transform-coding the instance features of the at least one instance to obtain instance self-transformation features; fusing the instance intra-domain enhanced features, the instance cross-domain enhanced features and the instance self-transformation features to generate an instance fusion representation; generating a class prediction score according to the instance fusion representation, and optimizing the weights using a gradient descent algorithm; and, based on the optimized weights, calculating similarity scores using the cosine distances between the instance fusion representations of the at least one instance to obtain the related cross-modal retrieval data of the instance object.
Optionally, in an embodiment of the present application, the extracting depth features of each modality to obtain at least one instance includes: extracting the depth features of each modal sample by using a preset depth representation model, wherein the depth representation model is trained, based on at least one classification task, on one or more of point cloud stereo data, mesh stereo data and view stereo data.
Optionally, in an embodiment of the application, the constructing a dynamic graph structure within the modal domain based on the at least one instance and convolutionally encoding the instance features and the intra-domain relations using the dynamic graph to obtain the instance intra-domain enhanced features includes: for each instance, calculating a first cosine distance between features, determining the neighbors of each instance one by one using a nearest neighbor algorithm, and establishing the dynamic graph structure within the modal domain; and generating the instance intra-domain enhancement features using dynamic graph convolutional coding based on the depth features of the instances and the intra-domain connections of the dynamic graph structure within the modal domain.
Optionally, in an embodiment of the present application, the constructing a cross-modal dynamic bipartite graph structure based on the at least one instance and convolutionally encoding the instance features and the instance cross-domain relations using the dynamic bipartite graph to obtain the instance cross-domain enhanced features includes: for each instance, calculating a second cosine distance between the features, obtaining the intra-domain neighbors of each instance using a nearest neighbor algorithm, establishing cross-domain connections between the instance and the instances of other modalities corresponding to its intra-domain neighbors, and constructing the dynamic bipartite graph structure; and generating the instance cross-domain enhancement features using dynamic bipartite graph convolutional encoding based on the depth features of the instances and the cross-domain connections of the dynamic bipartite graph structure.
Optionally, in an embodiment of the present application, the generating the class prediction score according to the instance fusion representation and optimizing the weights using a gradient descent algorithm includes: processing the instance fusion representation using at least one fully connected layer to generate a class score for the instance; and optimizing the learnable weights using a gradient descent method based on the class score and the annotated class.
The second aspect of the present application provides a cross-modal stereoscopic visual object retrieving apparatus, including: the characteristic extraction module is used for extracting the depth characteristics of each mode to obtain at least one example; the intra-domain feature enhancement module is used for constructing a dynamic graph structure in a modal domain based on the at least one instance, and obtaining an instance intra-domain enhancement feature by using the dynamic graph convolution coding instance feature and the instance intra-domain relation; the cross-domain feature enhancement module is used for constructing a cross-modal dynamic bipartite graph structure based on the at least one example, and carrying out convolution coding on the example features and the example cross-domain relation by using a dynamic bipartite graph to obtain example cross-domain enhancement features; the characteristic transformation module is used for carrying out transformation coding on the example characteristic of the at least one example to obtain an example self-transformation characteristic; the characteristic fusion module is used for fusing the instance intra-domain enhanced characteristic, the instance cross-domain enhanced characteristic and the instance self-transformation characteristic to generate an instance fusion representation; the weight optimization module is used for generating a class prediction score according to the instance fusion representation and optimizing the weight by using a gradient descent algorithm; and the retrieval module is used for calculating the similarity score by using the cosine distance between the instance fusion representations of the at least one instance based on the optimized weight to obtain the related cross-modal retrieval data of the instance object.
Optionally, in an embodiment of the present application, the feature extraction module is further configured to: extract the depth features of each modal sample by using a preset depth representation model, wherein the depth representation model is trained, based on at least one classification task, on one or more of point cloud stereo data, mesh stereo data and view stereo data.
Optionally, in an embodiment of the present application, the intra-domain feature enhancement module includes: a first calculation unit, configured to calculate, for each instance, a first cosine distance between the features, determine the neighbors of each instance one by one using a nearest neighbor algorithm, and establish the dynamic graph structure within the modal domain; and a first generating unit, configured to generate the instance intra-domain enhancement features using dynamic graph convolutional coding based on the depth features of the instances and the intra-domain connections of the dynamic graph structure within the modal domain.
Optionally, in an embodiment of the present application, the cross-domain feature enhancement module includes: the second calculation unit is used for calculating a second cosine distance between the features for each example, obtaining an intra-domain neighbor of each example by using a nearest neighbor algorithm, establishing cross-domain connection between the example and examples of other modes corresponding to the intra-domain neighbor of the example, and constructing the dynamic bipartite graph structure; a second generating unit, configured to generate the instance cross-domain enhancement feature using the dynamic bipartite graph convolutional coding based on the depth feature of the instance and the cross-domain connection of the dynamic bipartite graph structure.
Optionally, in an embodiment of the present application, the weight optimization module includes: a fusion unit, configured to process the instance fusion representation using at least one fully connected layer to generate a class score for the instance; and an optimization unit, configured to optimize the learnable weights using a gradient descent method according to the class score and the annotated class.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the cross-modality stereoscopic object retrieval method according to the above embodiments.
A fourth aspect of the present application provides a computer-readable storage medium storing computer instructions for causing a computer to execute a cross-modality stereoscopic visual object retrieval method according to the above embodiment.
According to the embodiment of the application, by constructing the dynamic graph and the dynamic bipartite graph and fusing their features, retrieval from query data of one modality to the data domain of a specified modality is achieved; the complex relations among data are used to improve the accuracy of the retrieval representation, effectively guaranteeing retrieval accuracy and reliability. This solves problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a cross-modal stereoscopic object retrieval method according to an embodiment of the present application;
fig. 2 is a flowchart of a cross-modal stereoscopic object retrieval method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a cross-modal stereoscopic object retrieval apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The cross-modal stereoscopic visual object retrieval method and apparatus according to the embodiments of the present application are described below with reference to the drawings. In the method, retrieval from query data of one modality to the data domain of a specified modality is achieved by constructing a dynamic graph and a dynamic bipartite graph and fusing their features; the complex relations between data are used to improve the accuracy of the retrieval representation, effectively ensuring retrieval accuracy and reliability. This solves problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.
Specifically, fig. 1 is a flowchart illustrating a cross-mode stereoscopic object retrieving method according to an embodiment of the present application.
As shown in fig. 1, the cross-modality stereoscopic visual object retrieval method includes the following steps:
in step S101, depth features of each modality are extracted, and at least one instance is obtained.
In the actual execution process, at least one instance can be obtained by extracting the depth features of each modality, which provides the basis for the subsequent intra-domain feature enhancement, cross-domain feature enhancement and related steps and thus facilitates high-precision cross-modal stereoscopic visual object retrieval.
Optionally, in an embodiment of the present application, extracting depth features of each modality to obtain at least one instance includes: extracting the depth features of each modal sample by using a preset depth representation model, wherein the depth representation model is trained, based on at least one classification task, on one or more of point cloud stereo data, mesh stereo data and view stereo data.
Specifically, the depth features of each modal sample can be extracted using a preset depth representation model, where the modalities include, but are not limited to, point cloud stereo data, mesh stereo data, and view stereo data. The embodiment of the present application can train a depth representation model for the specific structure of each single-modality stereo datum based on a classification task, and use the high-level features of the fully connected layer at the end of the model as the depth feature representation of the sample instance.
For example, the embodiment of the application can learn point cloud stereo data with the DGCNN model, mesh stereo data with the MeshNet model, and multi-view stereo data with the MVCNN model, and then use the high-level features of the fully connected layer at the tail of each model as the depth feature representation of the sample instance.
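As a concrete illustration of this step, the following minimal PyTorch sketch shows one way to obtain such instance features from a pretrained single-modality classifier. The helper name, the use of nn.Sequential, and the commented backbone variables are assumptions for illustration, not an API defined by this application:

```python
import torch
import torch.nn as nn

def strip_classifier_head(model: nn.Sequential) -> nn.Sequential:
    """Drop the final fully connected layer of a pretrained single-modality
    classifier so the remaining stack outputs the high-level features that
    serve as the instance's depth representation."""
    return nn.Sequential(*list(model.children())[:-1])

# Hypothetical pretrained backbones: the text names DGCNN (point clouds),
# MeshNet (meshes) and MVCNN (multi-view) but fixes no code interface.
# extractor = strip_classifier_head(pretrained_dgcnn)
# with torch.no_grad():
#     instance_features = extractor(point_cloud_batch)  # (N, D), one row per instance
```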
In step S102, a modal domain dynamic graph structure is constructed based on at least one instance, and instance features and instance domain relations are convolutionally encoded using the dynamic graph to obtain instance domain enhancement features.
As a possible implementation manner, the embodiment of the present application may construct a dynamic graph structure in a modal domain based on at least one instance, and obtain an enhanced feature in the instance domain by using a dynamic graph convolution to encode an instance feature and an instance intra-domain relationship. According to the embodiment of the application, the accuracy of cross-modal stereoscopic visual object retrieval is improved by acquiring the intra-instance domain enhancement features.
Optionally, in an embodiment of the present application, constructing a dynamic graph structure within the modal domain based on at least one instance and convolutionally encoding the instance features and the intra-domain relations using the dynamic graph to obtain the instance intra-domain enhanced features includes: for each instance, calculating a first cosine distance between the features, determining the neighbors of each instance one by one using a nearest neighbor algorithm, and establishing a dynamic graph structure within the modal domain; and generating the instance intra-domain enhancement features using dynamic graph convolutional coding based on the instance depth features and the intra-domain connections of the dynamic graph structure within the modal domain.
In some embodiments, for each modality instance, cosine distances between instance depth features may be calculated, neighbors of the instance may be determined one by one using a nearest neighbor algorithm, thereby establishing a dynamic graph structure, and an intra-domain enhancement feature of the instance may be generated using the proposed dynamic graph convolutional coding based on the instance depth features and intra-domain connections.
Specifically, the dynamic graph convolution may be computed as:

$$\mathbf{x}^{I}_{\alpha,i}=\sigma\Big(\frac{1}{k}\sum_{j\in N_{\alpha,i}}\theta_{\alpha\to\alpha}\,\mathbf{x}_{\alpha,j}\Big),\qquad \mathbf{x}^{I}_{\beta,i}=\sigma\Big(\frac{1}{k}\sum_{j\in N_{\beta,i}}\theta_{\beta\to\beta}\,\mathbf{x}_{\beta,j}\Big)$$

where $\mathbf{x}_{\alpha,j}$ denotes the feature of the instance with index $j$ whose node modality is $\alpha$ (the initial instance features are those obtained in step S101), $\sigma$ denotes a nonlinear activation function, $k$ is the parameter of the nearest neighbor algorithm, $N_{\alpha,i}$ and $N_{\beta,i}$ denote the neighbor index sets of the node with index $i$ constructed by the nearest neighbor algorithm, $\theta_{\alpha\to\alpha}$ and $\theta_{\beta\to\beta}$ are learnable matrix parameters, and $\mathbf{x}^{I}_{\alpha,i}$ and $\mathbf{x}^{I}_{\beta,i}$ are the resulting instance intra-domain enhancement features.
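A minimal sketch of this step, assuming PyTorch and row-vector instance features; ReLU is an assumed choice for the activation $\sigma$, and the function names are illustrative only:

```python
import torch
import torch.nn.functional as F

def knn_cosine(x: torch.Tensor, k: int) -> torch.Tensor:
    """Neighbor index sets N_{.,i}: the k nearest neighbors of each row of x
    under cosine distance (equivalently, the highest cosine similarity)."""
    xn = F.normalize(x, dim=1)
    return (xn @ xn.t()).topk(k, dim=1).indices   # shape (N, k)

def intra_domain_conv(x: torch.Tensor, theta: torch.nn.Linear, k: int) -> torch.Tensor:
    """One dynamic graph convolution within a modal domain: transform the
    neighbors' features with the learnable matrix theta, average over the
    k neighbors, and apply the nonlinear activation."""
    idx = knn_cosine(x, k)                        # graph is rebuilt from the current features
    return F.relu(theta(x)[idx].mean(dim=1))      # (N, D') intra-domain enhanced features
```

Because the neighbor sets are recomputed from the current features, the graph structure is dynamic rather than fixed, which is the point of the "dynamic" graph here.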
In step S103, a cross-modal dynamic bipartite graph structure is constructed based on at least one instance, and instance features and instance cross-domain relations are convolutionally encoded using the dynamic bipartite graph to obtain instance cross-domain enhancement features.
In the actual execution process, a cross-mode dynamic bipartite graph structure can be constructed, and the dynamic bipartite graph convolutional coding example characteristics and example cross-domain relations are used, so that example cross-domain enhancement characteristics are obtained. According to the embodiment of the application, the accuracy of cross-modal stereoscopic visual object retrieval is improved by acquiring the cross-domain enhancement features of the example.
Optionally, in an embodiment of the present application, constructing a cross-modal dynamic bipartite graph structure based on at least one instance and convolutionally encoding the instance features and the instance cross-domain relations using the dynamic bipartite graph to obtain the instance cross-domain enhancement features includes: for each instance, calculating a second cosine distance between the features, obtaining the intra-domain neighbors of each instance using a nearest neighbor algorithm, establishing cross-domain connections between the instance and the other-modality instances corresponding to its intra-domain neighbors, and constructing a dynamic bipartite graph structure; and generating the instance cross-domain enhancement features using dynamic bipartite graph convolutional encoding based on the depth features of the instance and the cross-domain connections of the dynamic bipartite graph structure.
It can be understood that the dynamic bipartite graph describes example relations between two domains, for examples of each modality, the embodiment of the application can calculate cosine distances between example depth features, obtain intra-domain neighbors of the examples by using a nearest neighbor algorithm, establish cross-domain connections of the examples and examples of other modalities corresponding to the intra-domain neighbors of the examples, construct the dynamic bipartite graph, and simultaneously generate cross-domain enhancement features of the examples by using the proposed dynamic bipartite graph convolutional coding based on the depth features and the cross-domain connections of the examples.
The dynamic bipartite graph convolution formula is as follows:

$$\mathbf{x}^{C}_{\alpha,i}=\sigma\Big(\frac{1}{b}\sum_{j\in N^{C}_{\alpha,i}}\theta_{\beta\to\alpha}\,\mathbf{x}_{\beta,j}\Big),\qquad \mathbf{x}^{C}_{\beta,i}=\sigma\Big(\frac{1}{b}\sum_{j\in N^{C}_{\beta,i}}\theta_{\alpha\to\beta}\,\mathbf{x}_{\alpha,j}\Big)$$

where $\mathbf{x}_{\alpha,j}$ denotes the feature of the instance with index $j$ whose node modality is $\alpha$, obtained in step S101, $\sigma$ denotes a nonlinear activation function, $b$ is the parameter of the nearest neighbor algorithm, $N^{C}_{\alpha,i}$ and $N^{C}_{\beta,i}$ denote the cross-domain neighbor index sets of the node with index $i$ constructed by the nearest neighbor algorithm, $\theta_{\beta\to\alpha}$ and $\theta_{\alpha\to\beta}$ are learnable matrix parameters, and $\mathbf{x}^{C}_{\alpha,i}$ and $\mathbf{x}^{C}_{\beta,i}$ are the resulting instance cross-domain enhancement features.
In step S104, transform coding is performed on the instance feature of at least one instance to obtain an instance self-transform feature.
For example, the embodiment of the present application may multiply the instance features by a learnable matrix and apply a nonlinear activation function to obtain the self-transformation features of the instance.
The instance self-transformation features are computed as:

$$\mathbf{x}^{S}_{\alpha,i}=\sigma(\theta_{\alpha}\,\mathbf{x}_{\alpha,i}),\qquad \mathbf{x}^{S}_{\beta,i}=\sigma(\theta_{\beta}\,\mathbf{x}_{\beta,i})$$

where $\mathbf{x}_{\alpha,i}$ denotes the feature of the instance with index $i$ whose node modality is $\alpha$, obtained in step S101, $\sigma$ denotes a nonlinear activation function, $\theta_{\alpha}$ and $\theta_{\beta}$ are learnable matrix parameters, and $\mathbf{x}^{S}_{\alpha,i}$ and $\mathbf{x}^{S}_{\beta,i}$ are the resulting instance self-transformation features.
In step S105, the instance intra-domain enhanced feature, the instance cross-domain enhanced feature, and the instance self-transformation feature are fused to generate an instance fusion representation.
Specifically, the embodiment of the present application may fuse the intra-domain enhancement features, cross-domain enhancement features, and self-transformation features of the instance obtained in the above steps, using hyper-parameter-weighted addition to obtain the fused representation of the instance.
The instance fusion representation may be generated as:

$$\mathbf{x}^{F}_{\alpha,i}=\lambda_{S}\,\mathbf{x}^{S}_{\alpha,i}+\lambda_{I}\,\mathbf{x}^{I}_{\alpha,i}+\lambda_{C}\,\mathbf{x}^{C}_{\alpha,i}$$

where $\lambda_{S}$, $\lambda_{I}$ and $\lambda_{C}$ are preset hyper-parameters and $\mathbf{x}^{F}_{\alpha,i}$ is the fused representation of the instance (and likewise for modality $\beta$).
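The remaining two feature computations reduce to a linear map with activation and a weighted sum; a sketch under the same assumptions, with the lambda values supplied as preset hyper-parameters:

```python
import torch
import torch.nn.functional as F

def self_transform(x: torch.Tensor, theta: torch.nn.Linear) -> torch.Tensor:
    """Instance self-transformation: learnable matrix multiply plus activation."""
    return F.relu(theta(x))

def fuse(x_s: torch.Tensor, x_i: torch.Tensor, x_c: torch.Tensor,
         lam_s: float, lam_i: float, lam_c: float) -> torch.Tensor:
    """Hyper-parameter-weighted addition: x^F = lam_s*x^S + lam_i*x^I + lam_c*x^C."""
    return lam_s * x_s + lam_i * x_i + lam_c * x_c
```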
In step S106, a class prediction score is generated from the instance fusion representation and weights are optimized using a gradient descent algorithm.
As a possible implementation, the embodiment of the present application may generate a prediction score from the fused representation obtained in the above steps and optimize the weights using a gradient descent algorithm. By constructing the dynamic graph and the dynamic bipartite graph and fusing their features, retrieval from query data of one modality to the data domain of a specified modality is achieved; the complex relations among data improve the accuracy of the retrieval representation and effectively guarantee retrieval accuracy and reliability.
Optionally, in an embodiment of the present application, generating the class prediction score according to the instance fusion representation and optimizing the weights using a gradient descent algorithm includes: processing the instance fusion representation using at least one fully connected layer to generate a class score for the instance; and optimizing the learnable weights using a gradient descent method based on the class score and the annotated class.
Specifically, the embodiment of the application can process the fused representation of the instance using at least one fully connected layer to generate the class score of the instance, and optimize the learnable weights in the process using a gradient descent method according to the predicted class score and the annotated class of the data. By constructing the dynamic graph and the dynamic bipartite graph and fusing their features, retrieval from query data of one modality to the data domain of a specified modality is achieved; optimizing the learnable weights and using the complex relations among data improve the accuracy of the retrieval representation, effectively guaranteeing retrieval accuracy and reliability.
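As a hedged sketch of this optimization step: the classification head below is a single fully connected layer, and cross-entropy loss with an off-the-shelf optimizer stands in for the gradient descent procedure. Only the fully connected layer(s), class scores, annotated classes and gradient descent are specified by the text; the rest are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(fused: torch.Tensor, labels: torch.Tensor,
               classifier: torch.nn.Linear,
               optimizer: torch.optim.Optimizer) -> float:
    """Generate class prediction scores from the fused representations and
    take one gradient-descent step on all learnable weights."""
    scores = classifier(fused)                # class prediction scores
    loss = F.cross_entropy(scores, labels)    # compare against annotated classes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```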
In step S107, based on the optimized weight, a similarity score is calculated using a cosine distance between the instance fusion representations of at least one instance to obtain related cross-modal retrieval data of the instance object.
It can be understood that the instance fusion representation refers to the instance features obtained in step S105. In the embodiment of the present application, the fusion representations of all instances can be computed, and the related cross-modal retrieval data of the instance object can be obtained by computing the distances between the representation of the instance to be retrieved and the representations within the retrieval range. By constructing the dynamic graph and the dynamic bipartite graph and fusing their features, retrieval from query data of one modality to the data domain of a specified modality is achieved; the complex relations among data improve the accuracy of the retrieval representation and effectively guarantee retrieval accuracy and reliability.
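Retrieval then amounts to ranking fusion representations by cosine similarity; a short sketch, with top_k as an illustrative parameter:

```python
import torch
import torch.nn.functional as F

def retrieve(query_fused: torch.Tensor, gallery_fused: torch.Tensor,
             top_k: int = 10) -> torch.Tensor:
    """Rank the target-modality instances by cosine similarity between
    fusion representations and return the indices of the top_k matches."""
    q = F.normalize(query_fused, dim=1)   # after normalization, dot product = cosine similarity
    g = F.normalize(gallery_fused, dim=1)
    return (q @ g.t()).topk(top_k, dim=1).indices
```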
The working principle of the stereoscopic object retrieving method according to the embodiment of the present application is described in detail with reference to fig. 2.
As shown in fig. 2, the embodiment of the present application may include the following steps:
step S201: depth features of each modal sample are extracted using a feature extractor. Specifically, the depth features of each modal sample can be extracted by using a preset depth representation model, where the depth features include, but are not limited to, stereo data, mesh stereo data, and view stereo data, and the embodiment of the present application can train a depth representation model for a specific structure of single modal stereo data based on a classification task, and based on a fully connected layer at the end of the model, use a high-level feature of the layer as a depth feature representation example of the sample.
For example, the embodiment of the application can learn point cloud stereo data with the DGCNN model, mesh stereo data with the MeshNet model, and multi-view stereo data with the MVCNN model, and then use the high-level features of the fully connected layer at the tail of each model as the depth feature representation of the sample instance.
Step S202: and constructing a dynamic graph structure in the modal domain, and using the dynamic graph convolution coding example characteristics and the example domain relation to obtain the example domain enhancement characteristics. In some embodiments, for each modality instance, cosine distances between instance depth features may be calculated, neighbors of the instance may be determined one by one using a nearest neighbor algorithm, thereby establishing a dynamic graph structure, and an intra-domain enhancement feature of the instance may be generated using the proposed dynamic graph convolutional coding based on the instance depth features and intra-domain connections.
Specifically, the dynamic graph convolution is computed as:

$$\mathbf{x}^{I}_{\alpha,i}=\sigma\Big(\frac{1}{k}\sum_{j\in N_{\alpha,i}}\theta_{\alpha\to\alpha}\,\mathbf{x}_{\alpha,j}\Big),\qquad \mathbf{x}^{I}_{\beta,i}=\sigma\Big(\frac{1}{k}\sum_{j\in N_{\beta,i}}\theta_{\beta\to\beta}\,\mathbf{x}_{\beta,j}\Big)$$

where $\mathbf{x}_{\alpha,j}$ denotes the feature of the instance with index $j$ whose node modality is $\alpha$ (the initial instance features are those obtained in step S201), $\sigma$ denotes a nonlinear activation function, $k$ is the parameter of the nearest neighbor algorithm, $N_{\alpha,i}$ and $N_{\beta,i}$ denote the neighbor index sets of the node with index $i$ constructed by the nearest neighbor algorithm, $\theta_{\alpha\to\alpha}$ and $\theta_{\beta\to\beta}$ are learnable matrix parameters, and $\mathbf{x}^{I}_{\alpha,i}$ and $\mathbf{x}^{I}_{\beta,i}$ are the resulting instance intra-domain enhancement features.
Step S203: and constructing a cross-modal dynamic bipartite graph structure, and carrying out convolutional coding on the example characteristics and the example cross-domain relation by using the dynamic bipartite graph to obtain the example cross-domain enhancement characteristics. It can be understood that the dynamic bipartite graph describes example relations between two domains, for examples of each modality, the embodiment of the application can calculate cosine distances between example depth features, obtain intra-domain neighbors of the examples by using a nearest neighbor algorithm, establish cross-domain connections of the examples and examples of other modalities corresponding to the intra-domain neighbors of the examples, construct the dynamic bipartite graph, and simultaneously generate cross-domain enhancement features of the examples by using the proposed dynamic bipartite graph convolutional coding based on the depth features and the cross-domain connections of the examples.
The dynamic bipartite graph convolution formula is as follows:

$$\mathbf{x}^{C}_{\alpha,i}=\sigma\Big(\frac{1}{b}\sum_{j\in N^{C}_{\alpha,i}}\theta_{\beta\to\alpha}\,\mathbf{x}_{\beta,j}\Big),\qquad \mathbf{x}^{C}_{\beta,i}=\sigma\Big(\frac{1}{b}\sum_{j\in N^{C}_{\beta,i}}\theta_{\alpha\to\beta}\,\mathbf{x}_{\alpha,j}\Big)$$

where $\mathbf{x}_{\alpha,j}$ denotes the feature of the instance with index $j$ whose node modality is $\alpha$, obtained in step S201, $\sigma$ denotes a nonlinear activation function, $b$ is the parameter of the nearest neighbor algorithm, $N^{C}_{\alpha,i}$ and $N^{C}_{\beta,i}$ denote the cross-domain neighbor index sets of the node with index $i$ constructed by the nearest neighbor algorithm, $\theta_{\beta\to\alpha}$ and $\theta_{\alpha\to\beta}$ are learnable matrix parameters, and $\mathbf{x}^{C}_{\alpha,i}$ and $\mathbf{x}^{C}_{\beta,i}$ are the resulting instance cross-domain enhancement features.
Step S204: transform coding is performed on the instance's own features to obtain the instance self-transformation features. For example, the embodiment of the present application may multiply the instance features by a learnable matrix and apply a nonlinear activation function to obtain the self-transformation features of the instance.
The instance self-transformation features are computed as:

$$\mathbf{x}^{S}_{\alpha,i}=\sigma(\theta_{\alpha}\,\mathbf{x}_{\alpha,i}),\qquad \mathbf{x}^{S}_{\beta,i}=\sigma(\theta_{\beta}\,\mathbf{x}_{\beta,i})$$

where $\mathbf{x}_{\alpha,i}$ denotes the feature of the instance with index $i$ whose node modality is $\alpha$, obtained in step S201, $\sigma$ denotes a nonlinear activation function, $\theta_{\alpha}$ and $\theta_{\beta}$ are learnable matrix parameters, and $\mathbf{x}^{S}_{\alpha,i}$ and $\mathbf{x}^{S}_{\beta,i}$ are the resulting instance self-transformation features.
Step S205: and fusing the intra-domain enhanced features, the cross-domain enhanced features and the self-transformation features of the instances to generate a fused representation of the instances. Specifically, the embodiment of the present application may fuse the intra-domain enhancement feature, the cross-domain enhancement feature, and the self-transformation feature of the instance obtained in the above steps, and obtain a fused representation of the instance by using the super-parameter weighted addition.
The instance fusion representation is generated as:

$$\mathbf{x}^{F}_{\alpha,i}=\lambda_{S}\,\mathbf{x}^{S}_{\alpha,i}+\lambda_{I}\,\mathbf{x}^{I}_{\alpha,i}+\lambda_{C}\,\mathbf{x}^{C}_{\alpha,i}$$

where $\lambda_{S}$, $\lambda_{I}$ and $\lambda_{C}$ are preset hyper-parameters and $\mathbf{x}^{F}_{\alpha,i}$ is the fused representation of the instance (and likewise for modality $\beta$).
Step S206: a class prediction score is generated from the fused representation, and the weights are optimized using a gradient descent algorithm. Specifically, the embodiment of the application can process the fused representation of the instance using at least one fully connected layer to generate the class score of the instance, and optimize the learnable weights in the process using a gradient descent method according to the predicted class score and the annotated class of the data.
Step S207: similarity scores are calculated using the cosine distances between the instance fusion representations to obtain the related cross-modal retrieval data of the instance object. It can be understood that the instance fusion representation refers to the instance features obtained in step S205; the fusion representations of all instances can be computed, and the related cross-modal retrieval data of the instance object can be obtained by computing the distances between the representation of the instance to be retrieved and the representations within the retrieval range. By constructing the dynamic graph and the dynamic bipartite graph and fusing their features, retrieval from query data of one modality to the data domain of a specified modality is achieved; the complex relations among data improve the accuracy of the retrieval representation and effectively guarantee retrieval accuracy and reliability.
According to the cross-modal stereoscopic vision object retrieval method provided by the embodiment of the application, retrieval from data of one modality to the data domain of a specified modality is achieved by constructing the dynamic graph and the dynamic bipartite graph and fusing their features; the complex relations among data are used to improve the accuracy of the retrieval representation, and retrieval accuracy and reliability are effectively ensured. This solves problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.
Next, a cross-modal stereoscopic visual object retrieval apparatus according to an embodiment of the present application will be described with reference to the drawings.
Fig. 3 is a block diagram schematically illustrating a cross-modality stereoscopic object retrieval apparatus according to an embodiment of the present application.
As shown in fig. 3, the cross-modal stereoscopic visual object retrieval apparatus 10 includes: the feature extraction module 100, the intra-domain feature enhancement module 200, the cross-domain feature enhancement module 300, the feature transformation module 400, the feature fusion module 500, the weight optimization module 600 and the retrieval module 700.
Specifically, the feature extraction module 100 is configured to extract depth features of each modality, so as to obtain at least one example.
And the intra-domain feature enhancement module 200 is configured to construct a modal intra-domain dynamic graph structure based on at least one instance, and obtain an instance intra-domain enhancement feature by using the dynamic graph convolution coding instance feature and the instance intra-domain relation.
The cross-domain feature enhancement module 300 is configured to construct a cross-modal dynamic bipartite graph structure based on at least one instance, and obtain an instance cross-domain enhancement feature by using a dynamic bipartite graph convolution to encode an instance feature and an instance cross-domain relationship.
A feature transformation module 400, configured to perform transform coding on the instance feature of at least one instance to obtain an instance self-transformation feature.
And the feature fusion module 500 is used for fusing the instance intra-domain enhanced features, the instance cross-domain enhanced features and the instance self-transformation features to generate an instance fusion representation.
And a weight optimization module 600, configured to generate the class prediction score according to the instance fusion representation, and optimize the weight using a gradient descent algorithm.
And the retrieval module 700 is configured to calculate a similarity score using a cosine distance between the instance fusion representations of the at least one instance based on the optimized weight to obtain related cross-modal retrieval data of the instance object.
Optionally, in an embodiment of the present application, the feature extraction module 100 is further configured to extract the depth features of each modal sample by using a preset depth representation model, where the depth representation model is trained, based on at least one classification task, on one or more of point cloud stereo data, mesh stereo data, and view stereo data.
Optionally, in an embodiment of the present application, the intra-domain feature enhancement module 200 includes: a first calculating unit and a first generating unit.
The first calculating unit is used for calculating a first cosine distance between the features for each example, determining the neighbors of each example one by one using a nearest neighbor algorithm, and establishing a dynamic graph structure in the modal domain.
And the first generation unit is used for generating the enhancement features in the example domain by using the dynamic graph convolution coding based on the depth features of the examples and the intra-domain connection of the dynamic graph structure in the modal domain.
Optionally, in an embodiment of the present application, the cross-domain feature enhancement module 300 includes: a second calculation unit and a second generation unit.
The second calculation unit is used for calculating a second cosine distance between the features for each example, obtaining an intra-domain neighbor of each example by using a nearest neighbor algorithm, establishing cross-domain connection between the example and examples of other modes corresponding to the intra-domain neighbor of the example, and constructing a dynamic bipartite graph structure.
And the second generation unit is used for generating the example cross-domain enhanced feature by using the dynamic bipartite graph convolutional coding based on the example depth feature and the cross-domain connection of the dynamic bipartite graph structure.
Optionally, in an embodiment of the present application, the weight optimization module 600 includes: a fusion unit and an optimization unit.
And the fusion unit is used for processing the instance fusion representation by using the full connection layer of at least one layer and generating the class score of the instance.
And the optimization unit is used for optimizing the learnable weight by using a gradient descent method according to the category score and the labeled category.
It should be noted that the foregoing explanation of the cross-modal stereoscopic object retrieving method embodiment is also applicable to the cross-modal stereoscopic object retrieving apparatus of this embodiment, and is not repeated here.
According to the cross-modal stereoscopic vision object retrieval device provided by the embodiment of the application, retrieval from data of one modality to the data domain of a specified modality is achieved by constructing the dynamic graph and the dynamic bipartite graph and fusing their features; the complex relations among data are used to improve retrieval accuracy, and retrieval accuracy and reliability are effectively ensured. This solves problems in the related art such as the inability to retrieve directly between modalities and the limited accuracy and speed of cross-modal retrieval.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
memory 401, processor 402, and computer programs stored on memory 401 and executable on processor 402.
The processor 402, when executing the program, implements the cross-modal stereoscopic object retrieval method provided in the above-described embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing computer programs executable on the processor 402.
Memory 401 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 401, the processor 402 and the communication interface 403 are implemented independently, the communication interface 403, the memory 401 and the processor 402 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Alternatively, in practical implementation, if the memory 401, the processor 402 and the communication interface 403 are integrated on a chip, the memory 401, the processor 402 and the communication interface 403 may complete communication with each other through an internal interface.
The processor 402 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the cross-modality stereoscopic visual object retrieval method as above.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A cross-modal stereoscopic visual object retrieval method is characterized by comprising the following steps:
extracting depth features of each mode to obtain at least one example;
constructing a dynamic graph structure in a modal domain based on the at least one instance, and using the dynamic graph convolution to encode the instance characteristics and the relationship in the instance domain to obtain enhanced characteristics in the instance domain;
constructing a cross-modal dynamic bipartite graph structure based on the at least one instance, and carrying out convolution coding on the instance characteristics and the instance cross-domain relation by using the dynamic bipartite graph to obtain instance cross-domain enhancement characteristics;
carrying out transform coding on the example characteristics of the at least one example to obtain example self-transformation characteristics;
fusing the instance intra-domain enhanced features, the instance cross-domain enhanced features and the instance self-transformation features to generate instance fusion representation;
generating a class prediction score according to the instance fusion representation, and optimizing the weight by using a gradient descent algorithm; and
based on the optimized weight, calculating a similarity score by using the cosine distance between the instance fusion representations of the at least one instance to obtain related cross-modal retrieval data of the instance object.
2. The method according to claim 1, wherein the extracting depth features of each modality to obtain at least one instance comprises:
and extracting the depth features of each modal sample by using a preset depth representation model, wherein the depth representation model is trained, based on at least one classification task, on one or more of point cloud stereo data, mesh stereo data and view stereo data.
3. The method according to claim 1, wherein constructing a modal domain-wide dynamic graph structure based on the at least one instance, and convolutionally encoding instance features and instance-domain relationships using a dynamic graph to obtain instance-domain enhanced features comprises:
for each instance, calculating a first cosine distance between features, determining the neighbors of each instance one by one using a nearest neighbor algorithm, and establishing a dynamic graph structure in the modal domain;
generating the instance intra-domain enhancement features using the dynamic graph convolution coding based on depth features of instances and intra-domain connections of dynamic graph structures within the modal domain.
4. The method of claim 3, wherein constructing a cross-modal dynamic bipartite structure based on the at least one instance, convolutionally encoding the instance features and instance cross-domain relationships using a dynamic bipartite graph to obtain instance cross-domain enhancement features, comprises:
for each instance, calculating a second cosine distance between the features, obtaining an intra-domain neighbor of each instance by using a nearest neighbor algorithm, establishing cross-domain connection between the instance and instances of other modalities corresponding to the intra-domain neighbor of the instance, and constructing the dynamic bipartite graph structure;
generating the instance cross-domain enhancement features using the dynamic bipartite graph convolutional encoding based on the depth features of the instance and the cross-domain connections of the dynamic bipartite graph structure.
5. The method according to any one of claims 1-4, characterized in that generating the class prediction scores from the instance fusion representation and optimizing the weights using a gradient descent algorithm comprises:
processing the instance fusion representation with at least one fully connected layer to generate class scores for the instances; and
optimizing the learnable weights by gradient descent based on the class scores and the labeled classes.
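A minimal training-step sketch of claim 5: fully connected layers map the fused representation to class scores, and cross-entropy against the labeled classes drives gradient descent on the learnable weights. The layer sizes, SGD optimizer and learning rate are assumptions.

```python
import torch

num_classes, dim = 40, 256          # illustrative sizes
head = torch.nn.Sequential(         # "at least one fully connected layer"
    torch.nn.Linear(dim, dim), torch.nn.ReLU(),
    torch.nn.Linear(dim, num_classes),
)
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)

def train_step(fused: torch.Tensor, labels: torch.Tensor) -> float:
    scores = head(fused)                                   # (batch, num_classes)
    loss = torch.nn.functional.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()                                        # gradients w.r.t. weights
    optimizer.step()                                       # gradient descent update
    return loss.item()
```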
6. A cross-modal stereoscopic visual object retrieval apparatus, characterized by comprising:
a feature extraction module, configured to extract the depth features of each modality to obtain at least one instance;
an intra-domain feature enhancement module, configured to construct a dynamic graph structure within the modal domain based on the at least one instance, and to encode the instance features and intra-domain relationships by dynamic graph convolution to obtain instance intra-domain enhanced features;
a cross-domain feature enhancement module, configured to construct a cross-modal dynamic bipartite graph structure based on the at least one instance, and to encode the instance features and cross-domain relationships by dynamic bipartite graph convolution to obtain instance cross-domain enhanced features;
a feature transformation module, configured to apply transformation coding to the instance features of the at least one instance to obtain instance self-transformation features;
a feature fusion module, configured to fuse the instance intra-domain enhanced features, the instance cross-domain enhanced features and the instance self-transformation features to generate an instance fusion representation;
a weight optimization module, configured to generate class prediction scores from the instance fusion representation and to optimize the weights using a gradient descent algorithm; and
a retrieval module, configured to compute, based on the optimized weights, similarity scores from the cosine distances between the instance fusion representations of the at least one instance, to obtain the cross-modal retrieval data related to the instance object.
7. The apparatus according to claim 6, characterized in that the intra-domain feature enhancement module comprises:
a first calculation unit, configured to compute, for each instance, a first cosine distance between the features, determine the neighbors of each instance one by one using a nearest-neighbor algorithm, and establish the dynamic graph structure within the modal domain; and
a first generation unit, configured to generate the instance intra-domain enhanced features by dynamic graph convolution coding, based on the depth features of the instances and the intra-domain connections of the dynamic graph structure within the modal domain.
8. The apparatus according to claim 7, characterized in that the cross-domain feature enhancement module comprises:
a second calculation unit, configured to compute, for each instance, a second cosine distance between the features, obtain the intra-domain neighbors of each instance using a nearest-neighbor algorithm, establish cross-domain connections between the instance and the instances of the other modalities corresponding to its intra-domain neighbors, and construct the dynamic bipartite graph structure; and
a second generation unit, configured to generate the instance cross-domain enhanced features by dynamic bipartite graph convolution coding, based on the depth features of the instances and the cross-domain connections of the dynamic bipartite graph structure.
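The modular decomposition of claims 6-8 maps naturally onto a composed network. In the sketch below every submodule is a linear stand-in (an assumption: the real intra-domain and cross-domain modules would carry the graph logic sketched after claims 3 and 4), so the point is the module structure, not the layers themselves.

```python
import torch

class CrossModalRetrievalNet(torch.nn.Module):
    """Illustrative module composition only; all submodules are stand-ins."""
    def __init__(self, dim: int = 256, num_classes: int = 40):
        super().__init__()
        self.intra_enhance = torch.nn.Linear(dim, dim)       # intra-domain feature enhancement module
        self.cross_enhance = torch.nn.Linear(dim, dim)       # cross-domain feature enhancement module
        self.self_transform = torch.nn.Linear(dim, dim)      # feature transformation module
        self.fusion = torch.nn.Linear(3 * dim, dim)          # feature fusion module
        self.classifier = torch.nn.Linear(dim, num_classes)  # drives weight optimization

    def forward(self, feats: torch.Tensor):
        parts = [self.intra_enhance(feats),
                 self.cross_enhance(feats),
                 self.self_transform(feats)]
        fused = self.fusion(torch.cat(parts, dim=1))         # instance fusion representation
        return fused, self.classifier(fused)                 # repr for retrieval + class scores
```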
9. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the cross-modal stereoscopic visual object retrieval method according to any one of claims 1-5.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the cross-modal stereoscopic visual object retrieval method according to any one of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210145571.9A 2022-02-17 2022-02-17 Cross-modal stereoscopic vision object retrieval method and device

Publications (1)

Publication Number Publication Date
CN114547364A true CN114547364A (en) 2022-05-27

Family

ID=81676374


Similar Documents

Publication Title
CN108701250B (en) Data fixed-point method and device
CN110287942B (en) Training method of age estimation model, age estimation method and corresponding device
JP7428516B2 (en) Learning neural networks to infer editable feature trees
CN111694917B (en) Vehicle abnormal track detection and model training method and device
US11073828B2 (en) Compression of semantic information for task and motion planning
JP2008527473A (en) 3D model search method, search device, and search program
US10539881B1 (en) Generation of hotspot-containing physical design layout patterns
JP2020109660A (en) To form data set for inference of editable feature tree
US20210254995A1 (en) Methods, apparatuses, systems, and storage media for storing and loading visual localization maps
CN114580263A (en) Knowledge graph-based information system fault prediction method and related equipment
CN113792768A (en) Hypergraph neural network classification method and device
CN115527036A (en) Power grid scene point cloud semantic segmentation method and device, computer equipment and medium
Hou et al. FuS-GCN: Efficient B-rep based graph convolutional networks for 3D-CAD model classification and retrieval
CN115098717A (en) Three-dimensional model retrieval method and device, electronic equipment and storage medium
KR20200117690A (en) Method and Apparatus for Completing Knowledge Graph Based on Convolutional Learning Using Multi-Hop Neighborhoods
CN116484878B (en) Semantic association method, device, equipment and storage medium of power heterogeneous data
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
CN112802034A (en) Image segmentation and recognition method, model construction method and device and electronic equipment
CN114565092A (en) Neural network structure determining method and device
CN114547364A (en) Cross-modal stereoscopic vision object retrieval method and device
KR20210090249A (en) Image processing method, apparatus, vehicle-mounted computing platform, electronic device and system
CN111767934B (en) Image recognition method and device and electronic equipment
CN114581261A (en) Fault diagnosis method, system, equipment and storage medium based on quick graph calculation
CN117253209B (en) Automatic driving point cloud detection method, device, communication equipment and storage medium
CN115953559B (en) Virtual object processing method and device

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination