WO2023201990A1 - Visual positioning method, apparatus, device and medium - Google Patents

Visual positioning method, apparatus, device and medium

Info

Publication number
WO2023201990A1
WO2023201990A1 (application PCT/CN2022/122335, CN2022122335W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
preset
unit
text
Prior art date
Application number
PCT/CN2022/122335
Other languages
English (en)
French (fr)
Inventor
李晓川
李仁刚
赵雅倩
郭振华
范宝余
Original Assignee
苏州浪潮智能科技有限公司
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023201990A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T9/00 Image coding
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a visual positioning method, device, equipment and medium.
  • In recent years, multi-modality (Multi-Modal, MM) has become a very important research direction in artificial intelligence. Because it emphasizes the fusion of visual, textual, speech and other information, multi-modal algorithms keep emerging: methods based on convolutional neural networks (Convolutional Neural Networks, CNN) and attention mechanisms each have a wide range of applications and have become mainstream in fields such as Visual Commonsense Reasoning (VCR), Visual Question Answering (VQA) and Visual Grounding (VG).
  • The visual positioning (visual grounding) task is one of the important research directions in multi-modal artificial intelligence. It aims to locate the object in a picture that a textual description refers to and to give the coordinate position of that object.
  • However, existing visual grounding methods still face problems in deployment that are easily overlooked in laboratory research, such as errors in the text, also called noise.
  • Text errors are text distortions caused by human factors. Slips of the tongue, subjective bias when describing an object, and ambiguous description sentences all lead to errors in the text. Such errors are very common in daily life, but they are easily ignored during AI algorithm design, which becomes a barrier between existing methods and their practical deployment. In short, when the input text contains certain errors, it is difficult for existing methods to find and locate the object that the sentence actually intends to describe.
  • This application provides a visual positioning method, including:
  • the preset noise correction unit is used to perform image-text noise correction processing on the first fused coding feature and the text coding feature respectively, to obtain the corrected fusion feature and the corrected text coding feature;
  • the preset noise correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism;
  • through the preset target frame correction unit, and using the target coding features determined based on the corrected fusion features and the second fused coding features, the preset frame features are corrected, and the corrected frame features are used to predict the position of the target visual object in the target image;
  • the preset target frame correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
  • before using the preset image-text feature fusion unit built based on the preset self-attention mechanism to perform image-text feature fusion processing on the spliced coding features, the method also includes:
  • the preset image and text feature fusion unit built based on the preset self-attention mechanism is used to perform image and text feature fusion processing on the spliced coding features to obtain the first fused coding features, including:
  • the first fused coding feature is obtained according to the current operation processing result.
  • obtaining the first fused coding feature according to the current operation processing result includes:
  • the current operation processing result is used as the first fused coding feature.
  • the current image and text feature fusion sub-unit is used to sequentially perform self-attention operation, layer normalization operation, feature deletion operation and feature addition operation on the features to be processed to obtain the corresponding current operation processing result.
  • the feature addition unit in the current image and text feature fusion sub-unit is used to perform a feature addition operation on the third operation feature and the feature to be processed to obtain the calculation processing result in the current image and text feature fusion sub-unit.
  • before using the preset noise correction unit to respectively perform image-text noise correction processing on the first fused coding feature and the text coding feature, the method also includes:
  • the first noise correction sub-unit is constructed using the self-attention operation unit, feature deletion unit, layer normalization unit, and feature addition unit built based on the preset self-attention mechanism;
  • a preset noise correction unit is constructed by sequentially connecting the first noise correction subunit and a second preset number of second noise correction subunits in series.
  • a preset noise correction unit is used to perform image and text noise correction processing on the first fused coding feature and the text coding feature respectively to obtain the modified fusion feature and the modified text coding feature, including:
  • the first of the second noise correction sub-units in the preset noise correction unit is taken as the current second noise correction sub-unit, and the first operation processing results corresponding to the first fused coding feature and the text coding feature are taken as the current features to be processed;
  • the current second noise correction sub-unit is used to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain the current second operation processing results corresponding to the first fused coding feature and the text coding feature respectively; and
  • the modified text encoding feature is obtained according to the current second operation processing result.
  • obtaining modified text encoding features based on the current second operation processing result includes:
  • in response to the current second noise correction sub-unit not being the last one, the current second noise correction sub-unit is updated to the next second noise correction sub-unit, the features to be processed are updated to the current second operation processing results, and execution returns to the step of inputting the features to be processed into the current second noise correction sub-unit.
  • the current second operation processing results corresponding to the first fused coding feature and the text coding feature are respectively used as the modified fusion feature and the modified text coding feature.
  • before correcting the preset frame features through the preset target frame correction unit and using the target coding features determined based on the corrected fusion features and the second fused coding features, the method further includes:
  • the first target frame correction subunit is constructed using the self-attention operation unit, feature deletion unit, layer normalization unit, and feature addition unit built based on the preset self-attention mechanism;
  • a preset target frame correction unit is constructed by serially connecting the first target frame correction subunit and a third preset number of second target frame correction subunits.
  • the preset frame features are corrected through the preset target frame correction unit and using the target coding features determined based on the revised fusion features and the second fused coding features, including:
  • the target coding features and the preset frame features are input to the first target frame correction sub-unit in the preset target frame correction unit, so that the target coding features and the preset frame features are each subjected to a self-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation, to obtain third operation processing results corresponding to the target coding features and the preset frame features;
  • the first of the second target frame correction sub-units in the preset target frame correction unit is taken as the current second target frame correction sub-unit, and the third operation processing results corresponding to the target coding features and the preset frame features are taken as the current features to be processed;
  • the current second target frame correction sub-unit is used to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain the current fourth operation processing results corresponding to the target coding features and the preset frame features;
  • the modified frame features are obtained according to the current fourth operation processing result.
  • obtaining modified frame features based on the current fourth operation processing result includes:
  • the current fourth operation processing result is used as the corrected frame feature.
  • before correcting the preset frame features using the target coding features determined based on the corrected fusion features and the second fused coding features, the method further includes:
  • the corrected fusion feature, or the second fused coding feature, or the feature obtained by performing preset operation processing on the corrected fusion feature and the second fused coding feature, is determined as the target coding feature; the preset operation processing includes feature addition or feature splicing of the corrected fusion feature and the second fused coding feature.
  • the formula for feature addition is:
  • f_cat = f_modify + f_denoise
  • where f_modify is the corrected fusion feature, f_denoise is the noise-reduction fusion feature, and f_cat is the output after feature addition.
  • the formula for feature splicing is:
  • f_cat = [f_modify; f_denoise]
  • where f_modify is the corrected fusion feature, f_denoise is the noise-reduction fusion feature, and f_cat is the output after feature splicing (a small sketch of both options follows).
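  • As a minimal illustration of the two combination options named above (the tensor shapes and the splicing axis are assumptions for illustration, not specified by this application):

```python
import torch

# Hypothetical shapes: 60 fused tokens of dimension 768.
f_modify = torch.randn(60, 768)    # corrected fusion feature
f_denoise = torch.randn(60, 768)   # noise-reduction fusion feature

f_cat_add = f_modify + f_denoise                          # feature addition
f_cat_splice = torch.cat([f_modify, f_denoise], dim=0)    # feature splicing (splicing axis assumed)
```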
  • using modified box features to predict the regional position coordinates of the target visual object on the target image includes:
  • input the corrected frame features into the coordinate predictor built based on the first fully connected layer and the second fully connected layer;
  • the first fully connected layer is a fully connected layer used to predict the confidence of the initial target frame, and the second fully connected layer is a fully connected layer used for coordinate regression processing of the initial target frame;
  • the regional position coordinates of the target visual object on the target image are determined based on the confidence of each initial target frame and the coordinates of each initial target frame.
  • determining the regional position coordinates of the target visual object on the target image based on the confidence of each initial target frame and the coordinates of each initial target frame includes:
  • sort the confidences in descending order, select the initial target frame with the highest confidence from the sorted results, and determine the coordinates of the selected initial target frame as the regional position coordinates of the target visual object on the target image.
  • the visual positioning method further includes: creating the preset self-attention mechanism based on a first preset formula;
  • the first preset formula (reconstructed here in the standard scaled dot-product form implied by the listed variables) is:
  • attn_self(f) = softmax((f·W_q)(f·W_k)^T / sqrt(size(f))) · (f·W_v)
  • where f is the input of the preset self-attention mechanism, W_q, W_k and W_v denote mapping matrices, size(f) denotes the feature dimension, attn_self(f) denotes the output of the preset self-attention mechanism, and softmax denotes the activation function.
  • the visual positioning method further includes: creating the preset cross-attention mechanism based on a second preset formula;
  • the second preset formula (likewise reconstructed in the standard form implied by the listed variables) is:
  • attn_cross(f, g) = softmax((f·W_q)(g·W_k)^T / sqrt(size(g))) · (g·W_v)
  • where f and g denote the two input features participating in each cross-attention operation in the cross-attention layers of the preset cross-attention mechanism, size(g) denotes the feature dimension, attn_cross(f, g) denotes the output of the preset cross-attention mechanism, and softmax denotes the activation function (a sketch of both operations follows).
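  • The following PyTorch sketch writes out a single-head version of the two reconstructed operations above; the tensor sizes are assumptions used only for illustration.

```python
import math
import torch
import torch.nn.functional as F


def attn_self(f, W_q, W_k, W_v):
    """Preset self-attention: queries, keys and values all come from f."""
    q, k, v = f @ W_q, f @ W_k, f @ W_v
    scores = q @ k.T / math.sqrt(f.shape[-1])   # scaled by sqrt(size(f))
    return F.softmax(scores, dim=-1) @ v        # softmax activation, then weighted sum of values


def attn_cross(f, g, W_q, W_k, W_v):
    """Preset cross-attention: queries come from f, keys and values from g."""
    q, k, v = f @ W_q, g @ W_k, g @ W_v
    scores = q @ k.T / math.sqrt(g.shape[-1])   # scaled by sqrt(size(g))
    return F.softmax(scores, dim=-1) @ v


# Example with assumed sizes: 10 text tokens, 49 image tokens, dimension 768.
d = 768
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
f, g = torch.randn(10, d), torch.randn(49, d)
out_self = attn_self(f, W_q, W_k, W_v)
out_cross = attn_cross(f, g, W_q, W_k, W_v)
```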
  • This application also provides a visual positioning device, including:
  • the feature splicing module is used to encode the target image and the target text respectively, and perform feature splicing on the image encoding features and text encoding features obtained after encoding to obtain the spliced encoding features;
  • the first feature fusion module is used to use the preset image and text feature fusion unit built based on the preset self-attention mechanism to perform image and text feature fusion processing on the spliced coding features to obtain the first fused coding features;
  • the noise correction module is used to perform image-text noise correction processing on the first fused coding feature and the text coding feature respectively using a preset noise correction unit, to obtain the corrected fusion feature and the corrected text coding feature;
  • the preset noise correction unit is constructed based on the preset self-attention mechanism and the preset cross-attention mechanism;
  • the second feature fusion module is used to input the spliced coding features and the corrected text coding features into the preset image-text feature fusion unit to obtain the second fused coding features; and
  • the target frame correction unit is used to correct the preset frame features through the preset target frame correction unit, using the target coding features determined based on the corrected fusion features and the second fused coding features, and to use the corrected frame features to predict the regional position coordinates of the target visual object on the target image;
  • the preset target frame correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
  • This application also provides an electronic device, including a memory and one or more processors.
  • Computer-readable instructions are stored in the memory; when the computer-readable instructions are executed by the one or more processors, they cause the one or more processors to execute the steps of any of the above visual positioning methods.
  • This application also provides one or more non-volatile computer-readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, they cause the one or more processors to execute the steps of any of the above visual positioning methods.
  • Figure 1 is a flow chart of a visual positioning method disclosed in one or more embodiments of the present application.
  • Figure 2 is a sub-flow chart of a specific visual positioning method disclosed in one or more embodiments of the present application.
  • Figure 3 is a sub-flow chart of a specific visual positioning method disclosed in one or more embodiments of the present application.
  • Figure 4 is a schematic structural diagram of a traditional visual positioning task.
  • Figure 5 is a schematic structural diagram of an anti-noise visual positioning task disclosed in one or more embodiments of the present application.
  • Figure 6 is a flow chart of a traditional visual positioning method.
  • Figure 7 is a flow chart of a specific visual positioning method disclosed in one or more embodiments of the present application.
  • Figure 8 is a schematic structural diagram of a preset image and text feature fusion unit disclosed in one or more embodiments of the present application.
  • Figure 9 is a schematic structural diagram of a preset noise correction unit disclosed in one or more embodiments of the present application.
  • Figure 10 is a schematic structural diagram of a preset target frame correction unit disclosed in one or more embodiments of the present application.
  • Figure 11 is a schematic structural diagram of a coordinate predictor disclosed in one or more embodiments of the present application.
  • Figure 12 is a schematic structural diagram of a visual positioning device disclosed in one or more embodiments of the present application.
  • Figure 13 is a structural diagram of an electronic device disclosed in one or more embodiments of the present application.
  • embodiments of the present application propose a visual positioning solution that can avoid the impact of noise caused by human language text errors on visual positioning and achieve noise-resistant visual positioning.
  • the embodiment of the present application discloses a visual positioning method, as shown in Figure 1.
  • the method includes:
  • Step S11 Encode the target image and the target text respectively, and perform feature splicing on the image encoding features and text encoding features obtained after encoding to obtain the spliced encoding features.
  • the encoder that encodes the target image and the target text can use a classic model.
  • The image encoder that encodes the target image may use a convolutional neural network such as ResNet (Residual Neural Network) or ResNeXt.
  • The text encoder that encodes the target text may use RoBERTa, BERT (Bidirectional Encoder Representations from Transformers), or the like.
  • The image coding features and the text coding features obtained after encoding are spliced to obtain the spliced coding features.
  • In this way, the image coding features and the text coding features can be input as a whole to the next processing unit (a minimal encoding-and-splicing sketch follows).
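  • A minimal sketch of this encoding and splicing step, assuming a ResNet-50 and BERT as the backbones and a simple linear projection to a shared dimension (the projection, checkpoint names and token layout are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import BertModel, BertTokenizer

# Image encoder: a ResNet backbone whose final feature map is flattened into image tokens.
resnet = models.resnet50(weights=None)
image_backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
img_proj = nn.Linear(2048, 768)                                  # project to the text dimension

# Text encoder: BERT.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

image = torch.randn(1, 3, 224, 224)
tokens = tokenizer("the man in the blue shirt", return_tensors="pt")

f_raw = img_proj(image_backbone(image).flatten(2).transpose(1, 2))   # [1, 49, 768] image coding features
g_raw = text_encoder(**tokens).last_hidden_state                      # [1, T, 768] text coding features

spliced = torch.cat([f_raw, g_raw], dim=1)   # spliced coding features, passed on as a whole
```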
  • Step S12 Use a preset image-text feature fusion unit built based on a preset self-attention mechanism to perform image-text feature fusion processing on the spliced coding features to obtain the first fused coding features.
  • Before using the preset image-text feature fusion unit built based on the preset self-attention mechanism to perform image-text feature fusion on the spliced coding features, an image-text feature fusion sub-unit is constructed from a self-attention operation unit built based on the preset self-attention mechanism, a layer normalization unit, a feature deletion unit and a feature addition unit.
  • The preset image-text feature fusion unit is obtained by serially connecting a first preset number of such image-text feature fusion sub-units; the feature deletion unit randomly deletes features according to a certain proportion, which prevents the system from overfitting.
  • The first image-text feature fusion sub-unit in the preset image-text feature fusion unit is taken as the current sub-unit, and the spliced coding features are taken as the features to be processed; the features to be processed are input into the current sub-unit, which sequentially performs a self-attention operation, a layer normalization operation, a feature deletion operation and a feature addition operation to obtain the current operation processing result; it is then judged whether the current sub-unit is the last one; if not, the current sub-unit is updated to the next sub-unit, the features to be processed are updated to the current operation processing result, and the process returns to the step of inputting the features to be processed into the current sub-unit; if so, the current operation processing result is taken as the first fused coding feature.
  • Using the current image-text feature fusion sub-unit to sequentially perform the self-attention, layer normalization, feature deletion and feature addition operations specifically includes: using the self-attention operation unit in the current sub-unit to perform a self-attention operation on the features to be processed, obtaining a first operation feature;
  • using the layer normalization unit in the current sub-unit to perform layer normalization on the first operation feature, obtaining a second operation feature; using the feature deletion unit in the current sub-unit to perform a feature deletion operation on the second operation feature according to a preset ratio, obtaining a third operation feature;
  • and using the feature addition unit in the current sub-unit to perform a feature addition operation on the third operation feature and the features to be processed, obtaining the operation processing result of the current sub-unit (a minimal sketch of one such sub-unit follows).
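  • A minimal PyTorch sketch of one such image-text feature fusion sub-unit and of chaining a first preset number of them in series (the dimension, head count and deletion ratio are assumptions):

```python
import torch
import torch.nn as nn


class FusionSubUnit(nn.Module):
    """Self-attention -> layer normalization -> feature deletion (dropout) -> feature addition (residual)."""

    def __init__(self, dim=768, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(drop)   # random feature deletion, to prevent over-fitting

    def forward(self, x):
        a, _ = self.attn(x, x, x)      # self-attention operation
        a = self.norm(a)               # layer normalization operation
        a = self.drop(a)               # feature deletion operation
        return x + a                   # feature addition operation


# The preset image-text feature fusion unit: a first preset number (here 3) of sub-units in series.
fusion_unit = nn.Sequential(*[FusionSubUnit() for _ in range(3)])
first_fused = fusion_unit(torch.randn(1, 60, 768))   # spliced coding features in, first fused coding features out
```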
  • When the preset image-text feature fusion unit built based on the preset self-attention mechanism performs the first image-text feature fusion on the spliced coding features, this embodiment focuses on describing the image-text matching relationship; that is, based on the image-text matching relationship, the mismatched part between the target image and the target text, i.e. the noise part, is inferred.
  • Step S13: Use a preset noise correction unit to perform image-text noise correction processing on the first fused coding feature and the text coding feature respectively, to obtain the corrected fusion feature and the corrected text coding feature.
  • The preset noise correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
  • Compared with traditional visual positioning methods, this embodiment adds a preset noise correction unit built on the preset self-attention mechanism and the preset cross-attention mechanism, so that image-text noise correction can be performed on the first fused coding feature and the text coding feature based on the preset cross-attention mechanism,
  • which reduces the attention paid to the noisy part of the text. In this way the impact of the noise is weakened and noise-resistant visual positioning is achieved.
  • Before using the preset noise correction unit to perform image-text noise correction on the first fused coding feature and the text coding feature, the method further includes: constructing a first noise correction sub-unit from a self-attention operation unit built based on the preset self-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit;
  • constructing a second noise correction sub-unit from a cross-attention operation unit built based on the preset cross-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit; and constructing the preset noise correction unit by serially connecting the first noise correction sub-unit and a second preset number of second noise correction sub-units (a compact sketch of this composition follows).
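  • A compact sketch of this composition; how the self-attention sub-unit is shared between the two inputs and how both streams are updated in each cross-attention sub-unit are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn


class SelfSubUnit(nn.Module):
    """Self-attention, feature deletion, layer normalization, feature addition."""

    def __init__(self, dim=768, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.drop, self.norm = nn.Dropout(drop), nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + self.norm(self.drop(a))


class CrossSubUnit(nn.Module):
    """Cross-attention, feature deletion, layer normalization, feature addition."""

    def __init__(self, dim=768, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.drop, self.norm = nn.Dropout(drop), nn.LayerNorm(dim)

    def forward(self, q, kv):
        a, _ = self.attn(q, kv, kv)          # queries from one stream, keys/values from the other
        return q + self.norm(self.drop(a))


class NoiseCorrectionUnit(nn.Module):
    """One self-attention sub-unit followed by a second preset number of cross-attention sub-units."""

    def __init__(self, dim=768, num_cross=3):
        super().__init__()
        self.first = SelfSubUnit(dim)
        self.cross = nn.ModuleList([CrossSubUnit(dim) for _ in range(num_cross)])

    def forward(self, fused, text):
        fused, text = self.first(fused), self.first(text)        # first operation processing results
        for sub in self.cross:
            fused, text = sub(fused, text), sub(text, fused)     # current second operation processing results
        return fused, text                                       # corrected fusion / corrected text features
```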
  • Step S14: Input the spliced coding features and the corrected text coding features into the preset image-text feature fusion unit to obtain the second fused coding features.
  • Since the first fusion has already identified the mismatched (noise) part between the target image and the target text, this second pass focuses on fusing the image and text features with the noise part already known.
  • Step S15: Correct the preset frame features through the preset target frame correction unit, using the target coding features determined based on the corrected fusion features and the second fused coding features, and use the corrected frame features to predict the regional position coordinates of the target visual object on the target image.
  • the preset target frame correction unit is a unit built based on a preset self-attention mechanism and a preset cross-attention mechanism.
  • Before the preset frame features are corrected, the corrected fusion feature, or the second fused coding feature, or the feature obtained by performing preset operation processing on the corrected fusion feature and the second fused coding feature, needs to be determined as the target coding feature;
  • the preset operation processing includes feature addition or feature splicing of the corrected fusion feature and the second fused coding feature, after which the target coding feature can be used to correct the preset frame features.
  • Using the corrected frame features to predict the regional position coordinates of the target visual object on the target image specifically includes: inputting the corrected frame features into a coordinate predictor built from a first fully connected layer and a second fully connected layer;
  • the first fully connected layer is a fully connected layer used to predict the confidence of each initial target frame,
  • and the second fully connected layer is a fully connected layer used to perform coordinate regression on the initial target frames;
  • the confidence of each initial target frame is determined using the coordinate predictor and the corrected frame features; the confidences are sorted in descending order, the initial target frame with the highest confidence is selected from the sorted results, and the coordinates of the selected initial target frame are determined as the regional position coordinates of the target visual object on the target image (a minimal sketch follows).
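  • A minimal sketch of such a coordinate predictor; the 4-value box parameterisation and the sizes are assumptions.

```python
import torch
import torch.nn as nn


class CoordinatePredictor(nn.Module):
    """Two fully connected heads over the corrected frame features of size [n, d]."""

    def __init__(self, dim=768):
        super().__init__()
        self.conf_fc = nn.Linear(dim, 1)    # first fully connected layer: confidence of each initial target frame
        self.coord_fc = nn.Linear(dim, 4)   # second fully connected layer: coordinate regression

    def forward(self, frame_feats):
        conf = self.conf_fc(frame_feats).squeeze(-1)   # [n] confidences
        coords = self.coord_fc(frame_feats)            # [n, 4] regressed coordinates
        best = conf.argmax()                           # frame with the highest confidence
        return coords[best]                            # regional position coordinates of the target object


predictor = CoordinatePredictor()
region_coords = predictor(torch.randn(16, 768))        # final output of the visual positioning system
```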
  • Before the preset frame features are corrected through the preset target frame correction unit using the target coding features determined based on the corrected fusion features and the second fused coding features, the method further includes: constructing a first target frame correction sub-unit from a self-attention operation unit built based on the preset self-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit; constructing a second target frame correction sub-unit from a cross-attention operation unit built based on the preset cross-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit; and constructing the preset target frame correction unit by serially connecting the first target frame correction sub-unit and a third preset number of second target frame correction sub-units.
  • In summary, this application discloses a visual positioning method which includes: encoding the target image and the target text respectively, and splicing the resulting image coding features and text coding features to obtain spliced coding features; using a preset image-text feature fusion unit built based on a preset self-attention mechanism to perform image-text feature fusion on the spliced coding features to obtain first fused coding features; and using a preset noise correction unit to perform image-text noise correction on the first fused coding features and the text coding features respectively, to obtain corrected fusion features and corrected text coding features,
  • where the preset noise correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism;
  • inputting the spliced coding features and the corrected text coding features into the preset image-text feature fusion unit to obtain second fused coding features; and, through a preset target frame correction unit and using target coding features determined based on the corrected fusion features and the second fused coding features, correcting preset frame features and using the corrected frame features to predict the regional position coordinates of the target visual object on the target image,
  • where the preset target frame correction unit is a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
  • Thus, this application performs image-text noise correction with a noise correction unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
  • Because, during processing based on the cross-attention mechanism, the parts of the text that differ from the image cannot find a matching relationship in the image, the attention paid to the noisy image-text components is reduced, which weakens the impact of image-text noise on visual positioning accuracy, i.e. noise-resistant visual positioning is achieved.
  • step S13 includes:
  • Step S131: Input the first fused coding feature and the text coding feature to the first noise correction sub-unit in the preset noise correction unit, so that the first fused coding feature
  • and the text coding feature are each subjected to a self-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation, to obtain the first operation processing results corresponding to the first fused coding feature and the text coding feature respectively.
  • Step S132: Take the first of the second noise correction sub-units in the preset noise correction unit as the current second noise correction sub-unit, and take the first operation processing results corresponding to the first fused coding feature and the text coding feature as the current features to be processed.
  • Step S133 Input the features to be processed into the current second noise correction subunit.
  • Step S134: Use the current second noise correction sub-unit to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain the current second operation processing results corresponding to the first fused coding feature and the text coding feature respectively.
  • Step S135 Determine whether the current second noise correction subunit is the last one.
  • Step S136: If not, update the current second noise correction sub-unit to the next second noise correction sub-unit, update the features to be processed to the current second operation processing results, and return to the step of inputting the features to be processed into the current second noise correction sub-unit.
  • Step S137 If yes, use the current second operation processing results corresponding to the first fused coding feature and the text coding feature as the modified fusion feature and the modified text coding feature respectively.
  • In this way, this embodiment performs image-text noise correction on the first fused coding feature and the text coding feature based on the preset self-attention mechanism and the preset cross-attention mechanism, so as to reduce the attention paid to the noisy part of the text, thus weakening the impact of noise and achieving noise-resistant visual positioning.
  • In other words, this application performs image-text noise correction through a noise correction unit built based on the preset self-attention mechanism and the preset cross-attention mechanism; because, during cross-attention processing, the parts of the text that differ from the image cannot find a matching relationship in the image, the attention paid to the image-text noise components is reduced, which weakens the impact of image-text noise on visual positioning accuracy, i.e. achieves noise-resistant visual positioning.
  • The process of correcting the preset frame features in step S15 specifically includes:
  • Step S151: Input the target coding features and the preset frame features to the first target frame correction sub-unit in the preset target frame correction unit, so that the target coding features
  • and the preset frame features are each subjected to a self-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation, to obtain the third operation processing results corresponding to the target coding features and the preset frame features respectively.
  • Step S152: Take the first of the second target frame correction sub-units in the preset target frame correction unit as the current second target frame correction sub-unit, and take the third operation processing results corresponding to the target coding features and the preset frame features
  • as the current features to be processed.
  • Step S153 Input the features to be processed into the current second target frame correction subunit.
  • Step S154: Use the current second target frame correction sub-unit to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain the current fourth operation processing results corresponding to the target coding features and the preset frame features respectively.
  • Step S155 Determine whether the current second target frame correction subunit is the last one.
  • Step S156 If not, update the current second target frame correction subunit to the next second target frame correction subunit, update the feature to be processed to the current fourth operation processing result, and return to execute the The step of inputting the features to be processed into the current second target frame correction subunit.
  • Step S157 If yes, use the current fourth operation processing result as the modified frame feature.
  • The preset target frame correction unit is thus constructed based on the preset self-attention mechanism and the preset cross-attention mechanism; in this way the preset frame features are corrected, and the corrected features are used to predict the regional position coordinates of the target visual object on the target image (a compact sketch follows).
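  • Structurally this is the same self-attention plus cross-attention composition as the noise correction unit, applied to the pair (target coding features, preset frame features); the routing of the two streams below is an assumption.

```python
import torch
import torch.nn as nn


class FrameCorrectionUnit(nn.Module):
    """First target frame correction sub-unit (self-attention) followed by a third preset
    number of second sub-units (cross-attention); only the corrected frame stream is kept."""

    def __init__(self, dim=768, num_cross=3, drop=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=1, batch_first=True) for _ in range(num_cross)])
        self.drop, self.norm = nn.Dropout(drop), nn.LayerNorm(dim)

    def _block(self, attn, q, kv):
        a, _ = attn(q, kv, kv)
        return q + self.norm(self.drop(a))    # feature deletion, layer normalization, feature addition

    def forward(self, target_feats, frame_feats):
        target_feats = self._block(self.self_attn, target_feats, target_feats)
        frame_feats = self._block(self.self_attn, frame_feats, frame_feats)
        for attn in self.cross_attn:
            frame_feats = self._block(attn, frame_feats, target_feats)   # frames attend to the fused features
        return frame_feats                                                # corrected frame features
```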
  • Figure 4 is a schematic structural diagram of a traditional visual positioning task when the human language is correct.
  • Figure 5 is a schematic structural diagram of the anti-noise visual positioning task disclosed in this application when the human language has errors.
  • Figure 6 is a flow chart of a traditional visual positioning method.
  • In the traditional pipeline, an image encoder and a text encoder encode the input image and text, the encoded image features and text features are spliced, and the spliced result is fusion-encoded to obtain fused coding features; a set of preset boxes is set to simulate box information,
  • with preset box features of size [n, d], where n is the number of preset boxes and d is the feature dimension (a one-line sketch follows);
  • the fused coding features and the preset box features are cross-coded to correct the preset box features; the corrected box information is used for confidence prediction and coordinate regression, and the box coordinates with the highest confidence are taken as the final output of the visual positioning system.
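  • In code, the preset box features can simply be a learnable tensor of that size (n and d here are assumed values):

```python
import torch
import torch.nn as nn

n, d = 16, 768                                          # n preset boxes, feature dimension d
preset_frame_feats = nn.Parameter(torch.randn(n, d))    # preset frame features of size [n, d]
```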
  • Figure 7 is a flow chart of a specific anti-noise visual positioning method disclosed in this application.
  • this application adds a noise correction unit to achieve the purpose of anti-noise, thereby realizing the function of anti-noise visual positioning.
  • the visual positioning method process of this application specifically includes image and text encoders, a preset image and text feature fusion unit, a preset noise correction unit, a cross-coding unit, and a coordinate predictor.
  • The preset image-text feature fusion unit, also called the fusion encoder, is used to fuse image and text features and to learn the matching relationship between images and text.
  • This module has two functions. On the one hand, it can be used to encode the relationship between the image and the text, so that when the text is noisy the encoder can encode the differences between the image and the text, which is ultimately used to generate the image-text matching relationship. On the other hand, it can be used to fuse image and text features, which is meaningful for the visual positioning task itself, so it can also be placed as an encoder before the final positioning module. This module appears twice in the system of this application: the first pass focuses on describing the image-text matching relationship, and the second pass is used to fuse image and text features. Since the two functions do not conflict, the weights of the two passes are shared in order to save computing power.
  • the schematic structural diagram of the preset image and text feature fusion unit designed in this application is shown in Figure 8.
  • The image coding feature f_raw and the text coding feature g_raw are spliced; they are then input into a first preset number of image-text feature fusion sub-units for coding, and the final fused coding feature is obtained.
  • Each image and text feature fusion subunit includes a self-attention layer, a layer normalization layer, a random deletion layer, and an addition module.
  • In the self-attention layer, a preset self-attention mechanism is created based on the first preset formula.
  • The first preset formula (reconstructed in the standard scaled dot-product form implied by the listed variables) is:
  • attn_self(f) = softmax((f·W_q)(f·W_k)^T / sqrt(size(f))) · (f·W_v)
  • where f is the input of each self-attention operation unit, W_q, W_k and W_v denote the mapping matrices, and size(f) denotes the feature dimension.
  • the random deletion layer is used to randomly delete features according to a certain proportion.
  • the function of this layer is to prevent the system from overfitting.
  • The preset noise correction unit, also called the correction module, has the main function of repairing noise, which is the key step of the noise reduction process. Its inputs are the fused coding features output by the preset image-text feature fusion unit and the text coding features output by the text encoder; its outputs are the corrected fusion features and the corrected text coding features, as shown in Figure 9.
  • the preset noise correction unit includes a first noise correction sub-unit and a second preset number of second noise correction sub-units.
  • The first noise correction sub-unit includes a self-attention layer, a layer normalization layer, a random deletion layer and an addition module, where in the self-attention layer a preset self-attention mechanism is created based on the first preset formula;
  • the second noise correction sub-unit includes a cross-attention layer, a layer normalization layer, a random deletion layer and an addition module, where in the cross-attention layer a preset cross-attention mechanism is created based on the second preset formula.
  • The second preset formula (reconstructed in the standard form implied by the listed variables) is:
  • attn_cross(f, g) = softmax((f·W_q)(g·W_k)^T / sqrt(size(g))) · (g·W_v), where f and g denote the two input features participating in each cross-attention operation in the cross-attention layer, and size(g) denotes the feature dimension.
  • The preset noise correction unit weakens the noise between the image and the text through the designed cross-attention mechanism. For example, in Figure 5 the attention paid to the word "red" in the text is low because it cannot find a corresponding match in the image; after several stacked layers the model pays less and less attention to "red", until the "red" noise is no longer enough to affect model performance. This module is therefore designed around cross-attention layers and realizes the noise reduction function through repeated stacking.
  • The preset target frame correction unit, also called the cross-coding module, is used to correct the preset frame features, and the corrected features are used to predict the regional position coordinates of the target visual object on the target image.
  • The preset target frame correction unit includes a first target frame correction sub-unit and a third preset number of second target frame correction sub-units.
  • The first target frame correction sub-unit includes a self-attention layer, a layer normalization layer, a random deletion layer and an addition module, where in the self-attention layer a preset self-attention mechanism is created based on the first preset formula;
  • the second target frame correction sub-unit includes a cross-attention layer, a layer normalization layer, a random deletion layer and an addition module,
  • where in the cross-attention layer a preset cross-attention mechanism is created based on the second preset formula.
  • For the input of this cross-coding module, two fused features have visual-positioning potential: the corrected fusion feature f_modify output by the correction module, and the noise-reduction fusion feature f_denoise obtained by encoding the corrected text features output by the correction module together with the image coding features once more with the fusion encoder. This application therefore provides three cross-coding input settings: (1) use f_modify; (2) use f_denoise; (3) splice or add the two features,
  • for example f_cat = f_modify + f_denoise.
  • the coordinate predictor is constructed based on the first fully connected layer and the second fully connected layer. See Figure 11.
  • The first fully connected layer is a fully connected layer used to predict the confidence of each initial target frame;
  • the second fully connected layer is a fully connected layer used to perform coordinate regression on the initial target frames.
  • The coordinate predictor and the corrected frame features are used to determine the confidence of each initial target frame; the confidences are sorted in descending order, the initial target frame with the highest confidence is selected from the sorted results,
  • and the coordinates of the selected initial target frame are determined as the regional position coordinates of the target visual object on the target image.
  • That is, among the corrected frame features of size [n, d], the box k with the highest confidence is found,
  • and the k-th coordinate is output as the final output of the visual positioning system.
  • the embodiment of the present application also discloses a visual positioning device, as shown in Figure 12.
  • the device includes:
  • the feature splicing module 11 is used to encode the target image and the target text respectively, and perform feature splicing on the image encoding features and text encoding features obtained after encoding to obtain the spliced encoding features;
  • the first feature fusion module 12 is configured to use a preset image and text feature fusion unit built based on a preset self-attention mechanism to perform image and text feature fusion processing on the spliced coding features to obtain the first fused coding features;
  • the noise correction module 13 is configured to use a preset noise correction unit to perform image-text noise correction on the first fused coding feature and the text coding feature respectively, to obtain the corrected fusion feature and the corrected text coding feature,
  • where the preset noise correction unit is constructed based on the preset self-attention mechanism and the preset cross-attention mechanism;
  • the second feature fusion module 14 is used to input the spliced coding features and the corrected text coding features into the preset image-text feature fusion unit to obtain the second fused coding features; and
  • the target frame correction unit 15 is configured to correct the preset frame features through the preset target frame correction unit, using the target coding features determined based on the corrected fusion features and the second fused coding features, and to use the corrected frame features to predict the regional position coordinates of the target visual object on the target image.
  • the preset target frame correction unit is a unit built based on a preset self-attention mechanism and a preset cross-attention mechanism.
  • FIG. 13 is a structural diagram of the electronic device 20 according to an exemplary embodiment. The content in the figure cannot be considered as any limitation on the scope of the present application.
  • FIG. 13 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application.
  • the electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a display screen 23, an input and output interface 24, a communication interface 25, a power supply 26, and a communication bus 27.
  • the memory 22 is used to store computer-readable instructions, which are loaded and executed by the processor 21 to implement relevant steps in the visual positioning method disclosed in any of the foregoing embodiments.
  • the electronic device 20 in this embodiment may specifically be an electronic computer.
  • The power supply 26 is used to provide the working voltage for each hardware device on the electronic device 20; the communication interface 25 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any communication protocol applicable to the technical solution of this application, which is not specifically limited here; the input/output interface 24 is used to obtain external input data or to output data to the outside, and its specific interface type can be selected according to the specific application, which is likewise not specifically limited here.
  • The memory 22, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc.
  • The resources stored on it may include the computer-readable instructions 221, and the storage may be temporary or permanent.
  • The computer-readable instructions 221 may further include computer-readable instructions that can be used to complete other specific tasks.
  • Embodiments of the present application also disclose one or more non-volatile computer-readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors,
  • the visual positioning method disclosed above is implemented.
  • The storage medium may be a random access memory (RAM),
  • a read-only memory (ROM),
  • an electrically programmable ROM, an electrically erasable programmable ROM,
  • a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

This application discloses a visual positioning method, apparatus, device and medium, relating to the field of artificial intelligence. The method includes: splicing image coding features and text coding features; performing feature fusion on the spliced coding features to obtain first fused coding features; performing noise correction on the first fused coding features and the text coding features respectively based on a preset cross-attention mechanism, to obtain corrected fusion features and corrected text coding features; performing feature fusion on the spliced coding features and the corrected text coding features to obtain second fused coding features; and correcting preset frame features using target coding features determined from the corrected fusion features and the second fused coding features, so as to predict the regional position coordinates of the target visual object. This application thus corrects image-text noise based on a preset cross-attention mechanism; by reducing the attention paid to the noisy part of the text, the impact of the noise is weakened and noise-resistant visual positioning is achieved.

Description

Visual positioning method, apparatus, device and medium
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application filed with the China Patent Office on April 19, 2022 with application number 202210407177.8 and entitled "Visual positioning method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a visual positioning method, apparatus, device and medium.
Background
In recent years, multi-modality (Multi-Modal, MM) has become a very important research direction in artificial intelligence. Because it emphasizes the fusion of visual, textual, speech and other information, multi-modal algorithms keep emerging: methods based on convolutional neural networks (Convolutional Neural Networks, CNN) and attention mechanisms each have a wide range of applications and have become mainstream in fields such as Visual Commonsense Reasoning (VCR), Visual Question Answering (VQA) and Visual Grounding (VG).
The visual positioning task is one of the important research directions of multi-modal artificial intelligence; it aims to locate the object in a picture referred to by a description and to give the coordinate position of that object. However, existing visual positioning still faces problems in deployment that are easily overlooked in "laboratory" research, such as errors appearing in the text, also called noise. Text errors are text distortions caused by human factors: slips of the tongue, subjective bias when describing an object, and ambiguity of the description sentence all lead to text errors. Such errors are very common in daily life but are easily ignored during AI algorithm design, which becomes a barrier between existing methods and their deployment. In short, when the input text contains certain errors, it is difficult for existing methods to find and locate the object that the sentence actually intends to describe.
Therefore, avoiding the impact on visual positioning of the noise produced by errors in human language text and achieving noise-resistant visual positioning is an urgent problem to be solved in this field.
Summary
This application provides a visual positioning method, including:
encoding a target image and a target text respectively, and performing feature splicing on the image coding features and text coding features obtained after encoding, to obtain spliced coding features;
performing image-text feature fusion processing on the spliced coding features using a preset image-text feature fusion unit built based on a preset self-attention mechanism, to obtain first fused coding features;
performing image-text noise correction processing on the first fused coding features and the text coding features respectively using a preset noise correction unit, to obtain corrected fusion features and corrected text coding features, the preset noise correction unit being a unit built based on the preset self-attention mechanism and a preset cross-attention mechanism;
inputting the spliced coding features and the corrected text coding features into the preset image-text feature fusion unit to obtain second fused coding features; and
correcting preset frame features through a preset target frame correction unit and using target coding features determined based on the corrected fusion features and the second fused coding features, and using the corrected frame features to predict the regional position coordinates of a target visual object on the target image, the preset target frame correction unit being a unit built based on the preset self-attention mechanism and the preset cross-attention mechanism.
In one or more embodiments, before performing image-text feature fusion processing on the spliced coding features using the preset image-text feature fusion unit built based on the preset self-attention mechanism, the method further includes:
constructing an image-text feature fusion sub-unit from a self-attention operation unit built based on the preset self-attention mechanism, a layer normalization unit, a feature deletion unit and a feature addition unit; and
constructing the preset image-text feature fusion unit by serially connecting a first preset number of the image-text feature fusion sub-units;
correspondingly, performing image-text feature fusion processing on the spliced coding features using the preset image-text feature fusion unit built based on the preset self-attention mechanism to obtain the first fused coding features includes:
taking the first image-text feature fusion sub-unit in the preset image-text feature fusion unit as the current image-text feature fusion sub-unit, and taking the spliced coding features as the features to be processed;
inputting the features to be processed into the current image-text feature fusion sub-unit;
using the current image-text feature fusion sub-unit to sequentially perform a self-attention operation, a layer normalization operation, a feature deletion operation and a feature addition operation on the features to be processed, to obtain the corresponding current operation processing result; and
obtaining the first fused coding features according to the current operation processing result.
In one or more embodiments, obtaining the first fused coding features according to the current operation processing result includes:
judging whether the current image-text feature fusion sub-unit is the last one;
in response to the current image-text feature fusion sub-unit not being the last one, updating the current image-text feature fusion sub-unit to the next image-text feature fusion sub-unit, updating the features to be processed to the current operation processing result, and returning to the step of inputting the features to be processed into the current image-text feature fusion sub-unit; and
in response to the current image-text feature fusion sub-unit being the last one, taking the current operation processing result as the first fused coding features.
In one or more embodiments, using the current image-text feature fusion sub-unit to sequentially perform the self-attention operation, layer normalization operation, feature deletion operation and feature addition operation on the features to be processed to obtain the corresponding current operation processing result includes:
using the self-attention operation unit in the current image-text feature fusion sub-unit to perform a self-attention operation on the features to be processed, to obtain a first operation feature;
using the layer normalization unit in the current image-text feature fusion sub-unit to perform layer normalization on the first operation feature, to obtain a second operation feature;
using the feature deletion unit in the current image-text feature fusion sub-unit to perform a feature deletion operation on the second operation feature according to a preset ratio, to obtain a third operation feature; and
using the feature addition unit in the current image-text feature fusion sub-unit to perform a feature addition operation on the third operation feature and the features to be processed, to obtain the operation processing result of the current image-text feature fusion sub-unit.
In one or more embodiments, before performing image-text noise correction processing on the first fused coding features and the text coding features respectively using the preset noise correction unit, the method further includes:
constructing a first noise correction sub-unit from a self-attention operation unit built based on the preset self-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit;
constructing a second noise correction sub-unit from a cross-attention operation unit built based on the preset cross-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit; and
constructing the preset noise correction unit by serially connecting the first noise correction sub-unit and a second preset number of the second noise correction sub-units.
In one or more embodiments, performing image-text noise correction processing on the first fused coding features and the text coding features respectively using the preset noise correction unit to obtain the corrected fusion features and the corrected text coding features includes:
inputting the first fused coding features and the text coding features into the first noise correction sub-unit in the preset noise correction unit, so that a self-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation are performed on each of them, to obtain first operation processing results corresponding to the first fused coding features and the text coding features respectively;
taking the first of the second noise correction sub-units in the preset noise correction unit as the current second noise correction sub-unit, and taking the first operation processing results corresponding to the first fused coding features and the text coding features as the current features to be processed;
inputting the features to be processed into the current second noise correction sub-unit;
using the current second noise correction sub-unit to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain current second operation processing results corresponding to the first fused coding features and the text coding features respectively; and
obtaining the corrected text coding features according to the current second operation processing result.
In one or more embodiments, obtaining the corrected text coding features according to the current second operation processing result includes:
judging whether the current second noise correction sub-unit is the last one;
in response to the current second noise correction sub-unit not being the last one, updating the current second noise correction sub-unit to the next second noise correction sub-unit, updating the features to be processed to the current second operation processing result, and returning to the step of inputting the features to be processed into the current second noise correction sub-unit; and
in response to the current second noise correction sub-unit being the last one, taking the current second operation processing results corresponding to the first fused coding features and the text coding features as the corrected fusion features and the corrected text coding features respectively.
In one or more embodiments, before correcting the preset frame features through the preset target frame correction unit and using the target coding features determined based on the corrected fusion features and the second fused coding features, the method further includes:
constructing a first target frame correction sub-unit from a self-attention operation unit built based on the preset self-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit;
constructing a second target frame correction sub-unit from a cross-attention operation unit built based on the preset cross-attention mechanism, a feature deletion unit, a layer normalization unit and a feature addition unit; and
constructing the preset target frame correction unit by serially connecting the first target frame correction sub-unit and a third preset number of the second target frame correction sub-units.
In one or more embodiments, correcting the preset frame features through the preset target frame correction unit and using the target coding features determined based on the corrected fusion features and the second fused coding features includes:
inputting the target coding features and the preset frame features into the first target frame correction sub-unit in the preset target frame correction unit, so that a self-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation are performed on each of them, to obtain third operation processing results corresponding to the target coding features and the preset frame features respectively;
taking the first of the second target frame correction sub-units in the preset target frame correction unit as the current second target frame correction sub-unit, and taking the third operation processing results corresponding to the target coding features and the preset frame features as the current features to be processed;
inputting the features to be processed into the current second target frame correction sub-unit;
using the current second target frame correction sub-unit to sequentially perform a cross-attention operation, a feature deletion operation, a layer normalization operation and a feature addition operation on the features to be processed, to obtain current fourth operation processing results corresponding to the target coding features and the preset frame features respectively; and
obtaining the corrected frame features according to the current fourth operation processing result.
In one or more embodiments, obtaining the corrected frame features according to the current fourth operation processing result includes:
judging whether the current second target frame correction sub-unit is the last one;
in response to the current second target frame correction sub-unit not being the last one, updating the current second target frame correction sub-unit to the next second target frame correction sub-unit, updating the features to be processed to the current fourth operation processing result, and returning to the step of inputting the features to be processed into the current second target frame correction sub-unit; and
in response to the current second target frame correction sub-unit being the last one, taking the current fourth operation processing result as the corrected frame features.
In one or more embodiments, before correcting the preset frame features using the target coding features determined based on the corrected fusion features and the second fused coding features, the method further includes:
determining, as the target coding features, the corrected fusion features, or the second fused coding features, or the features obtained by performing preset operation processing on the corrected fusion features and the second fused coding features; where the preset operation processing includes feature addition or feature splicing of the corrected fusion features and the second fused coding features.
In one or more embodiments, the formula for feature addition is:
f_cat = f_modify + f_denoise
where f_modify is the corrected fusion feature, f_denoise is the noise-reduction fusion feature, and f_cat is the output after feature addition.
In one or more embodiments, the formula for feature splicing is:
f_cat = [f_modify; f_denoise]
where f_modify is the corrected fusion feature, f_denoise is the noise-reduction fusion feature, and f_cat is the output after feature splicing.
In one or more embodiments, using the corrected frame features to predict the regional position coordinates of the target visual object on the target image includes:
inputting the corrected frame features into a coordinate predictor built from a first fully connected layer and a second fully connected layer; the first fully connected layer is a fully connected layer used to predict the confidence of each initial target frame, and the second fully connected layer is a fully connected layer used to perform coordinate regression on the initial target frames;
determining the confidence of each initial target frame using the coordinate predictor and the corrected frame features; and
determining the regional position coordinates of the target visual object on the target image according to the confidence of each initial target frame and the coordinates of each initial target frame.
In one or more embodiments, determining the regional position coordinates of the target visual object on the target image according to the confidence of each initial target frame and the coordinates of each initial target frame includes:
sorting the confidences in descending order, selecting the initial target frame with the highest confidence from the sorted results, and determining the coordinates of the selected initial target frame as the regional position coordinates of the target visual object on the target image.
In one or more embodiments, the visual positioning method further includes:
creating the preset self-attention mechanism based on a first preset formula;
where the first preset formula (the original formula image is unavailable here; it is reconstructed in the standard scaled dot-product form implied by the listed variables) is:
attn_self(f) = softmax((f·W_q)(f·W_k)^T / sqrt(size(f))) · (f·W_v)
where f is the input of each preset self-attention mechanism, W_q, W_k and W_v denote mapping matrices, size(f) denotes the feature dimension, attn_self(f) denotes the output of the preset self-attention mechanism, and softmax denotes the activation function.
In one or more embodiments, the visual positioning method further includes:
creating the preset cross-attention mechanism based on a second preset formula;
where the second preset formula (likewise reconstructed in the standard form implied by the listed variables) is:
attn_cross(f, g) = softmax((f·W_q)(g·W_k)^T / sqrt(size(g))) · (g·W_v)
where f and g denote the two input features participating in each cross-attention operation in the cross-attention layers of the preset cross-attention mechanism, size(g) denotes the feature dimension, attn_cross(f, g) denotes the output of the preset cross-attention mechanism, and softmax denotes the activation function.
本申请还提供一种视觉定位装置,包括:
特征拼接模块,用于对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;
第一特征融合模块,用于利用基于预设自注意力机制构建的预设图文特征融合单元对拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;
噪声修正模块,用于利用预设噪声修正单元分别对第一融合后编码特征以及文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,预设噪声修正单元基于预设自注意力机制和预设跨注意力机制构建而成;
第二特征融合模块,用于将拼接后编码特征以及修正后文本编码特征输入至预设图文特征融合单元以得到第二融合后编码特征;和
目标框修正单元,用于通过预设目标框修正单元,并利用基于修正后融合特征和第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在目标图像上的区域位置坐标,预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
本申请还提供一种电子设备,包括存储器及一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述任一项一种视觉定位方法的步 骤。
本申请还提供一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述任一项所述的视觉定位方法的步骤。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请一个或多个实施例公开的一种视觉定位方法流程图;
图2为本申请一个或多个实施例公开的一种具体的视觉定位方法子流程图;
图3为本申请一个或多个实施例公开的一种具体的视觉定位方法子流程图;
图4为一种传统的视觉定位任务结构示意图;
图5为本申请一个或多个实施例公开的一种抗噪视觉定位任务结构示意图;
图6为一种传统的视觉定位方法流程图;
图7为本申请一个或多个实施例公开的一种具体的视觉定位方法流程图;
图8为本申请一个或多个实施例公开的一种预设图文特征融合单元结构示意图;
图9为本申请一个或多个实施例公开的一种预设噪声修正单元结构示意图;
图10为本申请一个或多个实施例公开的一种预设目标框修正单元结构示意图;
图11为本申请一个或多个实施例公开的一种坐标预测器结构示意图;
图12为本申请一个或多个实施例公开的一种视觉定位装置结构示意图;
图13为本申请一个或多个实施例公开的一种电子设备结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在视觉定位任务中,当输入的文本存在某些错误的情况下,现有方法很难找到这句话本身想描述的物体并定位。
为此,本申请实施例提出一种视觉定位方案,能够避免因人类语言文本错误而产生的噪声对视觉定位的影响,实现抗噪视觉定位。
本申请实施例公开了一种视觉定位方法,参见图1所示,该方法包括:
步骤S11:对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征。
本实施例中,对目标图像以及目标文本进行编码的编码器可以采用经典模型,例如:对目标图像进行编码的图像编码器可以采用卷积神经网络ResNet(Residual Neural Network)、ResNext等,对目标文本进行编码的文本编码器可以采用Roberta、Bert(Bidirectional Encoder Representations from Transformers)等。
本实施例中,将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征,如此一来,可以将所述图像编码特征以及所述文本编码特征作为一个整体输入至下一处理单元。
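作为示意,下面给出利用现有开源模型对图像和文本分别编码并进行特征拼接的一个简化示例(基于较新版本的torchvision与transformers库;模型选择、特征取法与维度对齐方式均为示例性假设,实际实现中图像特征通常为空间网格形式的特征序列):

```python
import torch
from torchvision.models import resnet50
from transformers import RobertaTokenizer, RobertaModel

image = torch.randn(1, 3, 224, 224)            # 目标图像(示例张量)
text = "the girl in the white dress"           # 目标文本(示例)

# 图像编码:取 ResNet 去掉分类头后的全局特征
cnn = resnet50(weights=None)
cnn.fc = torch.nn.Identity()
img_feat = cnn(image)                          # [1, 2048]

# 文本编码:取 Roberta 最后一层隐状态
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
tokens = tokenizer(text, return_tensors="pt")
txt_feat = encoder(**tokens).last_hidden_state  # [1, L, 768]

# 将图像特征映射到与文本编码特征相同的维度后,沿序列维度拼接得到拼接后编码特征
img_proj = torch.nn.Linear(2048, 768)
f_raw = img_proj(img_feat).unsqueeze(1)         # [1, 1, 768]
g_raw = txt_feat
concat_feat = torch.cat([f_raw, g_raw], dim=1)  # [1, 1+L, 768]
```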
步骤S12:利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征。
本实施例中,在利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理之前,需要利用基于预设自注意力机制构建的自注意力运算单元、层归一化单元、特征删除单元以及特征相加单元构建图文特征融合子单元;通过将第一预设数量的所述图文特征融合子单元进行依次串接,以构建得到所述预设图文特征融合单元,所述特征删除单元用来按一定比例对特征进行随机删除,如此一来,可以防止系统过拟合。
本实施例中,在构建得到所述预设图文特征融合单元之后,将所述预设图文特征融合单元中的第一个图文特征融合子单元作为当前图文特征融合子单元,并将所述拼接后编码特征作为待处理特征;将所述待处理特征输入至当前图文特征融合子单元中;利用当前图文特征融合子单元对所述待处理特征依次进行自注意力运算、层归一化运算、特征删除运算以及特征相加运算,以得到相应的当前运算处理结果;判断当前图文特征融合子单元是否为最后一个;若否,则将当前图文特征融合子单元更新为下一个图文特征融合子单元,将所述待处理特征更新为当前所述运算处理结果,并返回执行所述将所述待处理特征输入至当前图文特征融合子单元中的步骤;若是,则将当前所述运算处理结果作为所述第一融合后编码特征。
需要指出的是,利用当前图文特征融合子单元对所述待处理特征依次进行自注意力运算、层归一化运算、特征删除运算以及特征相加运算,以得到相应的运算处理结果的过程具体包括:利用当前图文特征融合子单元中的所述自注意力运算单元对所述待处理特征进行自注意力运算,得到第一运算特征;利用当前图文特征融合子单元中的所述层归一化单元对所述第一运算特征进行层归一化处理,得到第二运算特征;利用当前图文特征融合子单元中的所述特征删除单元并按照预设比例对所述第二运算特征进行特征删除运算,以得到第三运算特征;利用当前图文特征融合子单元中的所述特征相加单元对所述第三运算特征与所述待处理特征进行特征相加运算,以得到当前图文特征融合子单元中的所述运算处理结果。
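下面给出图文特征融合子单元及其依次串接方式的一个示意性实现(基于PyTorch;子单元数量、特征维度、特征删除比例等均为示例性假设,并非本申请的正式实现):

```python
import torch
import torch.nn as nn

class FusionSubUnit(nn.Module):
    # 图文特征融合子单元:自注意力 -> 层归一化 -> 特征删除(Dropout) -> 与输入特征相加
    def __init__(self, dim=768, heads=8, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # 自注意力运算,得到第一运算特征
        out = self.norm(attn_out)          # 层归一化运算,得到第二运算特征
        out = self.drop(out)               # 按预设比例进行特征删除运算,得到第三运算特征
        return out + x                     # 特征相加运算,得到本子单元的运算处理结果

# 将第一预设数量(此处假设为 6)的图文特征融合子单元依次串接,构成预设图文特征融合单元
fusion_unit = nn.Sequential(*[FusionSubUnit() for _ in range(6)])

x = torch.randn(2, 20, 768)     # 拼接后编码特征(batch=2,序列长度=20,维度=768,示例)
fused_1 = fusion_unit(x)        # 第一融合后编码特征,形状与输入相同
```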
需要指出的是,本实施例在利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行第一次图文特征融合处理时,侧重于描述图文匹配关系,也即,基于所述图文匹配关系推测目标图像与目标文本之间不匹配的部分,也即噪声部分。
步骤S13:利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
需要指出的是,相较于传统的视觉定位方法,本实施例新增了基于预设自注意力机制和预设跨注意力机制构建的预设噪声修正单元,由此可以基于预设跨注意力机制对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以降低对文本中噪声部分的关注度,如此一来,削弱了噪声的影响,实现了抗噪视觉定位。本实施例中,利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理之前,还包括:利用基于预设自注意力机制构建的自注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第一噪声修正子单元;利用基于预设跨注意力机制构建的跨注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第二噪声修正子单元;通过将所述第一噪声修正子单元以及第二预设数量的所述第二噪声修正子单元进行依次串接,以构建得到所述预设噪声修正单元。
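下面给出预设噪声修正单元的一个示意性实现(基于PyTorch;第二噪声修正子单元的数量、两路特征在跨注意力中的交互方式等均为示例性假设,具体以正文描述为准):

```python
import torch
import torch.nn as nn

class SelfSubUnit(nn.Module):
    # 第一噪声修正子单元:自注意力 -> 特征删除 -> 层归一化 -> 特征相加
    def __init__(self, dim=768, heads=8, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.drop = nn.Dropout(drop)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.norm(self.drop(out)) + x

class CrossSubUnit(nn.Module):
    # 第二噪声修正子单元:跨注意力 -> 特征删除 -> 层归一化 -> 特征相加
    def __init__(self, dim=768, heads=8, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.drop = nn.Dropout(drop)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f, g):
        out, _ = self.attn(f, g, g)        # f 作为 query,g 作为 key/value
        return self.norm(self.drop(out)) + f

class NoiseCorrectUnit(nn.Module):
    # 预设噪声修正单元:一个第一噪声修正子单元串接第二预设数量(此处假设为 4)的第二噪声修正子单元
    def __init__(self, dim=768, num_cross=4):
        super().__init__()
        self.self_unit = SelfSubUnit(dim)
        self.cross_units = nn.ModuleList([CrossSubUnit(dim) for _ in range(num_cross)])

    def forward(self, fused, text):
        fused, text = self.self_unit(fused), self.self_unit(text)
        for unit in self.cross_units:
            # 两路特征互为 query / key-value(交互方式为示例性假设)
            fused, text = unit(fused, text), unit(text, fused)
        return fused, text                 # 修正后融合特征、修正后文本编码特征

noise_unit = NoiseCorrectUnit()
fused_1, g_raw = torch.randn(2, 21, 768), torch.randn(2, 20, 768)  # 示例输入
f_modify, g_modify = noise_unit(fused_1, g_raw)
```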
步骤S14:将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征。
本实施例中,在利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征以及所述修正后文本编码特征进行第二次图文特征融合处理时,侧重于融合图文特征,由前述公开的内容可知,在第一次图文特征融合时,确定了目标图像与目标文本之间不匹配的部分,也即噪声部分,因此本实施例在已知噪声部分的前提下侧重于融合图文特征。
步骤S15:通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
本实施例中,在对预设框特征进行修正处理之前,需要将所述修正后融合特征或所述第二融合后编码特征或所述修正后融合特征与所述第二融合后编码特征进行预设运算处理后得到的特征确定为所述目标编码特征;其中,所述预设运算处理包括所述修正后融合特征与所述第二融合后编码特征进行特征相加或特征拼接,如此一来,可以利用所述目标编码特征对预设框特征进行修正处理。
本实施例中,利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标具体包括:将所述修正后框特征输入至基于第一全连接层和第二全连接层构建的坐标预测器;所述第一全连接层为用于预测初始目标框的置信度的全连接层,第二全连接层为用于对所述初始目标框进行坐标回归处理的全连接层;利用所述坐标预测器以及所述修正后框特征,确定每个所述初始目标框的置信度;对所述置信度进行降序排序,然后从降序排序结果中筛选置信度最高的所述初始目标框,并将筛选出的所述初始目标框的坐标确定为目标视觉物体在所述目标图像上的区域位置坐标。
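下面给出坐标预测器的一个示意性实现(基于PyTorch;特征维度、坐标表示形式等均为示例性假设):

```python
import torch
import torch.nn as nn

class CoordPredictor(nn.Module):
    # 坐标预测器:第一全连接层预测每个初始目标框的置信度,第二全连接层对初始目标框进行坐标回归
    def __init__(self, dim=768):
        super().__init__()
        self.conf_fc = nn.Linear(dim, 1)    # 置信度预测
        self.coord_fc = nn.Linear(dim, 4)   # 坐标回归

    def forward(self, box_feat):            # box_feat: [n, d] 的修正后框特征
        conf = self.conf_fc(box_feat).squeeze(-1)   # 每个初始目标框的置信度,[n]
        coords = self.coord_fc(box_feat)            # 每个初始目标框的坐标,[n, 4]
        best = torch.argmax(conf)                   # 置信度最高的初始目标框索引
        return coords[best], conf                   # 区域位置坐标与全部置信度

predictor = CoordPredictor()
box_feat = torch.randn(100, 768)            # 修正后框特征(示例)
final_box, conf = predictor(box_feat)
```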
本实施例中,在通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理之前,还包括:利用基于预设自注意力机制构建的自注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第一目标框修正子单元;利用基于预设跨注意力机制构建的跨注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第二目标框修正子单元;通过将所述第一目标框修正子单元以及第三预设数量的所述第二目标框修正子单元进行依次串接,以构建得到所述预设目标框修正单元。
可见,本申请公开了一种视觉定位方法,包括:对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元;将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征;通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元,可见,本申请是通过基于预设自注意力机制和预设跨注意力机制构建的噪声修正单元来进行图文噪声修正处理的,由于在基于跨注意力机制进行处理的过程中,文本相对于图像的差异在图像中无法找到匹配关系,从而降低了对图文噪声成分的关注度,由此削弱了图文噪声对视觉定位准确度的影响,也即实现了抗噪视觉定位。
进一步的,本实施例针对前述实施例步骤S13中的利用预设噪声修正单元分别对第一融合后编码特征以及文本编码特征进行图文噪声修正处理的过程,进行详细的介绍和说明。具体的,参见图2所示,上述步骤S13包括:
步骤S131:将所述第一融合后编码特征以及所述文本编码特征输入至所述预设噪声修正单元中的所述第一噪声修正子单元,以便对所述第一融合后编码特征以及所述文本编码特征均分别进行自注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述第一融合后编码特征以及所述文本编码特征各自对应的第一运算处理结果。
步骤S132:将所述预设噪声修正单元中的第一个第二噪声修正子单元作为当前第二噪声修正子单元,并将所述第一融合后编码特征以及所述文本编码特征各自对应的第一运算处理结果均作为当前的待处理特征。
步骤S133:将所述待处理特征输入至当前第二噪声修正子单元中。
步骤S134:利用当前第二噪声修正子单元对所述待处理特征依次进行跨注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述第一融合后编码特征以及所述文本编码特征各自对应的当前第二运算处理结果。
步骤S135:判断当前第二噪声修正子单元是否为最后一个。
步骤S136:若否,则将当前第二噪声修正子单元更新为下一个第二噪声修正子单元,将所述待处理特征更新为当前所述第二运算处理结果,并返回执行所述将所述待处理特征输入至当前第二噪声修正子单元中的步骤。
步骤S137:若是,则将所述第一融合后编码特征以及所述文本编码特征各自对应的当前所述第二运算处理结果分别作为所述修正后融合特征以及所述修正后文本编码特征。
也即,本实施例基于预设自注意力机制以及预设跨注意力机制对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以降低对文本中噪声部分的关注度,如此一来,削弱了噪声的影响,实现了抗噪视觉定位。
可见,本申请是通过基于预设自注意力机制和预设跨注意力机制构建的噪声修正单元来进行图文噪声修正处理的,由于在基于跨注意力机制进行处理的过程中,文本相对于图像的差异在图像中无法找到匹配关系,从而降低了对图文噪声成分的关注度,由此削弱了图文噪声对视觉定位准确度的影响,也即实现了抗噪视觉定位。
进一步的,本实施例针对前述实施例步骤S15中的对预设框特征进行修正处理的过程进行详细的介绍和说明。参见图3所示,上述步骤S15中的对预设框特征进行修正处理的过程,具体包括:
步骤S151:将所述目标编码特征以及所述预设框特征输入至所述预设目标框修正单元中的所述第一目标框修正子单元,以便对所述目标编码特征以及所述预设框特征均分别进行自注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述目标编码特征以及所述预设框特征各自对应的第三运算处理结果。
步骤S152:将所述预设目标框修正单元中的第一个第二目标框修正子单元作为当前第二目标框修正子单元,并将所述目标编码特征以及所述预设框特征各自对应的第三运算处理结果均作为当前的待处理特征。
步骤S153:将所述待处理特征输入至当前第二目标框修正子单元中。
步骤S154:利用当前第二目标框修正子单元对所述待处理特征依次进行跨注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述目标编码特征以及所述预设框特征各自对应的当前第四运算处理结果。
步骤S155:判断当前第二目标框修正子单元是否为最后一个。
步骤S156:若否,则将当前第二目标框修正子单元更新为下一个第二目标框修正子单元,将所述待处理特征更新为当前所述第四运算处理结果,并返回执行所述将所述待处理特征输入至当前第二目标框修正子单元中的步骤。
步骤S157:若是,则将当前所述第四运算处理结果作为所述修正后框特征。
也即,本实施例中,基于预设自注意力机制与预设跨注意力机制构建得到所述预设目标框修正单元,如此一来,对预设框特征进行修正,修正后的特征用于预测目标视觉物体在所述目标图像上的区域位置坐标。
图4为当人类语言正确时的传统的视觉定位任务结构示意图,图5为本申请公开的当人类语言存在错误时的抗噪视觉定位任务结构示意图。
图5中的输入文本存在噪声,也即,图中女孩衣服的颜色被说成了红色。抗噪视觉定位任务需要在这种情况下,根据文本中“女孩”、“裙子”等正确的信息推测出“红色”为噪声,进而准确理解文本原本想要描述的目标,并定位其位置坐标。
图6为一种传统的视觉定位方法流程图,首先设置图像编码器和文本编码器对输入的图像以及文本进行编码,并将编码后得到的图像特征以及文本特征进行拼接,然后将拼接后得到的特征进行融合编码得到融合编码特征;设置一组预设框用来模拟框信息,通常,设置一个大小为[n,d]的零矩阵,n表示预设的数量,d表示特征的维度;对融合编码特征和预设框特征进行交叉编码,旨在对预设框特征进行修正;将修正后的框信息进行置信度预测和坐标位置回归,并将置信度最高的框坐标确定为视觉定位系统的最终输出。需要指出的是,由于传统的视觉定位方法在处理过程中无法解决文本中的噪声问题,因此当文本存在噪声时,利用传统的视觉定位方法进行视觉定位存在误差。
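其中,用来模拟框信息的预设框特征可以按如下方式构造(取值仅为示例):

```python
import torch

n, d = 100, 768                      # n 为预设框的数量,d 为特征的维度(示例取值)
preset_box_feat = torch.zeros(n, d)  # 大小为 [n, d] 的零矩阵,作为一组预设框特征
```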
图7为本申请公开的一种具体的抗噪视觉定位方法流程图,较传统方法,本申请加入噪声修正单元来实现抗噪的目的,进而实现抗噪视觉定位的功能。本申请的视觉定位方法流程具体包括图像和文本的编码器、预设图文特征融合单元、预设噪声修正单元、交叉编码单元以及坐标预测器。
(1)、预设图文特征融合单元
预设图文特征融合单元,也称融合编码器,用来融合图像和文本特征进而学习图文之间的匹配关系。该模块有两个作用,一方面可以用来编码图文之间的关系,因此对于文本带噪的情况来说,该编码器可实现图文之间的差异编码,最终用来生成图文匹配关系。另一方面可以用来进行图文特征融合,这对于视觉定位任务本身有意义,因此可放于最终的定位模块之前做编码器;该模块在本申请所涉及的系统中出现两次,第一次侧重描述图文匹配关系,第二次用来融合图文特征,由于两个功能并不冲突,为了节约算力,两个模块权重共享。
本申请所设计的预设图文特征融合单元结构示意图参见图8所示。首先,对图像编码特征f_raw和文本编码特征g_raw进行拼接;之后,将其输入到第一预设数量的图文特征融合子单元中进行编码,并得到最终的融合后编码特征。每个图文特征融合子单元包括一个自注意力层、一个层归一化、一个随机删除层、一个相加模块。在自注意力层中,基于第一预设公式创建预设自注意力机制。
所述第一预设公式为:
attn_self(f) = softmax( (f·W_q)(f·W_k)^T / √size(f) ) · (f·W_v);
其中,f为每个所述自注意力运算单元的输入,W_q、W_k以及W_v表示映射矩阵,size(f)表示维度。
随机删除层用来按一定比例对特征进行随机删除,该层的作用是防止系统过拟合。
(2)、预设噪声修正单元
预设噪声修正单元,也称修正模块,主要功能是修复噪声,是降噪过程中的关键步骤,其输入为预设图文特征融合单元的输出“融合后编码特征”和文本编码器的输出“文本编码特征”,输出为修正后融合特征以及修正后文本编码特征,如图9所示。预设噪声修正单元包括一个第一噪声修正子单元以及第二预设数量的所述第二噪声修正子单元。第一噪声修正子单元包括一个自注意力层、一个层归一化、一个随机删除层以及一个相加模块,其中,在自注意力层中,基于第一预设公式创建预设自注意力机制;第二噪声修正子单元包括一个跨注意力层、一个层归一化、一个随机删除层以及一个相加模块,其中,在跨注意力层中,基于第二预设公式创建预设跨注意力机制。
所述第二预设公式为:
attn_cross(f,g) = softmax( (f·W_q)(g·W_k)^T / √size(g) ) · (g·W_v);
其中,f,g分别表示跨注意力层中每次参与跨注意力运算的两个输入特征,size(g)表示维度。
预设噪声修正单元通过设计跨注意力机制对图文之间的噪声进行削弱。例如,对于图5来说,文本中的“红色”由于无法在图像中找到对应的匹配关系,其注意力会偏低,经过若干层的叠加,模型对“红色”的关注度会越来越低,最终“红色”的噪声不足以影响模型性能。因此,本模块设计了跨注意力层并通过反复叠加实现了降噪的功能。
(3)、预设目标框修正单元
预设目标框修正单元,也称交叉编码模块,用于对预设框特征进行修正,修正后的特征用于预测目标视觉物体在所述目标图像上的区域位置坐标。参见图10所示,预设目标框修正单元包括一个第一预设目标框修正子单元以及第三预设数量的所述第二预设目标框修正子单元。第一预设目标框修正子单元包括一个自注意力层、一个层归一化、一个随机删除层以及一个相加模块,其中,在自注意力层中,基于第一预设公式创建预设自注意力机制;第二预设目标框修正子单元包括一个跨注意力层、一个层归一化、一个随机删除层以及一个相加模块,其中,在跨注意力层中,基于第二预设公式创建预设跨注意力机制。
此外,本申请提出了编码特征的融合,因为对交叉编码模块的输入而言,有两个融合特征均具有视觉定位的潜力:一是修正模块修正后的“修正融合特征”f_modify;二是将修正模块的输出“修正文本特征”与图像编码特征再次使用融合编码器进行编码,得到的“降噪融合特征”f_denoise。因此,本申请在此处提供三种交叉编码的输入设置。1、使用f_modify;2、使用f_denoise;3、将二者进行特征拼接或特征相加。
所述特征拼接与特征相加的公式分别为:
f_cat = [f_modify; f_denoise];
f_cat = f_modify + f_denoise。
(4)、坐标预测器
坐标预测器基于第一全连接层和第二全连接层构建而成,参见图11所示,所述第一全连接层为用于预测初始目标框的置信度的全连接层,第二全连接层为用于对所述初始目标框进行坐标回归处理的全连接层。利用所述坐标预测器以及所述修正后框特征,确定每个所述初始目标框的置信度;对所述置信度进行降序排序,然后从降序排序结果中筛选置信度最高的所述初始目标框,并将筛选出的所述初始目标框的坐标确定为目标视觉物体在所述目标图像上的区域位置坐标。具体的,通过对置信度排序,可以得到修正特征(大小为[n,d])中置信度最高的框k,并将第k个坐标输出作为视觉定位系统的最终输出。
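上述按置信度降序排序并选取最高置信度框坐标的过程,可参考如下示意性片段(数据均为示例):

```python
import torch

conf = torch.rand(100)                         # 每个初始目标框的置信度(示例)
coords = torch.rand(100, 4)                    # 每个初始目标框回归得到的坐标(示例)

order = torch.argsort(conf, descending=True)   # 对置信度进行降序排序
k = order[0]                                   # 置信度最高的框的索引 k
final_box = coords[k]                          # 第 k 个坐标作为视觉定位系统的最终输出
```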
相应的,本申请实施例还公开了一种视觉定位装置,参见图12所示,该装置包括:
特征拼接模块11,用于对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;
第一特征融合模块12,用于利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;
噪声修正模块13,用于利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元基于预设自注意力机制和预设跨注意力机制构建而成;
第二特征融合模块14,用于将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征;
目标框修正单元15,用于通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
其中,关于上述各个模块更加具体的工作过程可以参考前述实施例中公开的相应内容,在此不再进行赘述。
可见,本申请公开了一种视觉定位方法,包括:对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元;将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征;通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元,可见,本申请是通过基于预设自注意力机制和预设跨注意力机制构建的噪声修正单元来进行图文噪声修正处理的,由于在基于跨注意力机制进行处理的过程中,文本相对于图像的差异在图像中无法找到匹配关系,从而降低了对图文噪声成分的关注度,由此削弱了图文噪声对视觉定位准确度的影响,也即实现了抗噪视觉定位。
进一步的,本申请实施例还提供了一种电子设备。图13是根据一示例性实施例示出的电子设备20结构图,图中的内容不能认为是对本申请的使用范围的任何限制。
图13为本申请实施例提供的一种电子设备20的结构示意图。该电子设备20,具体可以包括:至少一个处理器21、至少一个存储器22、显示屏23、输入输出接口24、通信接口25、电源26、和通信总线27。其中,所述存储器22用于存储计算机可读指令,所述计算机可读指令由所述处理器21加载并执行,以实现前述任一实施例公开的视觉定位方法中的相关步骤。另外,本实施例中的电子设备20具体可以为电子计算机。
本实施例中,电源26用于为电子设备20上的各硬件设备提供工作电压;通信接口25能够为电子设备20创建与外界设备之间的数据传输通道,其所遵循的通信协议是能够适用于本申请技术方案的任意通信协议,在此不对其进行具体限定;输入输出接口24,用于获取外界输入数据或向外界输出数据,其具体的接口类型可以根据具体应用需要进行选取,在此不进行具体限定。
另外,存储器22作为资源存储的载体,可以是只读存储器、随机存储器、磁盘或者光盘等,其上所存储的资源可以包括计算机可读指令221,存储方式可以是短暂存储或者永久存储。其中,计算机可读指令221除了包括能够用于完成前述任一实施例公开的由电子设备20执行的视觉定位方法的计算机可读指令之外,还可以进一步包括能够用于完成其他特定工作的计算机可读指令。
进一步的,本申请实施例还公开了一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,用于存储计算机可读指令;其中,所述计算机可读指令被一个或多个处理器执行时实现前述公开的视觉定位方法。
关于该方法的具体步骤可以参考前述实施例中公开的相应内容,在此不再进行赘述。
本申请书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
以上对本申请所提供的一种视觉定位方法、装置、设备、存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种视觉定位方法,其特征在于,包括:
    对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;
    利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;
    利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元;
    将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征;和
    通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
  2. 根据权利要求1所述的视觉定位方法,其特征在于,所述利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理之前,还包括:
    利用基于预设自注意力机制构建的自注意力运算单元、层归一化单元、特征删除单元以及特征相加单元构建图文特征融合子单元;
    通过将第一预设数量的所述图文特征融合子单元进行依次串接,以构建得到所述预设图文特征融合单元;
    相应的,所述利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征,包括:
    将所述预设图文特征融合单元中的第一个图文特征融合子单元作为当前图文特征融合子单元,并将所述拼接后编码特征作为待处理特征;
    将所述待处理特征输入至当前图文特征融合子单元中;
    利用当前图文特征融合子单元对所述待处理特征依次进行自注意力运算、层归一化运算、特征删除运算以及特征相加运算,以得到相应的当前运算处理结果;和
    根据所述当前运算处理结果获得所述第一融合后编码特征。
  3. 根据权利要求2所述的视觉定位方法,其特征在于,所述根据所述当前运算处理结果获得所述第一融合后编码特征,包括:
    判断当前图文特征融合子单元是否为最后一个;
    响应于当前图文特征融合子单元不为最后一个,将当前图文特征融合子单元更新为下一个图文特征融合子单元,将所述待处理特征更新为当前所述运算处理结果,并返回执行所述将所述待处理特征输入至当前图文特征融合子单元中的步骤;和
    响应于当前图文特征融合子单元为最后一个,将当前所述运算处理结果作为所述第一融合后编码特征。
  4. 根据权利要求2所述的视觉定位方法,其特征在于,所述利用当前图文特征融合子单元对所述待处理特征依次进行自注意力运算、层归一化运算、特征删除运算以及特征相加运算,以得到相应的当前运算处理结果,包括:
    利用当前图文特征融合子单元中的所述自注意力运算单元对所述待处理特征进行自注意力运算,得到第一运算特征;
    利用当前图文特征融合子单元中的所述层归一化单元对所述第一运算特征进行层归一化处理,得到第二运算特征;
    利用当前图文特征融合子单元中的所述特征删除单元并按照预设比例对所述第二运算特征进行特征删除运算,以得到第三运算特征;和
    利用当前图文特征融合子单元中的所述特征相加单元对所述第三运算特征与所述待处理特征进行特征相加运算,以得到当前图文特征融合子单元中的所述运算处理结果。
  5. 根据权利要求1所述的视觉定位方法,其特征在于,所述利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理之前,还包括:
    利用基于预设自注意力机制构建的自注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第一噪声修正子单元;
    利用基于预设跨注意力机制构建的跨注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第二噪声修正子单元;和
    通过将所述第一噪声修正子单元以及第二预设数量的所述第二噪声修正子单元进行依次串接,以构建得到所述预设噪声修正单元。
  6. 根据权利要求5所述的视觉定位方法,其特征在于,所述利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,包括:
    将所述第一融合后编码特征以及所述文本编码特征输入至所述预设噪声修正单元中的所述第一噪声修正子单元,以便对所述第一融合后编码特征以及所述文本编码特征均分别进行自注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述第一融合后编码特征以及所述文本编码特征各自对应的第一运算处理结果;
    将所述预设噪声修正单元中的第一个第二噪声修正子单元作为当前第二噪声修正子单元,并将所述第一融合后编码特征以及所述文本编码特征各自对应的第一运算处理结果均作为当前的待处理特征;
    将所述待处理特征输入至当前第二噪声修正子单元中;
    利用当前第二噪声修正子单元对所述待处理特征依次进行跨注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述第一融合后编码特征以及所述文本编码特征各自对应的当前第二运算处理结果;和
    根据所述当前第二运算处理结果获得所述修正后文本编码特征。
  7. 根据权利要求6所述的视觉定位方法,其特征在于,所述根据所述当前第二运算处理结果获得所述修正后文本编码特征,包括:
    判断当前第二噪声修正子单元是否为最后一个;
    响应于当前第二噪声修正子单元不为最后一个,将当前第二噪声修正子单元更新为下一个第二噪声修正子单元,将所述待处理特征更新为当前所述第二运算处理结果,并返回执行所述将所述待处理特征输入至当前第二噪声修正子单元中的步骤;和
    响应于当前第二噪声修正子单元为最后一个,将所述第一融合后编码特征以及所述文本编码特征各自对应的当前所述第二运算处理结果分别作为所述修正后融合特征以及所述修正后文本编码特征。
  8. 根据权利要求1所述的视觉定位方法,其特征在于,所述通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理之前,还包括:
    利用基于预设自注意力机制构建的自注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第一目标框修正子单元;
    利用基于预设跨注意力机制构建的跨注意力运算单元、特征删除单元、层归一化单元、特征相加单元构建第二目标框修正子单元;和
    通过将所述第一目标框修正子单元以及第三预设数量的所述第二目标框修正子单元进行依次串接,以构建得到所述预设目标框修正单元。
  9. 根据权利要求8所述的视觉定位方法,其特征在于,所述通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,包括:
    将所述目标编码特征以及所述预设框特征输入至所述预设目标框修正单元中的所述第一目标框修正子单元,以便对所述目标编码特征以及所述预设框特征均分别进行自注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述目标编码特征以及所述预设框特征各自对应的第三运算处理结果;
    将所述预设目标框修正单元中的第一个第二目标框修正子单元作为当前第二目标框修正子单元,并将所述目标编码特征以及所述预设框特征各自对应的第三运算处理结果均作为当前的待处理特征;
    将所述待处理特征输入至当前第二目标框修正子单元中;
    利用当前第二目标框修正子单元对所述待处理特征依次进行跨注意力运算、特征删除运算、层归一化运算以及特征相加运算,以得到所述目标编码特征以及所述预设框特征各自对应的当前第四运算处理结果;和
    根据所述当前第四运算处理结果获得所述修正后框特征。
  10. 根据权利要求9所述的视觉定位方法,其特征在于,所述根据所述当前第四运算处理结果获得所述修正后框特征,包括:
    判断当前第二目标框修正子单元是否为最后一个;
    响应于当前第二目标框修正子单元不为最后一个,将当前第二目标框修正子单元更新为下一个第二目标框修正子单元,将所述待处理特征更新为当前所述第四运算处理结果,并返回执行所述将所述待处理特征输入至当前第二目标框修正子单元中的步骤;和
    响应于当前第二目标框修正子单元为最后一个,将当前所述第四运算处理结果作为所述修正后框特征。
  11. 根据权利要求1所述的视觉定位方法,其特征在于,所述利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理之前,还包括:
    将所述修正后融合特征或所述第二融合后编码特征或所述修正后融合特征与所述第二融合后编码特征进行预设运算处理后得到的特征确定为所述目标编码特征;其中,所述预设运算处理包括所述修正后融合特征与所述第二融合后编码特征进行特征相加或特征拼接。
  12. 根据权利要求11所述的视觉定位方法,其特征在于,所述特征相加的公式为:
    f_cat = f_modify + f_denoise;
    其中,f_modify为修正融合特征,f_denoise为降噪融合特征,f_cat为特征相加后的输出。
  13. 根据权利要求11所述的视觉定位方法,其特征在于,所述特征拼接的公式为:
    f_cat = [f_modify; f_denoise];
    其中,f_modify为修正融合特征,f_denoise为降噪融合特征,f_cat为特征拼接后的输出。
  14. 根据权利要求1至13任一项所述的视觉定位方法,其特征在于,所述利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,包括:
    将所述修正后框特征输入至基于第一全连接层和第二全连接层构建的坐标预测器;所述第一全连接层为用于预测初始目标框的置信度的全连接层,第二全连接层为用于对所述初始目标框进行坐标回归处理的全连接层;
    利用所述坐标预测器以及所述修正后框特征,确定每个所述初始目标框的置信度;和
    根据所述每个所述初始目标框的置信度以及每个所述初始目标框的坐标确定目标视觉物体在所述目标图像上的区域位置坐标。
  15. 根据权利要求14所述的视觉定位方法,其特征在于,所述根据所述每个所述初始目标框的置信度以及每个所述初始目标框的坐标确定目标视觉物体在所述目标图像上的区域位置坐标,包括:
    对所述置信度进行降序排序,然后从降序排序结果中筛选置信度最高的所述初始目标框,并将筛选出的所述初始目标框的坐标确定为目标视觉物体在所述目标图像上的区域位置坐标。
  16. 根据权利要求1所述的视觉定位方法,其特征在于,所述方法还包括:
    基于第一预设公式创建所述预设自注意力机制;
    其中,所述第一预设公式为:
    attn_self(f) = softmax( (f·W_q)(f·W_k)^T / √size(f) ) · (f·W_v);
    其中,f为每个所述预设自注意力机制的输入,W_q、W_k以及W_v表示映射矩阵,size(f)表示维度,attn_self(f)表示所述预设自注意力机制的输出,softmax表示激活函数。
  17. 根据权利要求1所述的视觉定位方法,其特征在于,所述方法还包括:
    基于第二预设公式创建所述预设跨注意力机制;
    其中,所述第二预设公式为:
    attn_cross(f,g) = softmax( (f·W_q)(g·W_k)^T / √size(g) ) · (g·W_v);
    其中,f,g分别表示预设跨注意力机制中跨注意力层中每次参与跨注意力运算的两个输入特征,size(g)表示维度,attn_cross(f,g)表示所述预设跨注意力机制的输出,softmax表示激活函数。
  18. 一种视觉定位装置,其特征在于,包括:
    特征拼接模块,用于对目标图像以及目标文本分别进行编码,并将编码后得到的图像编码特征以及文本编码特征进行特征拼接以得到拼接后编码特征;
    第一特征融合模块,用于利用基于预设自注意力机制构建的预设图文特征融合单元对所述拼接后编码特征进行图文特征融合处理,以得到第一融合后编码特征;
    噪声修正模块,用于利用预设噪声修正单元分别对所述第一融合后编码特征以及所述文本编码特征进行图文噪声修正处理,以得到修正后融合特征以及修正后文本编码特征,所述预设噪声修正单元基于预设自注意力机制和预设跨注意力机制构建而成;
    第二特征融合模块,用于将所述拼接后编码特征以及所述修正后文本编码特征输入至所述预设图文特征融合单元以得到第二融合后编码特征;和
    目标框修正单元,用于通过预设目标框修正单元,并利用基于所述修正后融合特征和所述第二融合后编码特征确定的目标编码特征,对预设框特征进行修正处理,并利用修正后框特征预测目标视觉物体在所述目标图像上的区域位置坐标,所述预设目标框修正单元为基于预设自注意力机制和预设跨注意力机制构建的单元。
  19. 一种电子设备,其特征在于,包括存储器及一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1-17任意一项所述方法的步骤。
  20. 一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1-17任意一项所述方法的步骤。
PCT/CN2022/122335 2022-04-19 2022-09-28 一种视觉定位方法、装置、设备及介质 WO2023201990A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210407177.8 2022-04-19
CN202210407177.8A CN114511472B (zh) 2022-04-19 2022-04-19 一种视觉定位方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2023201990A1 true WO2023201990A1 (zh) 2023-10-26


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122335 WO2023201990A1 (zh) 2022-04-19 2022-09-28 一种视觉定位方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN114511472B (zh)
WO (1) WO2023201990A1 (zh)



Also Published As

Publication number Publication date
CN114511472B (zh) 2022-07-08
CN114511472A (zh) 2022-05-17

