WO2024045444A1 - Processing method, apparatus and device for visual question answering task, and non-volatile readable storage medium - Google Patents

Processing method, apparatus and device for visual question answering task, and non-volatile readable storage medium Download PDF

Info

Publication number
WO2024045444A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
features
target detection
fusion
Prior art date
Application number
PCT/CN2022/142512
Other languages
English (en)
French (fr)
Inventor
李仁刚
张润泽
赵雅倩
郭振华
范宝余
李晓川
Original Assignee
苏州浪潮智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2024045444A1 publication Critical patent/WO2024045444A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/19007Matching; Proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content

Definitions

  • the present application relates to the field of image processing technology, and in particular to a processing method, device, equipment and non-volatile readable storage medium for visual question and answer tasks.
  • Visual question answering means that given an image and a natural language question related to the image, the computer can generate a correct answer.
  • Visual question answering has now become an interesting test for evaluating the reasoning and generalization capabilities of computational models. It involves visual recognition, logic, arithmetic, spatial reasoning, intuitive physics, cause and effect, and multi-hop reasoning, and it requires combining two qualitatively different modalities: image and language. The high-dimensional visual modality contains a lot of useless information, so attention must be focused on the information most relevant to the underlying reasoning question; this also requires identifying key regions or objects and connecting them with the question.
  • Vision plays a very important role in multi-modal understanding tasks: given a question, visual clues need to be found in order to locate the corresponding answer.
  • Visual clues come from the semantic features of pictures and mainly take two forms: one comes directly from an image classification network; the other comes from the coordinate frames obtained by target detection.
  • The current mainstream multimodal understanding models usually choose the second form. However, current implementations do not take into account the trade-off between the quality and the quantity of the detected coordinate frames.
  • The number of coordinate boxes can be limited using a classification confidence threshold, but the result then depends heavily on that threshold. If the threshold is too small, there are too many coordinate boxes and a lot of redundant information, which undoubtedly adds noise to the subsequent VQA (Visual Question Answering) model; if the threshold is too large, the number of coordinate boxes is too few, and coordinate boxes that are directly related to the question or indirectly related to the reasoning may be filtered out. Regarding the quality of the coordinate frames, only a coordinate frame that is directly or indirectly related to the question can be called a high-quality coordinate frame. The visual clues extracted by traditional target detection based on classification confidence thresholds often contain many redundant target frames, resulting in poor performance on visual question answering tasks.
  • the purpose of the embodiments of this application is to provide a processing method, device, equipment and non-volatile readable storage medium for visual question and answer tasks, which can improve the performance of visual question and answer tasks.
  • embodiments of the present application provide a processing method for visual question and answer tasks, including:
  • fusion features include the coordinate information of each detection frame
  • target detection frames that meet the correlation requirements are selected from the fusion features
  • selecting target detection frames that meet the correlation requirements from the fusion features includes:
  • selecting target detection frames that meet the correlation requirements from the fusion features includes:
  • the trained target detection model is used to select target detection frames that meet the correlation requirements from the fusion features; wherein the target detection model is trained based on historical images and historical texts.
  • the method includes:
  • the initial detection model is trained to identify positive and negative samples
  • the loss function includes the initial loss function and the loss function corresponding to the positive and negative samples
  • the respective initialization weights of the language encoding module and the fusion module included in the initial detection model and the corresponding weight parameters of the initial detection model are adjusted to obtain a trained target detection model.
  • training the initial detection model to identify positive and negative samples includes:
  • the loss function corresponding to the positive and negative samples is determined
  • determining the loss function corresponding to the positive and negative samples includes:
  • N represents the total number of samples
  • y i represents the value corresponding to the sample label of the i-th sample
  • y i is 1 when the sample label is a positive sample
  • y i is 0 when the sample label is a negative sample
  • w + represents the threshold corresponding to the positive sample
  • p i represents the probability value of the i-th sample belonging to the positive sample
  • w - represents the threshold corresponding to the negative sample.
  • the method includes:
  • the initial visual question answering model is trained using the coordinate information, classification categories and semantic features corresponding to the positive samples to obtain the trained visual question answering model.
  • performing feature fusion processing on the image to be analyzed and the first text to obtain the fusion features includes:
  • using the target detection module of the target detection model to extract image features of the image to be analyzed; where the image features include image features corresponding to multiple detection frames;
  • the fusion module of the target detection model is used to fuse image features and text features to obtain fusion features.
  • the first text is the question text
  • the second text is the answer text matching the question text.
  • the first text is a plurality of question texts
  • the second text is an answer text that matches each question text
  • the target detection frames that meet the correlation requirements are selected from the fusion features, including:
  • performing feature fusion processing on the image to be analyzed and the first text to obtain the fusion features includes:
  • Fusion features are obtained by fusing image features and text features.
  • Embodiments of the present application also provide a processing device for visual question and answer tasks, including a fusion unit, a screening unit and an obtaining unit;
  • the fusion unit is used to perform feature fusion processing on the image to be analyzed and the first text to obtain fusion features; where the fusion features include coordinate information of each detection frame;
  • the screening unit is used to select target detection frames that meet the correlation requirements from the fusion features based on the correlation between the image to be analyzed and the first text;
  • the obtaining unit is used to input the coordinate information, classification categories and semantic features corresponding to the target detection frame into the trained visual question and answer model to obtain the second text that matches the first text; wherein the first text and the second text have a logical correspondence.
  • the screening unit includes a calculation subunit and a selection subunit
  • the calculation subunit is used to calculate the intersection ratio of each image detection frame contained in the image feature of the image to be analyzed and the text detection frame corresponding to the text feature of the first text;
  • the selection subunit is used to select target detection frames whose intersection and union ratio are greater than a preset threshold from all image detection frames.
  • the screening unit is used to use the trained target detection model to select target detection frames that meet the correlation requirements from the fusion features; wherein the target detection model is trained based on historical images and historical texts.
  • the device includes a training unit, a discrimination unit, a calculation unit and an adjustment unit;
  • the training unit is used to train the initial detection model using the target detection data set to obtain the weight parameters corresponding to the initial detection model;
  • the discriminating unit is used to perform positive and negative sample discrimination training on the initial detection model based on the sample labels corresponding to each sample in the target detection data set;
  • the calculation unit is used to calculate the loss function of the initial detection model after completing the positive and negative sample discrimination training; wherein the loss function includes the initial loss function and the loss function corresponding to the positive and negative samples;
  • the adjustment unit is used to adjust the respective initialization weights of the language coding module and the fusion module included in the initial detection model and the corresponding weight parameters of the initial detection model based on the loss function of the initial detection model to obtain a trained target detection model.
  • the discrimination unit includes an identification subunit, a determination subunit and a parameter adjustment subunit;
  • the identification subunit is used to use the initial detection model to identify the probability value corresponding to each sample in the target detection data set;
  • the determination subunit is used to determine the loss function corresponding to the positive and negative samples based on the sample label and probability value corresponding to each sample in the target detection data set;
  • the parameter adjustment subunit is used to adjust the parameters corresponding to the fusion module in the initial detection model based on the loss function corresponding to positive and negative samples to complete the discrimination training of positive and negative samples.
  • the determination subunit is used to input the sample label and probability value corresponding to each sample in the target detection data set into the positive and negative sample loss function calculation formula to determine the loss function corresponding to the positive and negative samples; wherein the positive and negative sample loss function calculation formula is:
  • N represents the total number of samples
  • y i represents the value corresponding to the sample label of the i-th sample
  • y i is 1 when the sample label is a positive sample
  • y i is 0 when the sample label is a negative sample
  • w + represents the threshold corresponding to the positive sample
  • p i represents the probability value of the i-th sample belonging to the positive sample
  • w - represents the threshold corresponding to the negative sample.
  • the device includes a question and answer training unit;
  • the screening unit is also used to use the trained target detection model to screen out positive samples from the target detection data set;
  • the question and answer training unit is used to train the initial visual question and answer model using the coordinate information, classification categories and semantic features corresponding to the positive samples to obtain a trained visual question and answer model.
  • the fusion unit includes an extraction subunit, a coding subunit and a feature fusion subunit;
  • the extraction subunit is used to extract image features of the image to be analyzed by using the target detection module of the target detection model; wherein the image features include image features corresponding to multiple detection frames;
  • the encoding subunit is used to use the language encoding module of the target detection model to perform feature encoding on the first text to obtain text features;
  • the feature fusion subunit is used to use the fusion module of the target detection model to fuse image features and text features to obtain fusion features.
  • the first text is the question text
  • the second text is the answer text matching the question text.
  • the first text is a plurality of question texts
  • the second text is an answer text that matches each question text
  • the screening unit is used to use the trained target detection model to perform parallel analysis on the image to be analyzed and multiple question texts to obtain the target detection frame corresponding to each question text.
  • the fusion unit includes an extraction subunit, a coding subunit and a feature fusion subunit;
  • the extraction subunit is used to extract image features of the image to be analyzed; wherein the image features include image features corresponding to multiple detection frames;
  • the encoding subunit is used to encode features of the first text to obtain text features
  • the feature fusion subunit is used to fuse image features and text features to obtain fused features.
  • Embodiments of the present application also provide a terminal device, including a display screen, an input interface, and a processor connected to the display screen and the input interface respectively;
  • An input interface for receiving the image to be analyzed and the first text
  • a processor configured to perform feature fusion processing on the image to be analyzed and the first text to obtain fusion features, where the fusion features include coordinate information of each detection frame; select, from the fusion features, target detection frames that meet the correlation requirements based on the correlation between the image to be analyzed and the first text; and input the coordinate information, classification categories and semantic features corresponding to the target detection frames into the trained visual question and answer model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence;
  • a display screen used to display the first text and its corresponding second text.
  • An embodiment of the present application also provides an electronic device, including:
  • Memory used to store computer programs
  • the processor is configured to execute a computer program to implement the steps of the above method for processing the visual question and answer task.
  • Embodiments of the present application also provide a non-volatile readable storage medium.
  • a computer program is stored on the non-volatile readable storage medium.
  • when the computer program is executed by a processor, the steps of the processing method for the visual question and answer task described above are implemented.
  • the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features, where the fusion features include the coordinate information of each detection frame;
  • each detection frame has its corresponding image information, and the number of detection frames corresponding to the fusion features is often large; the detection frames include both detection frames with strong correlation with the first text and detection frames with weak correlation with the first text;
  • therefore, the target detection frames that meet the correlation requirements can be selected from the fusion features based on the correlation between the image to be analyzed and the first text;
  • the coordinate information, classification categories and semantic features corresponding to the target detection frames are then input into the trained visual question and answer model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence.
  • Figure 1 is a schematic diagram of a hardware composition framework suitable for a processing method for visual question and answer tasks provided by an embodiment of the present application;
  • Figure 2 is a schematic diagram of a hardware composition framework suitable for another visual question and answer task processing method provided by an embodiment of the present application;
  • Figure 3 is a flow chart of a method for processing a visual question and answer task provided by an embodiment of the present application
  • Figure 4 is a network structure diagram of a target detection model provided by an embodiment of the present application.
  • Figure 5 is a flow chart of a training method for a target detection model provided by an embodiment of the present application.
  • Figure 6 is a fusion module network structure diagram provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of parallel processing of different visual question and answer tasks on a mobile phone according to an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a processing device for a visual question and answer task provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a hardware composition framework suitable for a visual question and answer task processing method provided by an embodiment of the present application.
  • the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
  • the processor 101 is used to control the overall operation of the electronic device 100 to complete all or part of the steps in the processing method of the visual question and answer task;
  • the memory 102 is used to store various types of data to support the operation of the electronic device 100.
  • the data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data.
  • the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as one or more of static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks or optical disks.
  • the memory 102 stores at least programs and/or data for implementing the following functions:
  • fusion features include coordinate information of each detection frame
  • target detection frames that meet the correlation requirements are selected from the fusion features
  • Multimedia components 103 may include screen and audio components.
  • the screen may be a touch screen, for example, and the audio component is used to output and/or input audio signals.
  • the audio component may include a microphone for receiving external audio signals.
  • the received audio signals may be further stored in memory 102 or sent via communication component 105 .
  • the audio component also includes at least one speaker for outputting audio signals.
  • the information input/information output (I/O) interface 104 provides an interface between the processor 101 and other interface modules.
  • the other interface modules may be a keyboard, a mouse, a button, etc. These buttons can be virtual buttons or physical buttons.
  • the communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, one or a combination of Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, so the corresponding communication component 105 may include a Wi-Fi part, a Bluetooth part and an NFC part.
  • the electronic device 100 may be implemented by one or more of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a controller, a microcontroller, a microprocessor or other electronic components, so as to perform the processing method for the visual question and answer task.
  • the structure of the electronic device 100 shown in FIG. 1 does not constitute a limitation on the electronic device in the embodiment of the present application.
  • the electronic device 100 may include more or fewer components than those shown in FIG. 1, or combine certain components.
  • the number of electronic devices is not limited in the embodiment of the present application, and multiple electronic devices may cooperate to complete the processing method of the visual question and answer task.
  • FIG. 2 is a schematic diagram of a hardware composition framework suitable for another visual question and answer task processing method provided by an embodiment of the present application.
  • the hardware composition framework may include: a first electronic device 11 and a second electronic device 12 , which are connected through a network 13 .
  • the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in FIG. 1 . That is, it can be understood that there are two electronic devices 100 in this embodiment, and the two perform data exchange.
  • the form of the network 13 is not limited in the embodiment of the present application. That is, the network 13 can be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.
  • the first electronic device 11 and the second electronic device 12 may be the same type of electronic device; for example, both may be servers. They may also be different types of electronic devices; for example, the first electronic device 11 may be a smartphone or other intelligent terminal, and the second electronic device 12 may be a server.
  • a server with strong computing power can be used as the second electronic device 12 to improve data processing efficiency and reliability, thereby improving the processing efficiency of model training and/or visual question answering.
  • a smartphone with low cost and wide application range is used as the first electronic device 11 to realize the interaction between the second electronic device 12 and the user.
  • the interaction process may be as follows: the first electronic device 11 transmits the image to be analyzed and the first text to the second electronic device 12; the second electronic device 12 performs feature fusion processing on the image to be analyzed and the first text to obtain fusion features, where the fusion features include the coordinate information of each detection frame; based on the correlation between the image to be analyzed and the first text, target detection frames that meet the correlation requirements are selected from the fusion features; the coordinate information, classification categories and semantic features corresponding to the target detection frames are input into the trained visual question and answer model to obtain a second text that matches the first text, and the second text is fed back to the first electronic device 11.
  • Figure 3 is a flow chart of a method for processing a visual question and answer task provided by an embodiment of the present application. The method includes:
  • S301 Perform feature fusion processing on the image to be analyzed and the first text to obtain fusion features.
  • In traditional schemes, target detection pre-trained weights are usually used to directly run inference on all pictures in the data set, and the detection frames, classifications, classification confidences and extracted coordinate-frame semantic features are computed for each picture. The corresponding detection frames are then selected by setting a threshold on the classification confidence or by setting the number of detection frames output for each image.
  • the detection frame refers to the location area of the target in the picture.
  • the target object may be a person or object associated with the text, or may be a person or object not associated with the text.
  • For example, a picture contains a girl, a dog and the sky; the girl, the dog and the sky can all be used as target objects.
  • the detection frames corresponding to the target objects can then include the location area of the girl, the location area of the dog and the location area of the sky.
  • the number of detection frames generated in the traditional method is often large.
  • the method of selecting detection frames according to a set threshold or a set number cannot guarantee that the selected detection frames have strong correlation with the text, so the answer subsequently generated by the visual question and answer model for the text may also be inappropriate.
  • feature fusion processing can therefore be performed on the image to be analyzed and the first text to obtain the fusion features, so that the detection frames can be screened based on the fusion features and the detection frames that are weakly correlated with the text can be deleted.
  • the image features of the image to be analyzed can be extracted; where the image features include image features corresponding to multiple detection frames.
  • Feature encoding is performed on the first text to obtain text features; image features and text features are fused to obtain fusion features.
  • the fused features contain coordinate information of each detection frame.
  • the image to be analyzed can be any picture
  • the first text can be a question raised about the image to be analyzed. For example, if the picture contains a girl and a dog sitting and playing on the beach, the first text could be "Where is the woman sitting".
  • the fusion feature can be obtained based on the image features of the image to be analyzed and the text features of the first text.
  • Both image features and text features can be presented in the form of detection boxes.
  • the correlation between the image to be analyzed and the first text can be evaluated based on the IOU value (Intersection Over Union) between the detection frames.
  • intersection and union ratio of each image detection frame contained in the image feature and the text detection frame corresponding to the text feature can be calculated; a target detection frame whose intersection and union ratio is greater than a preset threshold is selected from all image detection frames.
  • the value of the preset threshold can be flexibly set according to actual needs, for example, it can be set to 0.5.
  • Each image detection frame is processed in a similar manner. Taking one image detection frame as an example, the IOU value of the image detection frame and the text detection frame can be calculated. If the IOU value is greater than 0.5, it indicates that the image detection frame has a strong correlation with the text detection frame and is a positive sample; at this time, the image detection frame can be used as a target detection frame and participate in the subsequent analysis process.
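  • As a purely illustrative sketch (not part of the original disclosure), the intersection-over-union screening described above can be written in a few lines of Python; the [x1, y1, x2, y2] box format and the 0.5 threshold follow the description, while the function names are hypothetical:

```python
def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; compute the intersection area first.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_target_boxes(image_boxes, text_box, threshold=0.5):
    # Keep only the image detection frames whose IOU with the text-related
    # detection frame exceeds the preset threshold (the positive samples).
    return [box for box in image_boxes if iou(box, text_box) > threshold]
```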
  • S303 Input the coordinate information, classification categories and semantic features corresponding to the target detection frame into the trained visual question and answer model to obtain a second text that matches the first text.
  • the first text and the second text have a logical correspondence.
  • the first text may be question text and the second text may be answer text.
  • the coordinate information, classification categories and semantic features corresponding to the target detection frame can be extracted through the Feed-Forward Network (FFN) module.
  • the visual question and answer model can use VINVL (Visual representations in Vision-Language Models, visual representation in visual language model) or LXMERT (Learning Cross-Modality Encoder Representations from Transformers, learning cross-modal encoder representation from transformers).
  • the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features, where the fusion features include the coordinate information of each detection frame;
  • each detection frame has its corresponding image information, and the number of detection frames corresponding to the fusion features is often large; the detection frames include both detection frames with strong correlation with the first text and detection frames with weak correlation with the first text;
  • therefore, the target detection frames that meet the correlation requirements can be selected from the fusion features based on the correlation between the image to be analyzed and the first text;
  • the coordinate information, classification categories and semantic features corresponding to the target detection frames are then input into the trained visual question and answer model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence.
  • the target detection model and the visual question and answer model can be combined to realize the processing of the visual question and answer task.
  • the target detection model can analyze the image to be analyzed and the first text, thereby screening out target detection frames that meet the correlation requirements, and extracting the coordinate information, classification categories and semantic features corresponding to the target detection frames.
  • the target detection model can be trained based on historical images and historical texts.
  • the trained target detection model can be used to perform feature fusion processing on the image to be analyzed and the first text, thereby obtaining the fusion features, and selecting target detection frames that meet the correlation requirements from the fusion features.
  • In an implementation, the target detection model can be built on DETR (DEtection TRansformer, a transformer-based target detection network).
  • This model uses the recently popular transformer structure to transform target detection into a bipartite matching problem between detection frames and standard frames (Ground Truth).
  • FIG. 4 is a network structure diagram of a target detection model provided by an embodiment of the present application.
  • the target detection model includes a backbone network, an encoding module (Transformer encoder), a decoding module (Transformer decoder), a fusion module, a feed-forward network module (Feed-Forward Network, FFN) and a language encoding module (Roberta).
  • the backbone network, encoding module and decoding module can extract image features.
  • the language encoding module can extract text features.
  • the fusion module can fuse image features and text features to filter out target detection frames.
  • the forward propagation network module can extract the coordinate information, classification categories and semantic features of the target detection frame.
  • the backbone network, encoding module and decoding module can be used as the target detection module.
  • the target detection module of the target detection model is used to extract image features of the image to be analyzed; where the image features may include image features corresponding to multiple detection frames.
  • the language coding module of the target detection model is used to perform feature encoding on the first text to obtain text features; the fusion module of the target detection model is used to fuse the image features and text features to obtain fusion features.
  • FIG. 5 is a flow chart of a training method for the target detection model provided by an embodiment of the present application. The method includes:
  • S501 Use the target detection data set to train the initial detection model to obtain the weight parameters corresponding to the initial detection model.
  • Target detection data sets can include the COCO (Common Objects in Context; image recognition, segmentation and image semantics) data set, the Visual Genome data set, the Objects365 data set, etc.
  • the image first passes through the backbone network, that is, a CNN (Convolutional Neural Network), to extract features, and position coding features are added at the same time.
  • the position coding features are adaptively obtained according to the resolution of the image, and their purpose is to obtain the local position information of the image feature map.
  • The transformer encoder is used to encode the image features. Learnable initialized embedding parameters (queries) are set, and the corresponding target positions and classifications are decoded from the encoded image features. These queries are equivalent to adaptive anchor (predefined anchor points for target detection) information, and the detection position and corresponding category of each object are decoded through the decoder.
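  • The query-based decoding described above can be sketched roughly as follows; this is a simplified PyTorch illustration assuming the stated setup of 100 learnable queries, and the layer sizes, class name and head structure are assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn

class QueryDecoderSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        # Learnable initialized embedding parameters ("queries"),
        # which act like adaptive anchors.
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # [x1, y1, x2, y2]

    def forward(self, encoded_image_features):
        # encoded_image_features: (batch, num_tokens, d_model) from the encoder.
        batch = encoded_image_features.size(0)
        queries = self.queries.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(queries, encoded_image_features)
        # Decode a detection position and a corresponding category per query.
        return self.class_head(decoded), self.box_head(decoded).sigmoid()
```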
  • Bipartite matching (bipartite maximum matching) is introduced during the training process to complete the matching between the Ground Truth coordinate frames and the detection frames.
  • the matching strategy is as follows:
  • y i represents the Ground Truth coordinate frame
  • y i pred represents the detection frame
  • the Hungarian matching algorithm is used to match the detection frames and the coordinate frames
  • argmin denotes taking the values of y i and y i pred at which the matching cost reaches its minimum value
  • L match represents the matching degree between a detection frame and a coordinate frame.
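  • A minimal sketch of the bipartite matching step, using SciPy's linear_sum_assignment as the Hungarian solver; the cost matrix here is a placeholder, since the exact L match formula of the original publication is not reproduced in this text:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(cost_matrix):
    # cost_matrix[i, j]: matching cost (L match) between Ground Truth frame i
    # and predicted detection frame j; lower cost means a better match.
    gt_idx, pred_idx = linear_sum_assignment(cost_matrix)
    return list(zip(gt_idx.tolist(), pred_idx.tolist()))

# Toy example with 2 Ground Truth frames and 3 predicted detection frames.
cost = np.array([[0.2, 0.9, 0.5],
                 [0.8, 0.1, 0.7]])
print(match_detections(cost))  # [(0, 0), (1, 1)]
```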
  • positive samples can be image features with strong correlation with the question
  • negative samples can be image features with weak correlation with the question.
  • the initial detection model can be used to identify the probability value corresponding to each sample in the target detection data set. The higher the probability value, the stronger the correlation between the image features contained in the sample and the problem.
  • the samples in the target detection data set can be detection frames corresponding to each picture in the target detection data set, and each detection frame has its corresponding image features.
  • based on the sample label and probability value corresponding to each sample in the target detection data set, the loss function corresponding to the positive and negative samples can be determined; based on this loss function, the parameters corresponding to the fusion module in the initial detection model are adjusted to complete the positive and negative sample discrimination training.
  • For the determination of the positive and negative sample loss function, a positive and negative sample loss function calculation formula can be set, and the sample label and probability value corresponding to each sample in the target detection data set are input into this calculation formula to determine the loss function corresponding to the positive and negative samples; the positive and negative sample loss function calculation formula is:
  • N represents the total number of samples
  • y i represents the value corresponding to the sample label of the i-th sample
  • y i is 1 when the sample label is a positive sample
  • y i is 0 when the sample label is a negative sample
  • w + represents the threshold corresponding to the positive sample
  • p i represents the probability value of the i-th sample belonging to the positive sample
  • w - represents the threshold corresponding to the negative sample.
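  • The formula itself appears as an image in the original publication and is not reproduced above; judging from the variable definitions, it is consistent with a weighted binary cross-entropy of the following form (a reconstruction that assumes w + and w - act as weighting coefficients for positive and negative samples):

```latex
L_{pn} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, w^{+}\, y_i \log(p_i) + w^{-}\,(1-y_i)\log(1-p_i) \Big]
```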
  • the loss function can include an initial loss function and a loss function corresponding to positive and negative samples.
  • the calculation method of the loss function corresponding to positive and negative samples can be found in the above introduction and will not be repeated here.
  • the initial loss function contains three terms: the first term represents the classification loss, the second term represents the IOU loss, and the third term represents the L1 loss.
  • y represents the Ground Truth coordinate frame
  • y pred represents the detection frame obtained by extracting image features
  • σ(i) represents the serial number of the detection frame corresponding to the coordinate frame with Ground Truth serial number i
  • p σ(i) (c i ) represents the classification probability of the detection frame corresponding to the Ground Truth
  • b i represents the coordinate position where the Ground Truth serial number is i, that is, [x1, y1, x2, y2]
  • b σ(i) is the coordinate of the detection frame that matches the Ground Truth
  • λ iou and λ 1 respectively represent the regression loss coefficients of the coordinate frame, and both can be set to 1 in this application
  • L iou represents the IOU loss
  • L 1 represents the L1 loss, which is the sum of the absolute values of the differences between the coordinates of the four points of the detection frame and the Ground Truth:
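  • These equations are likewise shown as images in the original publication; given the variable definitions above, the initial loss function and the L1 term are consistent with the standard DETR-style formulation below (a hedged reconstruction, not a verbatim copy of the patent's formulas):

```latex
L_{init} = \sum_{i}\Big[-\log p_{\sigma(i)}(c_i)
         + \lambda_{iou}\, L_{iou}\big(b_i, b_{\sigma(i)}\big)
         + \lambda_{1}\, \big\lVert b_i - b_{\sigma(i)} \big\rVert_{1}\Big],
\qquad
L_{1} = \big\lVert b_i - b_{\sigma(i)} \big\rVert_{1}
```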
  • this application uses a question-optimized target detection model, to which a language encoding module and a fusion module are added.
  • the network structure diagram of the fusion module is shown in Figure 6.
  • the fusion module includes two single-modal transformer models (intra-attention), a cross-modal transformer model (cross-transformer), a linear layer and a positive and negative sample discrimination module.
  • the linear layer can be connected to the FFN module of the target detection model.
  • the language encoding module will input the text features encoded by Roberta into the intra-transformer network module, and the decoding module will input the image features output by the DETR decoder into the intra-transformer network module. Then the fused features of the two modalities continue to pass through the cross-modal transformer (cross-transformer) network module, and finally the output fused features are input into the linear layer.
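  • A rough structural sketch of such a fusion module (single-modal self-attention for each modality, a cross-modal attention stage, then a linear layer and a positive/negative discrimination head) is given below; the dimensions, layer counts and class names are illustrative assumptions rather than the disclosed architecture:

```python
import torch
import torch.nn as nn

class FusionModuleSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # Single-modal ("intra") transformer layers, one per modality.
        self.intra_text = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.intra_image = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Cross-modal attention: image queries attend to text keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.linear = nn.Linear(d_model, d_model)
        # Positive/negative sample discrimination head (one logit per detection frame).
        self.pn_head = nn.Linear(d_model, 1)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_len, d_model), e.g. Roberta-encoded question features.
        # image_feats: (batch, num_queries, d_model), e.g. DETR decoder outputs.
        t = self.intra_text(text_feats)
        v = self.intra_image(image_feats)
        fused, _ = self.cross_attn(query=v, key=t, value=t)
        fused = self.linear(fused)
        return fused, self.pn_head(fused).squeeze(-1)  # fused features, pos/neg logits
```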
  • There are 100 query vectors preset in DETR which is equivalent to generating 100 detection frames.
  • each detection frame is given a positive-sample or negative-sample label based on the question-related coordinate frame.
  • the judgment criterion can be the IOU value between the detected coordinate frame and the question-related coordinate frame given in the Ground Truth of the GQA data set: if the IOU value of the two is greater than 0.5, the detection frame is determined to be a positive sample; otherwise it is determined to be a negative sample.
  • the method of adjusting model parameters based on the loss function is an existing relatively conventional method and will not be described again here.
  • the framework of the target detection model focuses on optimizing the process of extracting visual clues for target detection.
  • it can successfully detect the target detection frames that are directly related to the question or indirectly related to the reasoning, which can greatly reduce the redundant target detection frames found in traditional solutions; from the perspective of visual question and answer task performance, the visual clues are optimized, thereby greatly improving task performance.
  • the visual question and answer task processing solution provided by the embodiment of the present application can be easily applied to terminal equipment such as mobile phones and FPGA (Field-Programmable Gate Array) chips. Based on the functions that need to be implemented, it can be divided into an optimized visual clue module and a visual question and answer module.
  • the optimized visual clue module mainly consists of the backbone network, target detection module (including encoding module and decoding module) and MLP (Multilayer Perceptron, multi-layer perceptron) module (including fusion module and FFN module).
  • the backbone network uses the Swin Transformer structure
  • the target detection module uses the basic Transformer encoder and Transformer decoder modules
  • the MLP module is composed of a series of fully connected layers and matrix-vector operations. Because the Transformer and MLP networks consist entirely of matrix multiplication and addition operations, parallel acceleration can easily be performed on hardware devices.
  • the first text may be multiple question texts
  • the second text may be an answer text that matches each question text.
  • the trained target detection model can be used to perform parallel analysis on the image to be analyzed and multiple question texts to obtain the target detection frame corresponding to each question text.
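  • One simple way to realize such parallel analysis (a speculative sketch that assumes a fusion-style model taking (text features, image features), such as the FusionModuleSketch above; padding masks are omitted for brevity) is to repeat the image features once per question and run all questions as a single batch:

```python
import torch
import torch.nn.functional as F

def analyze_questions_in_parallel(model, image_feats, question_feats_list):
    # image_feats: (1, num_queries, d_model) for the single image to be analyzed.
    # question_feats_list: one (1, text_len_i, d_model) tensor per question text.
    max_len = max(q.size(1) for q in question_feats_list)
    # Pad every question to a common length so they can be stacked into one batch.
    padded = [F.pad(q, (0, 0, 0, max_len - q.size(1))) for q in question_feats_list]
    text_batch = torch.cat(padded, dim=0)                         # (B, max_len, d_model)
    image_batch = image_feats.expand(text_batch.size(0), -1, -1)  # reuse the image features
    fused, pn_logits = model(text_batch, image_batch)
    # One set of fused features / positive-negative scores per question text.
    return fused, pn_logits
```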
  • FIG. 7 is a schematic diagram of parallel processing of different visual question and answer tasks on the mobile phone provided by an embodiment of the present application.
  • Two models can be set up on the mobile phone, each model includes an optimized visual clue module and a visual Q&A module.
  • the function of the optimized visual clue module is: given a question and the entire image, output the partial image areas related to the question and the classifications of the corresponding areas. For example, given the question "Is the person happy" and the entire image, the output is the girl area and the dog area, together with the classifications "dog" and "girl";
  • the visual question and answer module takes the result obtained in the previous step together with the question as input, and infers the final answer "Yes". Similarly, given the question "What is the weather like" and the entire image, the optimized visual clue module outputs the sky area and the classification "sky";
  • the visual question and answer module then takes that result together with the question as input, and infers the final answer "Sunny".
  • Figure 8 is a schematic structural diagram of a visual question and answer task processing device provided by an embodiment of the present application, including a fusion unit 81, a screening unit 82 and an obtaining unit 83;
  • the fusion unit 81 is used to perform feature fusion processing on the image to be analyzed and the first text to obtain fusion features; wherein the fusion features include coordinate information of each detection frame;
  • the screening unit 82 is used to select target detection frames that meet the correlation requirements from the fusion features based on the correlation between the image to be analyzed and the first text;
  • the obtaining unit 83 is used to input the coordinate information, classification categories and semantic features corresponding to the target detection frame into the trained visual question and answer model to obtain a second text that matches the first text; wherein the first text and the second text have a logical correspondence.
  • the screening unit includes a calculation subunit and a selection subunit
  • the calculation subunit is used to calculate the intersection ratio of each image detection frame contained in the image feature of the image to be analyzed and the text detection frame corresponding to the text feature of the first text;
  • the selection subunit is used to select target detection frames whose intersection and union ratio are greater than a preset threshold from all image detection frames.
  • the screening unit is used to use the trained target detection model to select target detection frames that meet the correlation requirements from the fusion features; wherein the target detection model is trained based on historical images and historical texts.
  • the device includes a training unit, a discrimination unit, a calculation unit and an adjustment unit;
  • the training unit is used to train the initial detection model using the target detection data set to obtain the weight parameters corresponding to the initial detection model;
  • the discrimination unit is used to train the initial detection model to discriminate between positive and negative samples based on the sample labels corresponding to each sample in the target detection data set;
  • the calculation unit is used to calculate the loss function of the initial detection model after completing the positive and negative sample discrimination training; wherein the loss function includes the initial loss function and the loss function corresponding to the positive and negative samples;
  • the adjustment unit is used to adjust the respective initialization weights of the language encoding module and the fusion module included in the initial detection model and the corresponding weight parameters of the initial detection model based on the loss function of the initial detection model to obtain a trained target detection model.
  • the discrimination unit includes an identification subunit, a determination subunit and a parameter adjustment subunit;
  • the identification subunit is used to use the initial detection model to identify the probability value corresponding to each sample in the target detection data set;
  • the determination subunit is used to determine the loss function corresponding to the positive and negative samples based on the sample label and probability value corresponding to each sample in the target detection data set;
  • the parameter adjustment subunit is used to adjust the parameters corresponding to the fusion module in the initial detection model based on the loss function corresponding to positive and negative samples to complete the discrimination training of positive and negative samples.
  • the determination subunit is used to input the sample label and probability value corresponding to each sample in the target detection data set into the positive and negative sample loss function calculation formula to determine the loss function corresponding to the positive and negative samples; wherein the positive and negative sample loss function calculation formula is:
  • N represents the total number of samples
  • y i represents the value corresponding to the sample label of the i-th sample
  • y i is 1 when the sample label is a positive sample
  • y i is 0 when the sample label is a negative sample
  • w + represents the threshold corresponding to the positive sample
  • p i represents the probability value of the i-th sample belonging to the positive sample
  • w - represents the threshold corresponding to the negative sample.
  • the device includes a question and answer training unit;
  • the screening unit is also used to use the trained target detection model to screen out positive samples from the target detection data set;
  • the question and answer training unit is used to train the initial visual question and answer model using the coordinate information, classification categories and semantic features corresponding to the positive samples to obtain a trained visual question and answer model.
  • the fusion unit includes an extraction subunit, a coding subunit and a feature fusion subunit;
  • the extraction subunit is used to extract image features of the image to be analyzed by using the target detection module of the target detection model; wherein the image features include image features corresponding to multiple detection frames;
  • the encoding subunit is used to use the language encoding module of the target detection model to perform feature encoding on the first text to obtain text features;
  • the feature fusion subunit is used to use the fusion module of the target detection model to fuse image features and text features to obtain fusion features.
  • the first text is the question text
  • the second text is the answer text matching the question text.
  • the first text is a plurality of question texts
  • the second text is an answer text that matches each question text
  • the screening unit is used to use the trained target detection model to perform parallel analysis on the image to be analyzed and multiple question texts to obtain the target detection frame corresponding to each question text.
  • the fusion unit includes an extraction subunit, a coding subunit and a feature fusion subunit;
  • the extraction subunit is used to extract image features of the image to be analyzed; wherein the image features include image features corresponding to multiple detection frames;
  • the encoding subunit is used to encode features of the first text to obtain text features
  • the feature fusion subunit is used to fuse image features and text features to obtain fused features.
  • the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features, where the fusion features include the coordinate information of each detection frame;
  • each detection frame has its corresponding image information, and the number of detection frames corresponding to the fusion features is often large; the detection frames include both detection frames with strong correlation with the first text and detection frames with weak correlation with the first text;
  • therefore, the target detection frames that meet the correlation requirements can be selected from the fusion features based on the correlation between the image to be analyzed and the first text;
  • the coordinate information, classification categories and semantic features corresponding to the target detection frames are then input into the trained visual question and answer model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence.
  • Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application, including a display screen 91, an input interface 92, and a processor connected to the display screen 91 and the input interface 92 respectively; since the processor is built into the terminal device, the processor is not shown in Figure 9.
  • Input interface 92 used to receive the image to be analyzed and the first text
  • a processor configured to perform feature fusion processing on the image to be analyzed and the first text to obtain fusion features, where the fusion features include coordinate information of each detection frame; select, from the fusion features, target detection frames that meet the correlation requirements based on the correlation between the image to be analyzed and the first text; and input the coordinate information, classification categories and semantic features corresponding to the target detection frames into the trained visual question and answer model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence;
  • the display screen 91 is used to display the first text and its corresponding second text.
  • the input interface 92 can be used to connect with external devices such as USB flash drives. There can be multiple input interfaces.
  • Figure 9 takes one input interface as an example.
  • the user can input the image to be analyzed and the first text to the terminal device through the input keyboard, or write the image to be analyzed and the first text to a USB flash drive, and insert the USB flash drive into the input interface 92 of the terminal device.
  • after obtaining the image to be analyzed and the first text, the terminal device can transmit them to the processor; after analyzing the image to be analyzed and the first text, the processor can obtain a second text that matches the first text, and the terminal device can then display the second text on the display screen 91.
  • the display screen 91, the input interface 92, the processor and other functional modules included in the terminal device in Figure 9 are only examples; in actual applications, the terminal device may include more or fewer functional modules based on actual needs, and no limitation is imposed on this.
  • the image to be analyzed and the first text are subjected to feature fusion processing to obtain fusion features;
  • the fusion features include the coordinate information of each detection frame; each detection frame has its corresponding image information, and the number of detection frames corresponding to the fusion features is often large; the detection frames include both detection frames with a strong correlation with the first text and detection frames with only a weak correlation with the first text.
  • the target detection frames that meet the correlation requirements can be selected from the fusion features based on the correlation between the image to be analyzed and the first text; the coordinate information, classification categories and semantic features corresponding to the target detection frames are then input into the trained visual question answering model to obtain a second text that matches the first text; the first text and the second text have a logical correspondence.
  • if the processing method of the visual question answering task in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and executes all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, a magnetic disk or an optical disk.
  • embodiments of the present application also provide a non-volatile readable storage medium.
  • a computer program is stored on the non-volatile readable storage medium.
  • when the computer program is executed by a processor, the steps of the above processing method for a visual question answering task are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of image processing, and discloses a processing method, apparatus and device for a visual question answering task, and a non-volatile readable storage medium. Feature fusion processing is performed on an image to be analyzed and a first text to obtain fusion features; the fusion features include the coordinate information of each detection frame. Based on the correlation between the image to be analyzed and the first text, target detection frames that meet the correlation requirements are selected from the fusion features; the coordinate information, classification categories and semantic features corresponding to the target detection frames are input into a trained visual question answering model to obtain a second text that matches the first text, where the first text and the second text have a logical correspondence. By performing feature fusion processing on the image to be analyzed and the first text, a comprehensive analysis of the two can be achieved. Pruning the detection frames on the basis of correlation effectively reduces the interference caused by invalid detection frames, reduces the computational load of the visual question answering model, and improves the performance of the visual question answering task.

Description

一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质
相关申请的交叉引用
本申请要求于2022年9月2日提交中国专利局,申请号为202211068333.9,申请名称为“一种视觉问答任务的处理方法、装置、设备和介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,特别是涉及一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质。
背景技术
视觉问答指的是给定一张图像和一个与该图像有关的自然语言问题,计算机能产生一个正确的回答。视觉问答目前已经成为评估计算模型的推理能力和泛化能力的一个有趣测试。它涉及视觉识别、逻辑、算术、空间推理、直观物理、因果关系和多跳推理。它还需要结合两种不同性质的模式:图像和语言。高维的视觉模态涵盖很多无用信息,将注意力集中在与潜在推理问题最相关的信息上,这也需要识别关键区域或对象,并将他们及问题一同联系起来。
通常来说多模态理解任务中视觉起到了很重要的作用,给定一个问题,需要从视觉中找到线索,从而才能找到对应的答案。通常来说视觉线索来自于图片的语义特征:主要包含两种形式,一种是直接来自于图像分类网络;另一种则是来自于目标检测得到的坐标框。当前主流的多模态理解模型通常选择第二种。但是目前的实现方式并没有考虑到检测坐标框的质量及数量的权衡关系。
通常来说,使用分类置信度阈值可以限定坐标框的数量,但这样极大程度依赖于分类置信度的阈值。如果阈值太小,那么坐标框数量太多,存在很多冗余信息,这样无疑对后面的VQA(Visual Question Answer,视觉问答)模型增加了噪声;如果阈值太大,那么坐标框数量太少,可能会出现与问题直接或推理间接相关的坐标框被过滤掉。对于坐标框的质量,只有与问题直接或者间接相关的坐标框才能被称为是优质的坐标框。传统的目标检测根据分类置信度阈值提取的视觉线索往往存在较多冗余的目标框,导致视觉问答任务的性能较差。
可见,如何提升视觉问答任务的性能,是本领域技术人员需要解决的问题。
发明内容
本申请实施例的目的是提供一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质,可以提升视觉问答任务的性能。
为解决上述技术问题,本申请实施例提供一种视觉问答任务的处理方法,包括:
对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各 检测框的坐标信息;
依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;
将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。
可选地,依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框包括:
计算待分析图像的图像特征中包含的各图像检测框与第一文本的文本特征对应的文本检测框的交并比;
从所有图像检测框中选取出交并比大于预设阈值的目标检测框。
可选地,依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框包括:
利用训练好的目标检测模型从融合特征中筛选出满足相关性要求的目标检测框;其中,目标检测模型基于历史图像和历史文本训练得到。
可选地,针对于目标检测模型的训练过程,方法包括:
利用目标检测数据集训练初始检测模型,以得到初始检测模型对应的权重参数;
基于目标检测数据集中各样本对应的样本标签,对初始检测模型进行正负样本判别训练;
在完成正负样本判别训练后,计算初始检测模型的损失函数;其中,损失函数包括初始损失函数和正负样本对应的损失函数;
依据初始检测模型的损失函数,对初始检测模型中包含的语言编码模块和融合模块各自的初始化权重以及初始检测模型对应的权重参数进行调整,得到训练好的目标检测模型。
可选地,基于目标检测数据集中各样本对应的样本标签,对初始检测模型进行正负样本判别训练包括:
利用初始检测模型识别目标检测数据集中各样本对应的概率值;
依据目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数;
基于正负样本对应的损失函数,调整初始检测模型中融合模块对应的参数,以完成正负样本判别训练。
可选地,依据目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数包括:
将目标检测数据集中各样本对应的样本标签以及概率值输入至正负样本损失函数计算公式,以确定出正负样本对应的损失函数;其中,正负样本损失函数计算公式为:
$$L_{\pm} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\Big]$$
其中,N表示样本总个数,y i表示第i个样本的样本标签对应的数值,样本标签为正样本时y i=1,样本标签为负样本时y i=0,w +表示正样本对应的阈值,p i表示第i个样本属于正样本的概率值,w -表示负样本对应的阈值。
可选地,针对于视觉问答模型的训练过程,方法包括:
利用训练好的目标检测模型从目标检测数据集中筛选出正样本;
利用正样本对应的坐标信息、分类类别和语义特征对初始视觉问答模型进行训练,以得到训练好的视觉问答模型。
可选地,对待分析图像和第一文本进行特征融合处理,得到融合特征包括:
利用目标检测模型的目标检测模块提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
利用目标检测模型的语言编码模块对第一文本进行特征编码,得到文本特征;
利用目标检测模型的融合模块将图像特征与文本特征进行融合,得到融合特征。
可选地,第一文本为问题文本;第二文本为与问题文本匹配的答案文本。
可选地,第一文本为多个问题文本,第二文本为与各问题文本各自匹配的答案文本;
相应的,依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框包括:
利用训练好的目标检测模型对待分析图像以及多个问题文本进行并行分析,以得到各问题文本各自对应的目标检测框。
可选地,对待分析图像和第一文本进行特征融合处理,得到融合特征包括:
提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
对第一文本进行特征编码,得到文本特征;
将图像特征与文本特征进行融合,得到融合特征。
本申请实施例还提供了一种视觉问答任务的处理装置,包括融合单元、筛选单元和得到单元;
融合单元,用于对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;
筛选单元,用于依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;
得到单元,用于将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。
可选地,筛选单元包括计算子单元和选取子单元;
计算子单元,用于计算待分析图像的图像特征中包含的各图像检测框与第一文本的文本特征对应的文本检测框的交并比;
选取子单元,用于从所有图像检测框中选取出交并比大于预设阈值的目标检测框。
可选地,筛选单元用于利用训练好的目标检测模型从融合特征中筛选出满足相关性要求的目标检测框;其中,目标检测模型基于历史图像和历史文本训练得到。
可选地,针对于目标检测模型的训练过程,装置包括训练单元、判别单元、计算单元和调整单元;
训练单元,用于利用目标检测数据集训练初始检测模型,以得到初始检测模型对应的权重参数;
判别单元,用于基于目标检测数据集中各样本对应的样本标签,对初始检测模型进行正负样本判别训练;
计算单元,用于在完成正负样本判别训练后,计算初始检测模型的损失函数;其中,损失函数包括初始损失函数和正负样本对应的损失函数;
调整单元,用于依据初始检测模型的损失函数,对初始检测模型中包含的语言编码模 块和融合模块各自的初始化权重以及初始检测模型对应的权重参数进行调整,得到训练好的目标检测模型。
可选地,判别单元包括识别子单元、确定子单元和参数调整子单元;
识别子单元,用于利用初始检测模型识别目标检测数据集中各样本对应的概率值;
确定子单元,用于依据目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数;
参数调整子单元,用于基于正负样本对应的损失函数,调整初始检测模型中融合模块对应的参数,以完成正负样本判别训练。
可选地,确定子单元用于将目标检测数据集中各样本对应的样本标签以及概率值输入至正负样本损失函数计算公式,以确定出正负样本对应的损失函数;其中,正负样本损失函数计算公式为:
$$L_{\pm} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\Big]$$
其中,N表示样本总个数,y i表示第i个样本的样本标签对应的数值,样本标签为正样本时y i=1,样本标签为负样本时y i=0,w +表示正样本对应的阈值,p i表示第i个样本属于正样本的概率值,w -表示负样本对应的阈值。
可选地,针对于视觉问答模型的训练过程,装置包括问答训练单元;
筛选单元还用于利用训练好的目标检测模型从目标检测数据集中筛选出正样本;
问答训练单元,用于利用正样本对应的坐标信息、分类类别和语义特征对初始视觉问答模型进行训练,以得到训练好的视觉问答模型。
可选地,融合单元包括提取子单元、编码子单元和特征融合子单元;
提取子单元,用于利用目标检测模型的目标检测模块提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
编码子单元,用于利用目标检测模型的语言编码模块对第一文本进行特征编码,得到文本特征;
特征融合子单元,用于利用目标检测模型的融合模块将图像特征与文本特征进行融合,得到融合特征。
可选地,第一文本为问题文本;第二文本为与问题文本匹配的答案文本。
可选地,第一文本为多个问题文本,第二文本为与各问题文本各自匹配的答案文本;
相应的,筛选单元用于利用训练好的目标检测模型对待分析图像以及多个问题文本进行并行分析,以得到各问题文本各自对应的目标检测框。
可选地,融合单元包括提取子单元、编码子单元和特征融合子单元;
提取子单元,用于提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
编码子单元,用于对第一文本进行特征编码,得到文本特征;
特征融合子单元,用于将图像特征与文本特征进行融合,得到融合特征。
本申请实施例还提供了一种终端设备,包括显示屏,输入接口,以及分别与显示屏、输入接口连接的处理器;
输入接口,用于接收待分析图像和第一文本;
处理器,用于对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融 合特征包含各检测框的坐标信息;依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系;
显示屏,用于展示第一文本及其对应的第二文本。
本申请实施例还提供了一种电子设备,包括:
存储器,用于存储计算机程序;
处理器,用于执行计算机程序以实现如上述视觉问答任务的处理方法的步骤。
本申请实施例还提供了一种非易失性可读存储介质,非易失性可读存储介质上存储有计算机程序,计算机程序被处理器执行时实现如上述视觉问答任务的处理方法的步骤。
由上述技术方案可以看出,对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;每个检测框有其对应的图像信息,融合特征中所对应的检测框数量往往较多,检测框中既包含与第一文本具有较强相关性的检测框,也包含与第一文本具有较弱相关性的检测框。为了能够删除相关性较弱的检测框,可以依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。在该技术方案中,通过对待分析图像和第一文本进行特征融合处理,可以实现对待分析图像和第一文本的综合分析。基于相关性对检测框进行删减,有效的降低了无效检测框造成的干扰,减少了视觉问答模型的计算量,提升了视觉问答任务的性能。
附图说明
为了更清楚地说明本申请实施例,下面将对实施例中所需要使用的附图做简单的介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种视觉问答任务的处理方法所适用的硬件组成框架示意图;
图2为本申请实施例提供的另一种视觉问答任务的处理方法所适用的硬件组成框架示意图;
图3为本申请实施例提供的一种视觉问答任务的处理方法的流程图;
图4为本申请实施例提供的一种目标检测模型的网络结构图;
图5为本申请实施例提供的一种目标检测模型的训练方法的流程图;
图6为本申请实施例提供的一种融合模块网络结构图;
图7为本申请实施例提供的一种在手机端并行处理不同的视觉问答任务的示意图;
图8为本申请实施例提供的一种视觉问答任务的处理装置的结构示意图;
图9为本申请实施例提供的一种终端设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部实施例。基于本申 请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下,所获得的所有其他实施例,都属于本申请保护范围。
本申请的说明书和权利要求书及上述附图中的术语“包括”和“具有”以及他们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可包括没有列出的步骤或单元。
为了使本技术领域的人员更好地理解本申请方案,下面结合附图和具体实施方式对本申请作进一步的详细说明。
为了便于理解,先对本申请实施例提供的视觉问答任务的处理方法对应的方案所使用的硬件组成框架进行介绍。请参考图1,图1为本申请实施例提供的一种视觉问答任务的处理方法所适用的硬件组成框架示意图。其中电子设备100可以包括处理器101和存储器102,还可以进一步包括多媒体组件103、信息输入/信息输出(I/O)接口104以及通信组件105中的一种或多种。
其中,处理器101用于控制电子设备100的整体操作,以完成视觉问答任务的处理方法中的全部或部分步骤;存储器102用于存储各种类型的数据以支持在电子设备100的操作,这些数据例如可以包括用于在该电子设备100上操作的任何应用程序或方法的指令,以及应用程序相关的数据。该存储器102可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,例如静态随机存取存储器(Static Random Access Memory,SRAM)、电可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、可擦除可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)、可编程只读存储器(Programmable Read-Only Memory,PROM)、只读存储器(Read-Only Memory,ROM)、磁存储器、快闪存储器、磁盘或光盘中的一种或多种。在本实施例中,存储器102中至少存储有用于实现以下功能的程序和/或数据:
对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;
依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;
将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。
多媒体组件103可以包括屏幕和音频组件。其中屏幕例如可以是触摸屏,音频组件用于输出和/或输入音频信号。例如,音频组件可以包括一个麦克风,麦克风用于接收外部音频信号。所接收的音频信号可以被进一步存储在存储器102或通过通信组件105发送。音频组件还包括至少一个扬声器,用于输出音频信号。信息输入/信息输出(I/O)接口104为处理器101和其他接口模块之间提供接口,上述其他接口模块可以是键盘,鼠标,按钮等。这些按钮可以是虚拟按钮或者实体按钮。通信组件105用于电子设备100与其他设备之间进行有线或无线通信。无线通信,例如Wi-Fi,蓝牙,近场通信(Near Field Communication,简称NFC),2G、3G或4G,或它们中的一种或几种的组合,因此相应的该通信组件105可以包括:Wi-Fi部件,蓝牙部件,NFC部件。
电子设备100可以被一个或多个应用专用集成电路(Application Specific Integrated Circuit,简称ASIC)、数字信号处理器(Digital Signal Processor,简称DSP)、数字信号处理设备(Digital Signal Processing Device,简称DSPD)、可编程逻辑器件(Programmable Logic Device,简称PLD)、现场可编程门阵列(Field Programmable Gate Array,简称FPGA)、控 制器、微控制器、微处理器或其他电子元件实现,用于执行视觉问答任务的处理方法。
当然,图1所示的电子设备100的结构并不构成对本申请实施例中电子设备的限定,在实际应用中电子设备100可以包括比图1所示的更多或更少的部件,或者组合某些部件。
可以理解的是,本申请实施例中并不对电子设备的数量进行限定,其可以是多个电子设备共同协作完成视觉问答任务的处理方法。在一种可能的实施方式中,请参考图2,图2为本申请实施例提供的另一种视觉问答任务的处理方法所适用的硬件组成框架示意图。由图2可知,该硬件组成框架可以包括:第一电子设备11和第二电子设备12,二者之间通过网络13连接。
在本申请实施例中,第一电子设备11与第二电子设备12的硬件结构可以参考图1中电子设备100。即可以理解为本实施例中具有两个电子设备100,两者进行数据交互。进一步,本申请实施例中并不对网络13的形式进行限定,即,网络13可以是无线网络(如WIFI、蓝牙等),也可以是有线网络。
其中,第一电子设备11和第二电子设备12可以是同一种电子设备,如第一电子设备11和第二电子设备12均为服务器;也可以是不同类型的电子设备,例如,第一电子设备11可以是智能手机或其它智能终端,第二电子设备12可以是服务器。在一种可能的实施方式中,可以利用计算能力强的服务器作为第二电子设备12来提高数据处理效率及可靠性,进而提高模型训练和/或视觉问答的处理效率。同时利用成本低,应用范围广的智能手机作为第一电子设备11,用于实现第二电子设备12与用户之间的交互。可以理解的是,该交互过程可以为:第一电子设备11将待分析图像和第一文本传输至第二电子设备12,第二电子设备12对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本,从而将第二文本反馈至第一电子设备11。
接下来,详细介绍本申请实施例所提供的一种视觉问答任务的处理方法。图3为本申请实施例提供的一种视觉问答任务的处理方法的流程图,该方法包括:
S301:对待分析图像和第一文本进行特征融合处理,得到融合特征。
传统方式中,通常都是直接用目标检测预训练权重对数据集的所有图片进行推理,然后每张图片计算检测框、分类、分类置信度以及提取的坐标框语义特征。然后通过分类置信度设置阈值或者设定每张图像输出的检测框数量来选择相应的检测框。
检测框指的是图片中目标物所在的位置区域。其中,目标物可以是与文本相关联的人或物,也可以是与文本非关联的人或物。例如,一张图片中包含一个女孩、一条狗、一片天空,女孩、狗、天空均可以作为目标物,目标物对应的检测框可以包括女孩所在位置区域、狗所在的位置区域、天空所在的位置区域。
传统方式中产生的检测框数量往往较多,按照设置的阈值或设定的数量选择检测框的方式,并不能很好的选择出与文本具有强相关性的检测框,导致后续视觉问答模型生成的与文本对应的答案也并不合适。
因此在本申请实施例中,为了提升视觉问答任务的性能,可以对待分析图像和第一文本进行特征融合处理,得到融合特征,以便于依据融合特征对检测框进行筛选,删除与文本相关性较弱的检测框。
在实际应用中,可以提取待分析图像的图像特征;其中,图像特征包括多个检测框各 自对应的图像特征。对第一文本进行特征编码,得到文本特征;将图像特征与文本特征进行融合,得到融合特征。融合特征包含各检测框的坐标信息。
在本申请实施例中,待分析图像可以为任意一幅图片,第一文本可以是针对于待分析图像所提出的问题。例如,图片中包含一个女孩和一条狗在海滩上坐着玩,第一文本可以是“Where is the women sitting(女生坐在哪)”。
S302:依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框。
融合特征可以基于待分析图像的图像特征和第一文本的文本特征得到。
图像特征和文本特征均可以以检测框的形式呈现。对于待分析图像与第一文本的相关性,可以基于检测框之间的IOU值(Intersection Over Union,交并比)进行评估。
在具体实现中,可以计算图像特征中包含的各图像检测框与文本特征对应的文本检测框的交并比;从所有图像检测框中选取出交并比大于预设阈值的目标检测框。
预设阈值的取值可以根据实际需求灵活设置,例如可以设置为0.5。各图像检测框的处理方式类似,以一个图像检测框为例,可以计算图像检测框与文本检测框的IOU值。IOU值大于0.5,说明该图像检测框与文本检测框具有较强的相关性,属于正样本,此时可以将该图像检测框作为目标检测框,参与后续的分析流程。
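A minimal Python sketch of the IoU-based screening described above. The (x1, y1, x2, y2) box format, the helper names and the 0.5 default are illustrative assumptions, not the patented implementation.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_target_boxes(image_boxes, question_box, threshold=0.5):
    # Keep only the image detection boxes whose IoU with the
    # question-related box exceeds the preset threshold.
    return [box for box in image_boxes if iou(box, question_box) > threshold]
```

Boxes that survive the filter are the target detection frames passed on to the later steps.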
S303:将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本。
其中,第一文本与第二文本具有逻辑对应关系。例如,第一文本可以是问题文本,第二文本可以是答案文本。
目标检测框可以为一个或多个,目标检测框的数量小于图像特征中包含的图像检测框的数量。
在筛选出目标检测框后,可以通过前向传播网络(Feed-Forward Network,FFN)模块提取出目标检测框所对应的坐标信息、分类类别和语义特征。将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,可以得到与第一文本匹配的第二文本。
视觉问答模型可以采用VINVL(Visual representations in Vision-Language Models,视觉语言模型中的视觉表示)或者LXMERT(Learning Cross-Modality Encoder Representations from Transformers,从变压器学习跨模态编码器表示)。
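Putting S301–S303 together, a hedged sketch of the inference flow. `detector`, `vqa_model` and their attributes and methods are hypothetical stand-ins for the trained target detection model and the VQA model (for example VINVL or LXMERT); they are not real library APIs.

```python
def answer_question(detector, vqa_model, image, question, threshold=0.5):
    # S301: fuse image features and question features inside the detector.
    fused = detector.fuse(image, question)                  # hypothetical method
    # S302: keep only detection frames sufficiently relevant to the question.
    kept = [box for box in fused.boxes if box.relevance > threshold]
    # S303: feed coordinates, classes and semantic features to the VQA model.
    coords = [box.xyxy for box in kept]
    labels = [box.label for box in kept]
    feats = [box.semantic_feature for box in kept]
    return vqa_model.predict(question, coords, labels, feats)   # the second text
```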
由上述技术方案可以看出,对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;每个检测框有其对应的图像信息,融合特征中所对应的检测框数量往往较多,检测框中既包含与第一文本具有较强相关性的检测框,也包含与第一文本具有较弱相关性的检测框。为了能够删除相关性较弱的检测框,可以依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。在该技术方案中,通过对待分析图像和第一文本进行特征融合处理,可以实现对待分析图像和第一文本的综合分析。基于相关性对检测框进行删减,有效的降低了无效检测框造成的干扰,减少了视觉问答模型的计算量,提升了视觉问答任务的性能。
在本申请实施例中,可以采用目标检测模型和视觉问答模型相结合的方式,实现视觉问答任务的处理。目标检测模型可以对待分析图像和第一文本进行分析,从而筛选出满足相关性要求的目标检测框,并且提取出目标检测框对应的坐标信息、分类类别和语义特征。 其中,目标检测模型可以基于历史图像和历史文本训练得到。
在实际应用中,可以利用训练好的目标检测模型对待分析图像和第一文本进行特征融合处理,从而得到融合特征,并从融合特征中筛选出满足相关性要求的目标检测框。
本申请实施例采用的目标检测模型所依赖的基础模型可以是DETR(DEtection TRansformer,基于transformer的目标检测网络),该模型利用近来大火的transformer结构,将目标检测变为一个检测框与标准框(Ground Truth)的二分匹配问题。
图4为本申请实施例提供的一种目标检测模型的网络结构图,目标检测模型包括骨干网络、编码模块(Transformer encoder)、解码模块(Transformer decoder)、融合模块、前向传播网络模块(Feed-Forward Network,FFN)以及语言编码模块(Roberta)。其中,骨干网络、编码模块和解码模块可以实现图像特征的提取。语言编码模块可以提取文本特征。融合模块可以实现对图像特征和文本特征的融合,从而筛选出目标检测框。前向传播网络模块可以提取出目标检测框的坐标信息、分类类别和语义特征。
在本申请实施例中,可以将骨干网络、编码模块和解码模块作为目标检测模块。利用目标检测模型的目标检测模块提取待分析图像的图像特征;其中,图像特征可以包括多个检测框各自对应的图像特征。利用目标检测模型的语言编码模块对第一文本进行特征编码,得到文本特征;利用目标检测模型的融合模块将图像特征与文本特征进行融合,得到融合特征。
训练好的目标检测模型才能够用于实现对待分析图像和第一文本的分析处理。目标检测模型的训练是目标检测模型进行视觉问答任务处理的基础前提,图5为本申请实施例提供的一种目标检测模型的训练方法的流程图,该方法包括:
S501:利用目标检测数据集训练初始检测模型,以得到初始检测模型对应的权重参数。
目标检测数据集可以包括COCO(Common Obiects in Context,图像识别、分割和图像语义数据集)数据集、Visual Genome(视觉基因)数据集、Obiects365数据集等。
在模型训练阶段,图片首先通过骨干网络即CNN(Convolutional Neural Networks,卷积神经网络)提取特征,同时加入位置编码特征,位置编码特征是根据图像的分辨率自适应获得的,其意义是得到图像特征图的局部位置信息。采用transformer的encoder来编码图像特征,设置可学习的初始化嵌入参数query,从编码图像特征中解码出对应目标位置及分类。这些query相当于自适应anchor(目标检测预定义锚点)信息,通过解码器解码出对应物体的检测位置及相应类别。训练过程中引入了Bipartite Matching(二分图最大匹配)来完成Ground Truth坐标框同检测框的匹配。匹配策略如下:
$$\hat{\sigma} = \underset{\sigma}{\arg\min}\ \sum_{i=1}^{N} L_{match}\big(y_i,\ y^{pred}_{\sigma(i)}\big)$$
其中,y i表示Ground Truth坐标框,y i pred表示检测框,这里是利用了匈牙利匹配算法进行检测框及坐标框的匹配。argmin表示使
$$\sum_{i=1}^{N} L_{match}\big(y_i,\ y^{pred}_{\sigma(i)}\big)$$
达到最小值时y i和y i pred的取值。L match表示检测框和坐标框的匹配度。
假设,图片中有N(N<100)个物体,那么从100个query中经过匈牙利匹配算法后只有N个检测框与Ground Truth坐标框相对应,这样就不需要有传统目标检测框架中的NMS 去除重复框的操作了。
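The bipartite matching step can be illustrated with `scipy.optimize.linear_sum_assignment`, SciPy's Hungarian-algorithm routine; the simple L1 box cost below is only a placeholder for the patent's L_match.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_gt(pred_boxes, gt_boxes):
    """pred_boxes: (Q, 4) predicted boxes (e.g. Q = 100 queries);
    gt_boxes: (N, 4) ground-truth boxes with N < Q.
    Returns index pairs (pred_idx, gt_idx) that minimise the total cost,
    so each ground-truth box is matched to exactly one query."""
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (Q, N) L1 cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

With N < Q the assignment leaves the remaining Q − N queries unmatched, which is why no NMS step is needed afterwards.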
S502:基于目标检测数据集中各样本对应的样本标签,对初始检测模型进行正负样本判别训练。
以依据图像特征生成与问题对应的答案为例,正样本可以是与问题具有较强相关性的图像特征,负样本可以是与问题相关性较弱的图像特征。
在具体实现中,可以利用初始检测模型识别目标检测数据集中各样本对应的概率值。概率值越高,说明样本包含的图像特征与问题具有的相关性越强。
目标检测数据集中样本可以是目标检测数据集中各图片所对应的检测框,每个检测框有其对应的图像特征。
依据目标检测数据集中各样本对应的样本标签以及概率值,可以确定出正负样本对应的损失函数;基于正负样本对应的损失函数,调整初始检测模型中融合模块对应的参数,以完成正负样本判别训练。
对于正负样本损失函数的确定,可以设置正负样本损失函数计算公式,将目标检测数据集中各样本对应的样本标签以及概率值输入至正负样本损失函数计算公式,以确定出正负样本对应的损失函数;其中,正负样本损失函数计算公式为:
$$L_{\pm} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\Big]$$
其中,N表示样本总个数,y i表示第i个样本的样本标签对应的数值,样本标签为正样本时y i=1,样本标签为负样本时y i=0,w +表示正样本对应的阈值,p i表示第i个样本属于正样本的概率值,w -表示负样本对应的阈值。
考虑到实际应用中,正负样本比例会存在不均衡的问题,正样本比例往往较小,因此可以令阈值w +=40,w -=1。
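Assuming the positive/negative discrimination loss takes the weighted binary cross-entropy form reconstructed above (an assumption based on the symbol definitions, not a verbatim quotation of the patent), a PyTorch sketch with w+ = 40 and w− = 1 looks as follows.

```python
import torch

def pos_neg_loss(p, y, w_pos=40.0, w_neg=1.0, eps=1e-7):
    # p: probability that each sample is positive, shape (N,)
    # y: sample labels, 1.0 for positive and 0.0 for negative, shape (N,)
    p = p.clamp(eps, 1.0 - eps)          # avoid log(0)
    per_sample = -(w_pos * y * torch.log(p) + w_neg * (1.0 - y) * torch.log(1.0 - p))
    return per_sample.mean()             # average over the N samples
```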
S503:在完成正负样本判别训练后,计算初始检测模型的损失函数。
其中,损失函数可以包括初始损失函数和正负样本对应的损失函数。正负样本对应的损失函数的计算方式可以参见上述介绍,在此不再赘述。
初始损失函数的计算公式如下:
$$L_{init}\big(y, y^{pred}\big) = \sum_{i=1}^{N}\Big[-\log p_{\sigma(i)}(c_i) + \lambda_{iou}\,L_{iou}\big(b_i, b_{\sigma(i)}\big) + \lambda_{1}\,L_{1}\big(b_i, b_{\sigma(i)}\big)\Big]$$
其中,初始损失函数包含三项,第一项表示分类损失,第二表示IOU损失、第三项表示L 1损失。y表示ground truth坐标框,y pred表示通过提取图像特征得到的检测框,σ i表示ground truth序号为i的坐标框对应的检测框中的序号。p σ(i)(c i)表示与ground truth对应的检测框的分类概率。b i表示ground truth序列为i的坐标位置,即[x1,y1,x2,y2]。同理,b σ(i)为与ground truth匹配的检测框的坐标。λ iou及λ 1分别表示坐标框的回归损失系数,本申请 中可以均设置为1。L iou表示IOU损失,L 1表示L 1损失。
L iou的计算公式如下:
$$L_{iou}\big(b_i, b_{\sigma(i)}\big) = 1 - \frac{\big|b_i \cap b_{\sigma(i)}\big|}{\big|b_i \cup b_{\sigma(i)}\big|}$$
L 1的计算公式如下,即为检测框与Ground Truth四个点的坐标的绝对值之和:
$$L_{1}\big(b_i, b_{\sigma(i)}\big) = \big|b_i - b_{\sigma(i)}\big|$$
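A small sketch of the two box-regression terms for one matched pair of boxes, following the definitions above; treat the exact IoU-loss form as a reconstruction rather than the authoritative formula.

```python
def iou_loss(pred, gt):
    # pred, gt: (x1, y1, x2, y2); returns 1 - IoU of the two boxes.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return 1.0 - inter / union if union > 0 else 1.0


def l1_loss(pred, gt):
    # Sum of absolute differences over the four box coordinates.
    return sum(abs(p - g) for p, g in zip(pred, gt))
```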
S504:依据初始检测模型的损失函数,对初始检测模型中包含的语言编码模块和融合模块各自的初始化权重以及初始检测模型对应的权重参数进行调整,得到训练好的目标检测模型。
相比于传统的目标检测模型,本申请基于问题的优化目标检测模型,添加了语言编码模块及融合模块。语言编码模块可以采用Roberta-base预训练权重,对问题产生编码特征q={q 0,q 1,...,q Nw-1},特征维度为768。
融合模块网络结构图如图6所示,融合模块包含有两个单一模态变压器模型(intra-attention),一个跨模态变压器模型(cross-transformer),一个线性层和一个正负样本判别模块。其中,线性层可以和目标检测模型的FFN模块连接。
在实际应用中,语言编码模块会将Roberta编码的文本特征输入intra-transformer网络模块中,解码模块会将DETR的解码器输出的图像特征输入到intra-transformer网络模块中。然后将两个模态融合的特征继续通过跨模态transformer(cross-transformer)网络模块中,最终将输出的融合特征输入线性层。DETR中预设了100个query向量,相当于会产生100个检测框。这里根据问题定位相关的坐标框给定每个检测框正样本或是负样本标签。判定准则可以是根据检测得到坐标框与GQA数据集中Ground Truth中给定的和问题相关的坐标框的IOU值。如果两者的IOU值大于0.5,则判定为正样本,否则判定为负样本。这里首先进行一个正负样本判别的训练,随着训练次数的增加,再逐渐添加FFN模块进行坐标框的分类及相关位置坐标回归的优化。
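A compressed PyTorch-style sketch of the fusion module's shape: two intra-modal transformer layers, one cross-modal attention layer, a linear layer feeding the FFN head, and a positive/negative discrimination head. The use of nn.TransformerEncoderLayer / nn.MultiheadAttention and the head count are illustrative assumptions; the 768-dimensional features and the 100 DETR queries follow the description above.

```python
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Intra-modal transformers for Roberta text features and DETR decoder outputs.
        self.text_intra = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.image_intra = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Cross-modal attention: the 100 query features attend to the text tokens.
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.linear = nn.Linear(dim, dim)       # fused features, passed on to the FFN module
        self.pos_neg_head = nn.Linear(dim, 1)   # positive/negative sample discriminator

    def forward(self, image_feats, text_feats):
        t = self.text_intra(text_feats)         # (B, Nw, 768)
        v = self.image_intra(image_feats)       # (B, 100, 768)
        fused, _ = self.cross(query=v, key=t, value=t)
        fused = self.linear(fused)
        pos_neg_logits = self.pos_neg_head(fused).squeeze(-1)   # one score per detection frame
        return fused, pos_neg_logits
```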
基于损失函数调整模型参数的方式属于现有较为常规的方式,在此不再赘述。
在得到训练好的目标检测模型之后,可以利用训练好的目标检测模型从目标检测数据集中筛选出正样本;利用正样本对应的坐标信息、分类类别和语义特征对初始视觉问答模型进行训练,以得到训练好的视觉问答模型。
本申请实施例提供的目标检测模型的框架着重优化目标检测提取视觉线索的流程,将问题输入到目标检测模型中,能够成功检测出和问题直接相关或者间接推理相关的目标检测框,能够极大地删除传统方案中多余的目标检测框;从视觉问答任务性能上来看,优化了视觉线索,从而极大地提升了任务性能。
本申请实施例提供的视觉问答任务的处理方案可以很便利地应用到手机、FPGA(Field-Programmable Gate Array,现场可编程门阵列)芯片等终端设备中。基于所需实现 的功能,可以划分为优化视觉线索模块和视觉问答模块。其中,优化视觉线索模块主要由骨干网络、目标检测模块(包括编码模块和解码模块)及MLP(Multilayer Perceptron,多层感知器)模块(包括融合模块和FFN模块)组成。
骨干网络采用的是Swin Transformer结构,目标检测模块采用的是基础的Transformer encoder及Transformer decoder模块,MLP模块则是由一系列全连接及矩阵向量操作组成。因为Transformer和MLP网络中全部为矩阵的乘加操作,在硬件设备上可以很方便地进行并行加速处理。
因此在实际应用中,第一文本可以为多个问题文本,相应的,第二文本为与各问题文本各自匹配的答案文本。在具体实现中,可以利用训练好的目标检测模型对待分析图像以及多个问题文本进行并行分析,以得到各问题文本各自对应的目标检测框。
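Because these stages are pure matrix multiply-add operations, several question texts can be analysed against one image as a single batch; a hedged sketch, with `detector` again a hypothetical callable and the encoded questions assumed to be padded to a common length.

```python
import torch

def analyze_questions_in_parallel(detector, image_feats, question_feats):
    # image_feats: (L, dim) features of the single image to be analysed.
    # question_feats: list of (Nw, dim) encoded question texts, padded to the same Nw.
    questions = torch.stack(question_feats)                            # (Q, Nw, dim)
    images = image_feats.unsqueeze(0).expand(len(question_feats), -1, -1)
    # One batched forward pass returns the target detection frames per question.
    return detector(images, questions)
```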
以手机为例,图7为本申请实施例提供的一种在手机端并行处理不同的视觉问答任务的示意图,手机上可以设置两个模型,每个模型均包含一个优化视觉线索模块和一个视觉问答模块。两个模型均不存在卷积操作,因此可以并行推理。其中,优化视觉线索模块的作用在于给定问题和整张图像,输出和问题相关的部分图像区域及对应区域的分类;如给定“Is the person happy”这个问题和整张图像,输出为女孩区域及狗的区域,并输出dog和girl。视觉问答模块将上一步得到的结果和问题一起作为输入,推理出最终的答案“Yes”。如给定“What is the weather like”这个问题和整张图像,输出为天空区域,并输出Sky。视觉问答模块将上一步得到的结果和问题一起作为输入,推理出最终的答案“Sunny”。
通过在终端设备上部署多个优化视觉线索模块和多个视觉问答模块,可以实现对多个视觉问答任务的并行处理,极大的提升了视觉问答任务的处理效率,并且可以充分发挥终端设备的性能。
图8为本申请实施例提供的一种视觉问答任务的处理装置的结构示意图,包括融合单元81、筛选单元82和得到单元83;
融合单元81,用于对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;
筛选单元82,用于依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;
得到单元83,用于将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。
可选地,筛选单元包括计算子单元和选取子单元;
计算子单元,用于计算待分析图像的图像特征中包含的各图像检测框与第一文本的文本特征对应的文本检测框的交并比;
选取子单元,用于从所有图像检测框中选取出交并比大于预设阈值的目标检测框。
可选地,筛选单元用于利用训练好的目标检测模型从融合特征中筛选出满足相关性要求的目标检测框;其中,目标检测模型基于历史图像和历史文本训练得到。
可选地,针对于目标检测模型的训练过程,装置包括训练单元、判别单元、计算单元和调整单元;
训练单元,用于利用目标检测数据集训练初始检测模型,以得到初始检测模型对应的权重参数;
判别单元,用于基于目标检测数据集中各样本对应的样本标签,对初始检测模型进行 正负样本判别训练;
计算单元,用于在完成正负样本判别训练后,计算初始检测模型的损失函数;其中,损失函数包括初始损失函数和正负样本对应的损失函数;
调整单元,用于依据初始检测模型的损失函数,对初始检测模型中包含的语言编码模块和融合模块各自的初始化权重以及初始检测模型对应的权重参数进行调整,得到训练好的目标检测模型。
可选地,判别单元包括识别子单元、确定子单元和参数调整子单元;
识别子单元,用于利用初始检测模型识别目标检测数据集中各样本对应的概率值;
确定子单元,用于依据目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数;
参数调整子单元,用于基于正负样本对应的损失函数,调整初始检测模型中融合模块对应的参数,以完成正负样本判别训练。
可选地,确定子单元用于将目标检测数据集中各样本对应的样本标签以及概率值输入至正负样本损失函数计算公式,以确定出正负样本对应的损失函数;其中,正负样本损失函数计算公式为:
$$L_{\pm} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\Big]$$
其中,N表示样本总个数,y i表示第i个样本的样本标签对应的数值,样本标签为正样本时y i=1,样本标签为负样本时y i=0,w +表示正样本对应的阈值,p i表示第i个样本属于正样本的概率值,w -表示负样本对应的阈值。
可选地,针对于视觉问答模型的训练过程,装置包括问答训练单元;
筛选单元还用于利用训练好的目标检测模型从目标检测数据集中筛选出正样本;
问答训练单元,用于利用正样本对应的坐标信息、分类类别和语义特征对初始视觉问答模型进行训练,以得到训练好的视觉问答模型。
可选地,融合单元包括提取子单元、编码子单元和特征融合子单元;
提取子单元,用于利用目标检测模型的目标检测模块提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
编码子单元,用于利用目标检测模型的语言编码模块对第一文本进行特征编码,得到文本特征;
特征融合子单元,用于利用目标检测模型的融合模块将图像特征与文本特征进行融合,得到融合特征。
可选地,第一文本为问题文本;第二文本为与问题文本匹配的答案文本。
可选地,第一文本为多个问题文本,第二文本为与各问题文本各自匹配的答案文本;
相应的,筛选单元用于利用训练好的目标检测模型对待分析图像以及多个问题文本进行并行分析,以得到各问题文本各自对应的目标检测框。
可选地,融合单元包括提取子单元、编码子单元和特征融合子单元;
提取子单元,用于提取待分析图像的图像特征;其中,图像特征包括多个检测框各自对应的图像特征;
编码子单元,用于对第一文本进行特征编码,得到文本特征;
特征融合子单元,用于将图像特征与文本特征进行融合,得到融合特征。
图8所对应实施例中特征的说明可以参见图3和图5所对应实施例的相关说明,这里不再一一赘述。
由上述技术方案可以看出,对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;每个检测框有其对应的图像信息,融合特征中所对应的检测框数量往往较多,检测框中既包含与第一文本具有较强相关性的检测框,也包含与第一文本具有较弱相关性的检测框。为了能够删除相关性较弱的检测框,可以依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。在该技术方案中,通过对待分析图像和第一文本进行特征融合处理,可以实现对待分析图像和第一文本的综合分析。基于相关性对检测框进行删减,有效的降低了无效检测框造成的干扰,减少了视觉问答模型的计算量,提升了视觉问答任务的性能。
图9为本申请实施例提供的一种终端设备的结构示意图,包括显示屏91,输入接口92,以及分别与显示屏91、输入接口92连接的处理器;由于处理器内置于终端设备,因此在图9中未示出处理器。
输入接口92,用于接收待分析图像和第一文本;
处理器,用于对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系;
显示屏91,用于展示第一文本及其对应的第二文本。
图9所对应实施例中特征的说明可以参见图3和图5所对应实施例的相关说明,这里不再一一赘述。
输入接口92可以用于实现与外部设备如U盘的连接。输入接口可以有多个,图9中以一个输入接口为例。在实际应用中,用户可以通过输入键盘向终端设备输入待分析图像和第一文本,也可以将待分析图像和第一文本写入U盘,将U盘插入终端设备的输入接口92。终端设备在获取到待分析图像和第一文本后,可以将待分析图像和第一文本传输至处理器,处理器在对待分析图像和第一文本分析后,可以得到与第一文本匹配的第二文本,此时终端设备可以通过显示屏91展示第二文本。
需要说明的是,图9中终端设备包含的显示屏91、输入接口92、处理器等功能模块仅是举例说明,在实际应用中,基于实际需求终端设备也可以包含更多或更少的功能模块,对此不做限定。
由上述技术方案可以看出,对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,融合特征包含各检测框的坐标信息;每个检测框有其对应的图像信息,融合特征中所对应的检测框数量往往较多,检测框中既包含与第一文本具有较强相关性的检测框,也包含与第一文本具有较弱相关性的检测框。为了能够删除相关性较弱的检测框,可以依据待分析图像与第一文本的相关性,从融合特征中筛选出满足相关性要求的目标检测框;将目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与第一文本匹配的第二文本;其中,第一文本与第二文本具有逻辑对应关系。在该技术方 案中,通过对待分析图像和第一文本进行特征融合处理,可以实现对待分析图像和第一文本的综合分析。基于相关性对检测框进行删减,有效的降低了无效检测框造成的干扰,减少了视觉问答模型的计算量,提升了视觉问答任务的性能。
可以理解的是,如果上述实施例中的视觉问答任务的处理方法以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、磁碟或者光盘等各种可以存储程序代码的介质。
基于此,本申请实施例还提供了一种非易失性可读存储介质,非易失性可读存储介质上存储有计算机程序,计算机程序被处理器执行时实现如上述视觉问答任务的处理方法的步骤。
以上对本申请实施例所提供的一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质进行了详细介绍。说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上对本申请所提供的一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质进行了详细介绍。本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。

Claims (20)

  1. 一种视觉问答任务的处理方法,其特征在于,包括:
    对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,所述融合特征包含各检测框的坐标信息;
    依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框;
    将所述目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与所述第一文本匹配的第二文本;其中,所述第一文本与所述第二文本具有逻辑对应关系。
  2. 根据权利要求1所述的视觉问答任务的处理方法,其特征在于,所述依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框包括:
    计算所述待分析图像的图像特征中包含的各图像检测框与所述第一文本的文本特征对应的文本检测框的交并比;
    从所有所述图像检测框中选取出交并比大于预设阈值的目标检测框。
  3. 根据权利要求2所述的视觉问答任务的处理方法,其特征在于,所述目标检测框的数量小于所述图像特征中包含的图像检测框的数量。
  4. 根据权利要求1所述的视觉问答任务的处理方法,其特征在于,所述依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框包括:
    利用训练好的目标检测模型从所述融合特征中筛选出满足相关性要求的目标检测框;其中,所述目标检测模型基于历史图像和历史文本训练得到。
  5. 根据权利要求4所述的视觉问答任务的处理方法,其特征在于,所述目标检测模型包括骨干网络、编码模块、解码模块、融合模块、前向传播网络模块以及语言编码模块,其中:
    所述骨干网络、所述编码模块和所述解码模块用于实现所述待分析图像的图像特征的提取;
    所述语言编码模块用于提取所述第一文本的文本特征;
    所述融合模块用于实现对所述图像特征和所述文本特征的融合,从而筛选出所述目标检测框;
    所述前向传播网络模块用于提取出所述目标检测框的坐标信息、分类类别和语义特征。
  6. 根据权利要求5所述的视觉问答任务的处理方法,其特征在于,在所述目标检测模型的训练阶段,具有所述待分析图像的图片首先通过所述骨干网络提取所述图像特征,同时加入位置编码特征,位置编码特征是根据所述待分析图像的分辨率自适应获得的,其意义是得到图像特征图的局部位置信息;采用所述编码模块来编码所述图像特征,设置可学习的初始化嵌入 参数,从编码图像特征中解码出对应目标位置及分类,所述初始化嵌入参数相当于目标检测预定义锚点信息,通过所述解码模块中的解码器解码出对应物体的检测位置及相应类别。
  7. 根据权利要求4所述的视觉问答任务的处理方法,其特征在于,针对于所述目标检测模型的训练过程,所述方法包括:
    利用目标检测数据集训练初始检测模型,以得到所述初始检测模型对应的权重参数;
    基于所述目标检测数据集中各样本对应的样本标签,对所述初始检测模型进行正负样本判别训练;
    在完成正负样本判别训练后,计算所述初始检测模型的损失函数;其中,所述损失函数包括初始损失函数和正负样本对应的损失函数;
    依据所述初始检测模型的损失函数,对所述初始检测模型中包含的语言编码模块和融合模块各自的初始化权重以及所述初始检测模型对应的权重参数进行调整,得到训练好的目标检测模型。
  8. 根据权利要求7所述的视觉问答任务的处理方法,其特征在于,所述融合模块包含有两个单一模态变压器模型、一个跨模态变压器模型,一个线性层和一个正负样本判别模块。
  9. 根据权利要求7所述的视觉问答任务的处理方法,其特征在于,所述基于所述目标检测数据集中各样本对应的样本标签,对所述初始检测模型进行正负样本判别训练包括:
    利用所述初始检测模型识别所述目标检测数据集中各样本对应的概率值;
    依据所述目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数;
    基于所述正负样本对应的损失函数,调整所述初始检测模型中融合模块对应的参数,以完成正负样本判别训练。
  10. 根据权利要求9所述的视觉问答任务的处理方法,其特征在于,所述依据所述目标检测数据集中各样本对应的样本标签以及概率值,确定出正负样本对应的损失函数包括:
    将所述目标检测数据集中各样本对应的样本标签以及概率值输入至正负样本损失函数计算公式,以确定出正负样本对应的损失函数;其中,正负样本损失函数计算公式为:
    $$L_{\pm} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_{+}\,y_i\log p_i + w_{-}\,(1-y_i)\log(1-p_i)\,\Big]$$
    其中,N表示样本总个数,y i表示第i个样本的样本标签对应的数值,样本标签为正样本时y i=1,样本标签为负样本时y i=0,w +表示正样本对应的阈值,p i表示第i个样本属于正样本 的概率值,w -表示负样本对应的阈值。
  11. 根据权利要求7所述的视觉问答任务的处理方法,其特征在于,针对于所述视觉问答模型的训练过程,所述方法包括:
    利用训练好的目标检测模型从所述目标检测数据集中筛选出正样本;
    利用所述正样本对应的坐标信息、分类类别和语义特征对初始视觉问答模型进行训练,以得到训练好的视觉问答模型。
  12. 根据权利要求7所述的视觉问答任务的处理方法,其特征在于,所述对待分析图像和第一文本进行特征融合处理,得到融合特征包括:
    利用所述目标检测模型的目标检测模块提取所述待分析图像的图像特征;其中,所述图像特征包括多个检测框各自对应的图像特征;
    利用所述目标检测模型的语言编码模块对所述第一文本进行特征编码,得到文本特征;
    利用所述目标检测模型的融合模块将所述图像特征与所述文本特征进行融合,得到融合特征。
  13. 根据权利要求1-12任意一项所述的视觉问答任务的处理方法,其特征在于,所述第一文本为问题文本;所述第二文本为与所述问题文本匹配的答案文本。
  14. 根据权利要求13所述的视觉问答任务的处理方法,其特征在于,所述第一文本为多个问题文本,所述第二文本为与各所述问题文本各自匹配的答案文本;
    相应的,所述依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框包括:
    利用训练好的目标检测模型对所述待分析图像以及多个所述问题文本进行并行分析,以得到各所述问题文本各自对应的目标检测框。
  15. 根据权利要求1所述的视觉问答任务的处理方法,其特征在于,所述对待分析图像和第一文本进行特征融合处理,得到融合特征包括:
    提取所述待分析图像的图像特征;其中,所述图像特征包括多个检测框各自对应的图像特征;
    对所述第一文本进行特征编码,得到文本特征;
    将所述图像特征与所述文本特征进行融合,得到融合特征。
  16. 根据权利要求5所述的视觉问答任务的处理方法,其特征在于,还包括:
    在筛选出所述目标检测框后,通过所述前向传播网络模块提取出所述目标检测框对应的坐 标信息、分类类别和语义特征。
  17. 一种视觉问答任务的处理装置,其特征在于,包括融合单元、筛选单元和得到单元;
    所述融合单元,用于对待分析图像和第一文本进行特征融合处理,得到融合特征;其中,所述融合特征包含各检测框的坐标信息;
    所述筛选单元,用于依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框;
    所述得到单元,用于将所述目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与所述第一文本匹配的第二文本;其中,所述第一文本与所述第二文本具有逻辑对应关系。
  18. 一种终端设备,其特征在于,包括显示屏,输入接口,以及分别与所述显示屏、所述输入接口连接的处理器;
    所述输入接口,用于接收待分析图像和第一文本;
    所述处理器,用于对所述待分析图像和所述第一文本进行特征融合处理,得到融合特征;其中,所述融合特征包含各检测框的坐标信息;依据所述待分析图像与所述第一文本的相关性,从所述融合特征中筛选出满足相关性要求的目标检测框;将所述目标检测框对应的坐标信息、分类类别和语义特征输入训练好的视觉问答模型,以得到与所述第一文本匹配的第二文本;其中,所述第一文本与所述第二文本具有逻辑对应关系;
    所述显示屏,用于展示所述第一文本及其对应的所述第二文本。
  19. 一种电子设备,其特征在于,包括:
    存储器,用于存储计算机程序;
    处理器,用于执行所述计算机程序以实现如权利要求1至16任意一项所述视觉问答任务的处理方法的步骤。
  20. 一种非易失性可读存储介质,其特征在于,所述非易失性可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至16任意一项所述视觉问答任务的处理方法的步骤。
PCT/CN2022/142512 2022-09-02 2022-12-27 一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质 WO2024045444A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211068333.9A CN115129848B (zh) 2022-09-02 2022-09-02 一种视觉问答任务的处理方法、装置、设备和介质
CN202211068333.9 2022-09-02

Publications (1)

Publication Number Publication Date
WO2024045444A1 true WO2024045444A1 (zh) 2024-03-07

Family

ID=83387703

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142512 WO2024045444A1 (zh) 2022-09-02 2022-12-27 一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质

Country Status (2)

Country Link
CN (1) CN115129848B (zh)
WO (1) WO2024045444A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874706A (zh) * 2024-03-12 2024-04-12 之江实验室 一种多模态知识蒸馏学习方法及装置
CN117892140A (zh) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 视觉问答及其模型训练方法、装置、电子设备、存储介质
CN118151818A (zh) * 2024-05-08 2024-06-07 浙江口碑网络技术有限公司 基于视觉内容的交互方法以及装置
CN118194923A (zh) * 2024-05-17 2024-06-14 北京大学 大语言模型的构建方法、装置、设备及计算机可读介质
CN118245854A (zh) * 2024-05-29 2024-06-25 浙江大华技术股份有限公司 输电线路检测方法、装置、设备以及存储介质
CN118410877A (zh) * 2024-07-04 2024-07-30 杭州海康威视数字技术股份有限公司 一种答案确定方法、装置、电子设备及存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129848B (zh) * 2022-09-02 2023-02-28 苏州浪潮智能科技有限公司 一种视觉问答任务的处理方法、装置、设备和介质
CN115861995B (zh) * 2023-02-08 2023-05-23 山东海量信息技术研究院 一种视觉问答方法、装置及电子设备和存储介质
CN116884003B (zh) * 2023-07-18 2024-03-22 南京领行科技股份有限公司 图片自动标注方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (zh) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 图像问答方法、装置、系统和存储介质
CN111860653A (zh) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 一种视觉问答方法、装置及电子设备和存储介质
CN113435998A (zh) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 贷款逾期预测方法、装置、电子设备及存储介质
US20210406619A1 (en) * 2020-06-30 2021-12-30 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for visual question answering, computer device and medium
CN114840651A (zh) * 2022-04-20 2022-08-02 南方科技大学 视觉问答的训练方法、系统及计算机可读存储介质
CN115129848A (zh) * 2022-09-02 2022-09-30 苏州浪潮智能科技有限公司 一种视觉问答任务的处理方法、装置、设备和介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111994377B (zh) * 2020-07-21 2022-04-08 浙江大华技术股份有限公司 包装箱工序检测的方法、装置和计算机设备
CN112949630B (zh) * 2021-03-01 2024-03-19 北京交通大学 基于边框分级筛选的弱监督目标检测方法
CN114972944B (zh) * 2022-06-16 2023-10-27 中国电信股份有限公司 视觉问答模型的训练方法及装置、问答方法、介质、设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (zh) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 图像问答方法、装置、系统和存储介质
US20210406619A1 (en) * 2020-06-30 2021-12-30 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for visual question answering, computer device and medium
CN111860653A (zh) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 一种视觉问答方法、装置及电子设备和存储介质
CN113435998A (zh) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 贷款逾期预测方法、装置、电子设备及存储介质
CN114840651A (zh) * 2022-04-20 2022-08-02 南方科技大学 视觉问答的训练方法、系统及计算机可读存储介质
CN115129848A (zh) * 2022-09-02 2022-09-30 苏州浪潮智能科技有限公司 一种视觉问答任务的处理方法、装置、设备和介质

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874706A (zh) * 2024-03-12 2024-04-12 之江实验室 一种多模态知识蒸馏学习方法及装置
CN117874706B (zh) * 2024-03-12 2024-05-31 之江实验室 一种多模态知识蒸馏学习方法及装置
CN117892140A (zh) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 视觉问答及其模型训练方法、装置、电子设备、存储介质
CN117892140B (zh) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 视觉问答及其模型训练方法、装置、电子设备、存储介质
CN118151818A (zh) * 2024-05-08 2024-06-07 浙江口碑网络技术有限公司 基于视觉内容的交互方法以及装置
CN118194923A (zh) * 2024-05-17 2024-06-14 北京大学 大语言模型的构建方法、装置、设备及计算机可读介质
CN118245854A (zh) * 2024-05-29 2024-06-25 浙江大华技术股份有限公司 输电线路检测方法、装置、设备以及存储介质
CN118410877A (zh) * 2024-07-04 2024-07-30 杭州海康威视数字技术股份有限公司 一种答案确定方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN115129848B (zh) 2023-02-28
CN115129848A (zh) 2022-09-30

Similar Documents

Publication Publication Date Title
WO2024045444A1 (zh) 一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质
CN108829757B (zh) 一种聊天机器人的智能服务方法、服务器及存储介质
CN107797984B (zh) 智能交互方法、设备及存储介质
WO2021031480A1 (zh) 文本生成方法和装置
WO2024000867A1 (zh) 情绪识别方法、装置、设备及存储介质
CN110070484B (zh) 图像处理、图像美化方法、装置和存储介质
CN105531758A (zh) 使用外国单词语法的语音识别
CN112364168A (zh) 一种基于多属性信息融合的舆情分类方法
WO2024098623A1 (zh) 跨媒体检索及模型训练方法、装置、设备、菜谱检索系统
CN115455171B (zh) 文本视频的互检索以及模型训练方法、装置、设备及介质
WO2024066920A1 (zh) 虚拟场景的对话方法、装置、电子设备、计算机程序产品及计算机存储介质
CN110263218B (zh) 视频描述文本生成方法、装置、设备和介质
CN108021897A (zh) 图片问答方法及装置
CN112632244A (zh) 一种人机通话的优化方法、装置、计算机设备及存储介质
US20240177506A1 (en) Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption
CN111597341A (zh) 一种文档级关系抽取方法、装置、设备及存储介质
CN114339450A (zh) 视频评论生成方法、系统、设备及存储介质
CN112712068A (zh) 一种关键点检测方法、装置、电子设备及存储介质
CN112115131A (zh) 数据去噪方法、装置、设备及计算机可读存储介质
CN114548274A (zh) 一种基于多模态交互的谣言检测方法及系统
CN116310983A (zh) 多模态情感识别方法及装置
CN115471885A (zh) 动作单元相关性学习方法、装置、电子设备及存储介质
CN112861474B (zh) 一种信息标注方法、装置、设备及计算机可读存储介质
WO2024093578A1 (zh) 语音识别方法、装置、电子设备、存储介质及计算机程序产品
WO2024114303A1 (zh) 音素识别方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22957254

Country of ref document: EP

Kind code of ref document: A1