CN114186039A - Visual question answering method and device and electronic equipment - Google Patents

Visual question answering method and device and electronic equipment

Info

Publication number
CN114186039A
CN114186039A
Authority
CN
China
Prior art keywords
image
feature
characteristic
fusion
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111428675.2A
Other languages
Chinese (zh)
Inventor
焦佳成 (Jiao Jiacheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111428675.2A
Publication of CN114186039A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Abstract

The disclosure provides a visual question answering method and device, and an electronic device, relating to the field of computer vision and in particular to the technical fields of deep learning and cloud computing. The specific implementation scheme is as follows: acquiring an input image and an input text corresponding to the input image; determining a first image feature corresponding to the input image and a text feature corresponding to the text; performing fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature; and determining output information corresponding to the input image based on the fusion feature.

Description

Visual question answering method and device and electronic equipment
Technical Field
The present disclosure relates to the field of computer vision, and in particular to a visual question-answering method and device, and an electronic device, in the field of visual question answering.
Background
In customer service and similar scenarios, an intelligent customer service robot is usually required to answer the questions posed by users. How to answer user questions accurately is therefore a constantly pursued goal in the field of visual question answering.
Disclosure of Invention
The disclosure provides a visual question answering method, a visual question answering device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a visual question-answering method, comprising:
acquiring an input image and an input text corresponding to the input image;
determining a first image feature corresponding to the input image and a text feature corresponding to the text;
performing fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature;
and determining output information corresponding to the input image based on the fusion feature.
According to a second aspect of the present disclosure, there is provided a visual question-answering device comprising:
an acquisition module, configured to acquire an input image and an input text corresponding to the input image;
a first determining module, configured to determine a first image feature corresponding to the input image and a text feature corresponding to the text;
a feature fusion module, configured to perform fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature;
and a second determining module, configured to determine output information corresponding to the input image based on the fusion feature.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the visual question-answering method described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above-described visual question-answering method.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the visual question-answering method according to the above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic process flow diagram of a visual question answering method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic view of a detailed processing flow of a visual question answering method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an input image processed by a convolutional neural network model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an input image being processed by a convolutional layer provided by an embodiment of the present disclosure;
FIG. 5 is a schematic illustration of features obtained from convolutional layer processing being pooled as provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a language processing model provided by an embodiment of the present disclosure for learning a text;
FIG. 7 is a schematic diagram of a server performing at least two fusion processes on a first image feature and a text feature of an input image according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an alternative configuration of a visual question-answering device provided in the embodiments of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a visual question answering method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, the terms "first", "second", and "third" are used merely to distinguish similar objects and do not denote a particular order. It should be understood that, where permissible, "first", "second", and "third" may be interchanged in a particular order or sequence, so that the embodiments of the disclosure described herein can be practiced in orders other than those shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the disclosure only and is not intended to be limiting of the disclosure.
Before describing embodiments of the present disclosure in detail, relevant terms related to the present disclosure are explained.
Computer Vision (CV) technology: a science that studies how to make machines "see". More specifically, it replaces human eyes with cameras and computers to identify, track, and measure targets, and further processes the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (3-Dimension) technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition. In the embodiments of the present disclosure, by processing the input image, a corresponding answer can be output according to the input image.
Visual Question Answering (VQA) is a learning task involving both computer vision and Natural Language Processing (NLP). Specifically, a picture, together with a free-form, open-ended natural-language question about that picture, is input into a computer device, and the output is a generated natural-language answer. The computer device must have some understanding of the content of the picture, the meaning and intent of the question, and relevant common knowledge, so that it can output reasonable information conforming to natural-language rules according to the input picture and question.
The visual question answering method can be applied to intelligent customer service scenarios. In such a scenario, customer service may include at least pre-sale service and after-sale service, i.e., answering the questions a customer raises before and after purchasing a product or service. Using an intelligent customer service robot saves labor cost and improves the efficiency of customer service. In the related art, an intelligent customer service robot mainly answers user questions in two forms. The first form uses fixed answer content, for example a work-order system responding to a user's first question with a preset fixed answer; this form cannot analyze the user's question or the emotion behind it, and cannot provide targeted prompts and answers based on such analysis. The second form answers questions according to preset rules, for example performing keyword detection on the user's question and answering according to the degree of match between the detected keywords and preset keywords; this form cannot give professional answers to uncommon questions.
The above intelligent customer service scenario is merely an example; the visual question-answering method provided by the embodiments of the present disclosure can also be applied to other scenarios requiring visual dialogue, such as explaining pictures, and the embodiments of the present disclosure do not limit the application scenario.
Fig. 1 is a schematic view of an alternative processing flow of a visual question-answering method provided in the present disclosure, which may include at least the following steps:
step S101, an input image and an input text corresponding to the input image are obtained.
In some optional embodiments, the input image is an image sent to the electronic device by the user through a terminal device. The input image may be an image shot by the user with the terminal device, an image stored on the user's terminal device, or a screenshot, captured by the user, of an application interface or applet interface installed on the terminal device. The embodiments of the present disclosure do not limit the type of the input image.
In some alternative embodiments, after the input image is acquired, its legality can be analyzed; if the input image includes restricted content, no subsequent processing is performed on it. The restricted content may include pictures and texts relating to violence, pictures relating to gambling, and the like.
In some optional embodiments, the input text corresponding to the input image may be sent to the electronic device by the user through the terminal device, or may be acquired by the electronic device according to the input image; the input text corresponding to the input image may be a question asked about the input image.
In a specific implementation, for the case where the user sends the input text corresponding to the input image to the electronic device through the terminal device, the user may send voice information to the electronic device through the terminal device, and the electronic device obtains the input text by analyzing the voice information; alternatively, the user sends a document containing the input text corresponding to the input image.
In a specific implementation, for the case where the electronic device acquires the input text according to the input image, a plurality of images and the text corresponding to each image may be pre-stored in the electronic device, where the text corresponding to each image may be a question based on that image, and the texts corresponding to the images form a question list. The questions in the list may be formed based on domain knowledge corresponding to the image; for example, in the customer service field, key information is identified and extracted from the image, and if an order number is included in the image, the question corresponding to the image may be "what is the order number?". Optionally, all questions asked about each image may be pre-stored, or only frequently asked questions may be; the frequency threshold for pre-storing a question can be set flexibly as needed.
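As a minimal sketch of how such a pre-stored question list might be consulted, the snippet below maps key information detected in an image to pre-stored questions; the dictionary contents, field names, and the upstream key-information extraction step are illustrative assumptions, not details taken from this disclosure.

```python
# Hypothetical mapping from key-information fields (assumed to be produced by
# an upstream OCR / key-information extraction step) to pre-stored questions.
PRESTORED_QUESTIONS = {
    "order_number": "What is the order number?",
    "tracking_number": "What is the tracking number?",
}

def infer_input_text(extracted_fields):
    """Return the pre-stored questions matching the fields found in the image."""
    return [PRESTORED_QUESTIONS[f] for f in extracted_fields
            if f in PRESTORED_QUESTIONS]

# Example: the extraction step detected an order number in the input image.
print(infer_input_text(["order_number"]))  # ['What is the order number?']
```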
Step S102, determining a first image feature corresponding to the input image and a text feature corresponding to the text.
In some alternative embodiments, the first image feature and the text feature may be obtained from two different neural network models included in a visual question-answering model on the electronic device. As an example, the first image feature may be obtained by processing the input image with a convolutional neural network model, and the text feature may be obtained by a natural language processing model learning the text.
Step S103, performing fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature.
In some optional embodiments, the electronic device may process the first image feature based on the channel weight of the input image and the image weight of the input image to obtain a second image feature, and perform fusion processing at least twice on the second image feature and the text feature to obtain the fusion feature. The channel weight characterizes how important each channel of the input image is to the detection of the input image; the image weight characterizes the importance of the input image.
Step S104, determining output information corresponding to the input image based on the fusion feature.
In some optional embodiments, determining the output information corresponding to the input image based on the fusion feature may include: determining at least one candidate output information corresponding to the input image based on the fusion feature; determining a confidence level of each candidate output information; and determining the candidate output information with the highest confidence as the output information corresponding to the input image.
In a specific implementation, the fusion feature can be input into a pre-trained visual question-answering model to obtain a plurality of candidate output information and the confidence of each; the confidences are then normalized and compared, and the candidate output information corresponding to the maximum normalized confidence is the output information corresponding to the input image.
The visual question-answering model is a pre-trained neural network model capable of determining the output information corresponding to the input image according to the fusion feature.
In some optional embodiments, a confidence threshold may also be preset, and the confidence of the output information corresponding to the input image should be greater than or equal to the confidence threshold. In some scenarios, if the confidences of all candidate output information are smaller than the confidence threshold, none of the candidates can be used as the output information for the input image.
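A minimal sketch of this selection step is shown below, assuming the candidate confidences arrive as raw model scores that are normalized with softmax; the threshold value of 0.75 is only the illustrative figure mentioned later in this description.

```python
import torch
import torch.nn.functional as F

def select_answer(logits, candidates, confidence_threshold=0.75):
    """Normalize candidate confidences and pick the highest-confidence answer.

    Returns None when no candidate clears the threshold, mirroring the
    'discard all candidates' branch described above.
    """
    probs = F.softmax(logits, dim=-1)        # normalize the confidences
    best_prob, best_idx = probs.max(dim=-1)  # highest-confidence candidate
    if best_prob.item() < confidence_threshold:
        return None                          # discard all candidate answers
    return candidates[best_idx.item()]

logits = torch.tensor([2.3, 0.1, -1.0])
print(select_answer(logits, ["Order 12345", "Unknown", "N/A"]))  # Order 12345
```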
In the embodiments of the present disclosure, the image feature and the text feature are fused at least twice, and questions posed by the user are answered based on the two modalities of text and image; compared with answering user questions using fixed answer content or preset rules, the corresponding output information can be obtained more flexibly and accurately based on the input image.
With respect to step S103, processing the first image feature based on the channel weight of the input image and the image weight of the input image to obtain the second image feature may include:
Step a, weighting the image features of each corresponding channel in the first image feature based on the channel weight of that channel in the input image, to obtain channel image features;
Step b, weighting the channel image features based on the image weight of the input image, to obtain the second image feature.
In a specific implementation, the input image may be a multi-channel image, such as a three-channel image; each channel corresponds to a weight coefficient, and the channel weights of the three channels may differ. By grading the importance of the channels and assigning them different channel weights, the features of each channel can be screened according to that channel's importance. In some optional embodiments, the channel weight coefficients of some channels may be too small, which may interfere with the learning-rate setting of the neural network model used to extract the first image feature; for this reason, regularization can be applied so that a smaller channel weight coefficient corresponds to a larger feature vector in the first image feature.
In a specific implementation, different input images correspond to different image weights, and the image weight is multiplied element-wise (which may also be referred to as cross multiplication) by the image features already processed by the channel weights, to obtain the second image feature.
In this way, multiplying the first image feature of each channel by the corresponding channel weight coefficient, and then by the image weight coefficient corresponding to the input image, means that the weight coefficients applied to the respective image slices of the input image differ: contextual image features that contribute a forward gain to the detection result of the input image are rewarded, and contextual image features irrelevant to the detection result are penalized.
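A minimal sketch of steps a and b, assuming the channel weights and the per-image weight have already been computed (how they are learned is not shown here):

```python
import torch

def apply_channel_and_image_weights(first_image_feature, channel_weights, image_weight):
    """Weight a (D, H, W) feature map per channel, then by a per-image scalar."""
    # Step a: per-channel weighting -> channel image features
    channel_feats = first_image_feature * channel_weights.view(-1, 1, 1)
    # Step b: element-wise weighting by the image weight -> second image feature
    return channel_feats * image_weight

D, H, W = 3, 8, 8
feat = torch.randn(D, H, W)
second = apply_channel_and_image_weights(feat, torch.rand(D), torch.rand(1))
print(second.shape)  # torch.Size([3, 8, 8])
```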
For step S103, performing fusion processing at least twice on the second image feature and the text feature to obtain the fusion feature may include:
Step c, performing first fusion processing on the second image feature and the text feature to obtain a third image feature.
In some optional embodiments, the first fusion processing of the second image feature and the text feature may include adding the second image feature and the text feature; the two features have the same dimensions.
Step d, performing second fusion processing on the third image feature and the text feature to obtain the fusion feature.
In some optional embodiments, the second fusion processing of the third image feature and the text feature may include: determining a spatial attention weight coefficient for each pixel in the input image; weighting the third image feature with the spatial attention weight coefficients to obtain a fourth image feature; and adding the fourth image feature and the text feature to obtain the fusion feature.
The spatial attention weight coefficient of each pixel strengthens the attention paid to regions inside the input image; feature enhancement of regions in the input image can be achieved by multiplying the spatial attention weight coefficient of each pixel by the image feature of the corresponding pixel.
In the embodiments of the present disclosure, processing the image features with the spatial attention mechanism weakens information that is useless for detecting the input image, such as background information contained in the input image, and enhances the key information contained in partial regions of the input image.
For step d, a specific implementation of the second fusion processing may include: fusing the third image feature and the text feature by matrix element-wise addition to obtain the fusion feature, where the third image feature has the same matrix dimensions as the text feature.
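The two-stage fusion of steps c and d can be sketched as follows, assuming all features share the same (D, H, W) shape and that a per-pixel spatial attention map is already available:

```python
import torch

def fuse_features(second_image_feature, text_feature, spatial_attention):
    """Two-stage fusion: add, weight per pixel with spatial attention, add again."""
    # Step c: first fusion -- element-wise addition of image and text features
    third_image_feature = second_image_feature + text_feature
    # Step d: weight every pixel of the third feature by its spatial attention,
    fourth_image_feature = third_image_feature * spatial_attention  # (1, H, W) broadcasts
    # then add the text feature again to obtain the fusion feature.
    return fourth_image_feature + text_feature

D, H, W = 3, 8, 8
fused = fuse_features(torch.randn(D, H, W), torch.randn(D, H, W), torch.rand(1, H, W))
print(fused.shape)  # torch.Size([3, 8, 8])
```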
Fig. 2 is a schematic detailed processing flow diagram of a visual question-answering method provided in the present disclosure, which may include at least the following steps:
in step S201, the user transmits an input image to the terminal device through the terminal device.
In some optional embodiments, an intelligent customer service is provided in an application installed on the terminal device, and the user uploads the input image through the intelligent customer service.
In step S202, the terminal device transmits the input image to the server.
In step S203, the server determines an input text corresponding to the input image according to the input image.
In some optional embodiments, the server detects the input image, acquires key information in the input image, and determines an input text corresponding to the input image based on the acquired key information.
In step S204, the server determines a first image feature corresponding to the input image and a text feature corresponding to the text.
In some optional embodiments, the server may process the input image through a convolutional neural network model included in the visual question-answering model to obtain the first image feature, and learn the text through a natural language processing model included in the visual question-answering model to obtain the text feature.
As an example, FIG. 3 is a schematic diagram of processing an input image with a convolutional neural network model: the input image undergoes feature extraction by convolutional layers and dimension-reduction processing by pooling layers; after five rounds of convolution and pooling, the first image feature is output through a fully connected layer.
A schematic diagram of the input image after convolutional-layer processing is shown in FIG. 4: the convolutional layer yields the image features of small regions of the input image. A schematic diagram of pooling the features obtained from convolutional-layer processing is shown in FIG. 5: the image features of four small regions are pooled into a dimension-reduced image feature.
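A minimal PyTorch sketch of the backbone the figures describe is given below; the channel sizes, the 224x224 input assumption, and the output dimension are illustrative choices, not values from the patent.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Five convolution + pooling rounds followed by a fully connected layer."""

    def __init__(self, out_dim=512):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (32, 64, 128, 256, 256):       # five conv + pool rounds
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]               # pooling halves H and W
            in_ch = out_ch
        self.backbone = nn.Sequential(*layers)
        self.fc = nn.Linear(256 * 7 * 7, out_dim)     # assumes a 224x224 input

    def forward(self, x):
        feats = self.backbone(x)                      # (N, 256, 7, 7)
        return self.fc(feats.flatten(1))              # first image feature

model = ImageFeatureExtractor()
print(model(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 512])
```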
As an example, the language processing model learns the text as shown in FIG. 6. $X_t$ denotes the text input at time $t$. The language processing model encodes the input text to obtain a vector of preset length, and in each processing module the input vector at the current time is combined with the long-term and short-term memory vectors of the previous processing module to obtain the text feature. Here $\sigma$ and $\tanh$ are processing functions: $\sigma$ is the sigmoid function,

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

and the tanh function is

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$
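A minimal sketch of such a text encoder, using an off-the-shelf LSTM whose gates apply the sigmoid and tanh functions above; the vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a token sequence into a fixed-length text feature with an LSTM."""

    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # X_t for each time step t
        _, (h_n, _) = self.lstm(x)         # combines long/short-term memory
        return h_n[-1]                     # fixed-length text feature

encoder = TextEncoder()
tokens = torch.randint(0, 10000, (1, 12))  # a 12-token question
print(encoder(tokens).shape)               # torch.Size([1, 512])
```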
in step S205, the server performs at least two times of fusion processing on the first image feature and the text feature of the input image to obtain a fusion feature.
In some optional embodiments, the server performs at least two fusion processes on the first image feature and the text feature of the input image, as shown in FIG. 7: the server processes the first image feature based on the channel weight and the image weight of the input image to obtain a second image feature; combines a contextual attention mechanism to perform a first fusion of the second image feature and the text feature, obtaining a third image feature; processes the third image feature based on a spatial attention mechanism to obtain a fourth image feature; and performs a second fusion of the fourth image feature and the text feature to obtain the fusion feature.
The contextual attention mechanism relies on a convolutional layer, which may be denoted $\phi_C$; the convolution kernel size of this layer matches the dimensions of the text feature and the second image feature, and a contextual attention weight is generated through the convolutional layer. The contextual attention weight can be activated by an activation function, after which it is applied in the first fusion of the second image feature and the text feature.
The first image feature corresponding to the input image is denoted $X_i \in \mathbb{R}^{D \times H \times W}$, where $D$ is the number of channels corresponding to the first image feature, and $H$ and $W$ are the height and width of the feature map corresponding to the first image feature. The contextual attention weight may be determined by the formula

$$C_i = \phi_C(X_i),$$

where $X_i$ is the first image feature corresponding to the $i$-th input image. The channel weight of the $d$-th channel of the $i$-th input image, the image weight of the $i$-th input image, and the weighting of the first image feature of the $i$-th input image by these weights (yielding the weighted feature $X'_i$) are each given by formulas that are rendered only as images in the source document.
in particular implementations, the spatial attention mechanism enables feature enhancement of regions in the input image. As an example, the spatial attention mechanism may be implemented by a convolutional layer, and an attention matrix is generated from an output result of the convolutional layer based on the spatial attention mechanism, and then the attention matrix is activated and regularized, as shown in the following formula:
Si=φS(X′i)
Figure BDA0003377614500000101
Figure BDA0003377614500000102
Figure BDA0003377614500000103
wherein phi denotes a convolutional layer,
Figure BDA0003377614500000104
and representing the output result after the regularization processing.
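Since the exact activation and regularization formulas are not reproduced in the source text, the sketch below shows one plausible reading: a 1x1 convolutional layer standing in for phi, with a softmax over all spatial positions serving as the activation and normalization. Kernel size and the softmax choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention(nn.Module):
    """Generate a per-pixel attention map from a convolutional layer (phi)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # phi: conv layer

    def forward(self, x):                    # x: (N, D, H, W)
        s = self.conv(x)                     # S_i = phi(X'_i)
        n, _, h, w = s.shape
        # Activate and normalize the attention map over all spatial positions.
        attn = F.softmax(s.view(n, -1), dim=-1).view(n, 1, h, w)
        return x * attn                      # per-pixel feature enhancement

attn = ConvAttention(channels=256)
print(attn(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
```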
In step S206, the server determines output information corresponding to the input image based on the fusion feature.
In some optional embodiments, the server obtains a plurality of candidate output information corresponding to the fusion features and a confidence degree of each candidate output information through a visual question-answering model; and then carrying out normalization processing on the confidence level, and comparing the confidence level after the normalization processing to obtain the maximum value of the confidence level after the normalization processing, wherein the candidate output information corresponding to the maximum value of the confidence level is the output information corresponding to the input image. The confidence of the output information corresponding to the input image is greater than or equal to a confidence threshold; the confidence threshold can be flexibly set according to actual conditions, such as 75%.
In some optional embodiments, after determining the output information corresponding to the input image, the server sends the output information to the terminal device; the terminal device can present the output information to the user through the intelligent customer service in the application program.
The embodiments of the present disclosure further provide a visual question answering device, whose structure is shown in FIG. 8. The visual question answering device includes:
an obtaining module 401, configured to obtain an input image and an input text corresponding to the input image;
a first determining module 402, configured to determine a first image feature corresponding to the input image and a text feature corresponding to the text;
a feature fusion module 403, configured to perform at least two fusion processes based on the first image feature and the text feature to obtain a fusion feature;
a second determining module 404, configured to determine output information corresponding to the input image based on the fusion feature.
In some optional embodiments, the feature fusion module 403 is configured to process the first image feature based on the channel weight of the input image and the image weight of the input image to obtain a second image feature, and to perform fusion processing at least twice on the second image feature and the text feature to obtain the fusion feature.
In some optional embodiments, the feature fusion module 403 is configured to weight the image features of the corresponding channels in the first image feature based on the channel weight corresponding to each channel in the input image, to obtain channel image features; and to weight the channel image features based on the image weight of the input image, to obtain the second image feature.
In some optional embodiments, the feature fusion module 403 is configured to perform a first fusion process on the second image feature and the text feature to obtain a third image feature;
and perform a second fusion process on the third image feature and the text feature to obtain the fusion feature.
In some optional embodiments, the feature fusion module 403 is configured to add the second image feature and the text feature to obtain the third image feature.
In some alternative embodiments, the feature fusion module 403 is configured to determine a spatial attention weight coefficient for each pixel in the input image;
weighting the third image feature with the spatial attention weight coefficients to obtain a fourth image feature;
and adding the fourth image feature and the text feature to obtain the fusion feature.
In some optional embodiments, the second determining module 404 is configured to determine at least one candidate output information corresponding to the input image based on the fusion feature;
determining a confidence level for each of the candidate output information;
and determining the candidate output information with the highest confidence coefficient as the output information corresponding to the input image.
In some optional embodiments, the second determining module 404 is further configured to discard all the candidate output information if the confidences of all the candidate output information are less than the confidence threshold.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. In some alternative embodiments, the electronic device 800 may be a terminal device or a server. In some alternative embodiments, the electronic device 800 may implement the visual question-answering method provided by the embodiments of the present disclosure by running a computer program, which may be, for example, a native program or software module in an operating system; a native Application (APP), i.e., a program that must be installed in the operating system to run; an applet, i.e., a program that only needs to be downloaded into a browser environment to run; or an applet that can be embedded into any APP. In general, the computer program may be any form of application, module, or plug-in.
In practical applications, the electronic device 800 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Cloud technology refers to a hosting technology that unifies a series of resources, such as hardware, software, and networks, in a wide area network or local area network to realize the computing, storage, processing, and sharing of data. The electronic device 800 may be, but is not limited to, a smartphone, tablet computer, laptop computer, desktop computer, smart speaker, smart television, smart watch, and the like.
Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, in-vehicle terminals, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 9, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the methods and processes described above, such as the visual question-answering method. For example, in some alternative embodiments, the visual question-answering method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some alternative embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the visual question-answering method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the visual question-answering method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the visual question-answering method of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of visual question answering, comprising:
acquiring an input image and an input text corresponding to the input image;
determining a first image feature corresponding to the input image and a text feature corresponding to the text;
performing fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature;
and determining output information corresponding to the input image based on the fusion feature.
2. The method of claim 1, wherein the performing fusion processing at least twice based on the first image feature and the text feature to obtain a fusion feature comprises:
processing the first image feature based on the channel weight of the input image and the image weight of the input image to obtain a second image feature;
and performing fusion processing at least twice on the second image feature and the text feature to obtain the fusion feature.
3. The method of claim 2, wherein the processing the first image feature based on the channel weights of the input image and the image weights of the input image to obtain a second image feature comprises:
respectively performing weighting processing on the image features of the corresponding channels in the first image features based on the channel weight corresponding to each channel in the input image to obtain channel image features;
and carrying out weighting processing on the channel image features based on the image weight of the input image to obtain the second image feature.
4. The method of claim 2, wherein the fusing of the second image feature and the text feature to obtain the fusion feature comprises:
performing first fusion processing on the second image feature and the text feature to obtain a third image feature;
and performing second fusion processing on the third image feature and the text feature to obtain the fusion feature.
5. The method of claim 4, wherein the performing the first fusion processing on the second image feature and the text feature to obtain a third image feature comprises:
and adding the second image feature and the text feature to obtain the third image feature.
6. The method according to claim 4, wherein the second fusion processing of the third image feature and the text feature to obtain the fusion feature comprises:
determining a spatial attention weight coefficient for each pixel in the input image;
weighting the third image feature with the spatial attention weight coefficient to obtain a fourth image feature;
and adding the fourth image feature and the text feature to obtain the fusion feature.
7. The method of claim 1, wherein the determining output information corresponding to the input image based on the fusion feature comprises:
determining at least one candidate output information corresponding to the input image based on the fusion feature;
determining a confidence level for each of the candidate output information;
and determining the candidate output information with the highest confidence coefficient as the output information corresponding to the input image.
8. The method of claim 7, wherein the method further comprises:
and in response to the confidence degrees of the candidate output information all being smaller than the confidence degree threshold value, discarding all the candidate output information.
9. A visual question-answering device, comprising:
an acquisition module, configured to acquire an input image and an input text corresponding to the input image;
the first determining module is used for determining a first image feature corresponding to the input image and a text feature corresponding to the text;
the feature fusion module is used for performing fusion processing at least twice based on the first image feature and the text feature to obtain fusion features;
and the second determination module is used for determining the output information corresponding to the input image based on the fusion characteristic.
10. The visual question-answering device according to claim 9, wherein the feature fusion module is configured to process the first image feature based on a channel weight of the input image and an image weight of the input image to obtain a second image feature;
and perform fusion processing at least twice on the second image feature and the text feature to obtain the fusion feature.
11. The visual question-answering device according to claim 10, wherein the feature fusion module is configured to perform weighting processing on image features of corresponding channels in the first image features respectively based on channel weights corresponding to each channel in the input image, so as to obtain channel image features;
and carry out weighting processing on the channel image features based on the image weight of the input image to obtain the second image feature.
12. The visual question-answering device according to claim 10, wherein the feature fusion module is configured to perform a first fusion process on the second image feature and the text feature to obtain a third image feature;
and perform a second fusion process on the third image feature and the text feature to obtain the fusion feature.
13. The visual question-answering device according to claim 12, wherein the feature fusion module is configured to add the second image feature and the text feature to obtain the third image feature.
14. The visual question answering device according to claim 12, wherein the feature fusion module is configured to determine a spatial attention weight coefficient for each pixel in the input image;
weight the third image feature with the spatial attention weight coefficient to obtain a fourth image feature;
and add the fourth image feature and the text feature to obtain the fusion feature.
15. The visual question answering device according to claim 9, wherein the second determining module is configured to determine at least one candidate output information corresponding to the input image based on the fusion feature;
determining a confidence level for each of the candidate output information;
and determining the candidate output information with the highest confidence coefficient as the output information corresponding to the input image.
16. The visual question-answering device according to claim 15, wherein the second determining module is configured to discard all of the candidate output information in response to the confidence level of the candidate output information being less than a confidence threshold.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 8.
19. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
CN202111428675.2A 2021-11-26 2021-11-26 Visual question answering method and device and electronic equipment Pending CN114186039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111428675.2A CN114186039A (en) 2021-11-26 2021-11-26 Visual question answering method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111428675.2A CN114186039A (en) 2021-11-26 2021-11-26 Visual question answering method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114186039A true CN114186039A (en) 2022-03-15

Family

ID=80602815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111428675.2A Pending CN114186039A (en) 2021-11-26 2021-11-26 Visual question answering method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114186039A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071835A (en) * 2023-04-07 2023-05-05 平安银行股份有限公司 Face recognition attack post screening method and device and electronic equipment


Similar Documents

Publication Publication Date Title
CN110032641B (en) Method and device for extracting event by using neural network and executed by computer
CN107590255B (en) Information pushing method and device
US20170150235A1 (en) Jointly Modeling Embedding and Translation to Bridge Video and Language
CN109993150B (en) Method and device for identifying age
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
CN116824278B (en) Image content analysis method, device, equipment and medium
CN113705362B (en) Training method and device of image detection model, electronic equipment and storage medium
CN111198939A (en) Statement similarity analysis method and device and computer equipment
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN113836268A (en) Document understanding method and device, electronic equipment and medium
CN112861885A (en) Image recognition method and device, electronic equipment and storage medium
CN114612688B (en) Countermeasure sample generation method, model training method, processing method and electronic equipment
CN114841142A (en) Text generation method and device, electronic equipment and storage medium
CN113407851A (en) Method, device, equipment and medium for determining recommendation information based on double-tower model
CN110728319B (en) Image generation method and device and computer storage medium
CN110008926B (en) Method and device for identifying age
CN114494747A (en) Model training method, image processing method, device, electronic device and medium
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN114186039A (en) Visual question answering method and device and electronic equipment
CN113705792A (en) Personalized recommendation method, device, equipment and medium based on deep learning model
CN113643260A (en) Method, apparatus, device, medium and product for detecting image quality
CN112307738A (en) Method and device for processing text
CN110046571B (en) Method and device for identifying age
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination