CN114417044B - Image question and answer method and device - Google Patents

Image question and answer method and device

Info

Publication number
CN114417044B
Authority
CN
China
Prior art keywords
image
features
determining
feature
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210061065.1A
Other languages
Chinese (zh)
Other versions
CN114417044A (en)
Inventor
张美美
陈方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202210061065.1A priority Critical patent/CN114417044B/en
Publication of CN114417044A publication Critical patent/CN114417044A/en
Application granted granted Critical
Publication of CN114417044B publication Critical patent/CN114417044B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical fields of computer vision and natural language processing, and provides an image question-answering method and device. In an embodiment, the method comprises the following steps: determining an image; determining a question vector corresponding to a question about the image; extracting text features from the question vector, and determining the text features corresponding to the question vector; extracting image features from the image, and determining the image features corresponding to the image; fusing the text features and the image features to determine fusion features, wherein the fusion features indicate the relevance of different regions in the image to the question; and classifying based on the fusion features and the image features so as to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image. By understanding and analyzing the region of the image related to the question, questions about the image can be answered more accurately.

Description

Image question and answer method and device
Technical Field
The invention relates to the technical fields of computer vision and natural language processing, and in particular to an image question-answering method and device.
Background
Visual Question Answering (VQA) is a type of interactive system combining computer vision and natural language processing. It aims to provide an interactive question-and-answer mode in which the answer to a question is intelligently predicted from an input picture and the corresponding question. Although visual question answering has achieved some results, most work revolves around natural images, and the application scenarios of remote sensing images differ from those of natural images: natural images have depth-of-field differences, and people tend to focus on the salient targets in them, whereas all targets in a remote sensing image lie at the same depth of field. Existing visual question-answering techniques therefore cannot accurately answer questions about remote sensing images.
At present there is little research on visual question answering for remote sensing images, and a method for realizing it is needed.
Disclosure of Invention
The invention provides an image question-answering method, an image question-answering device, a computer-readable storage medium and an electronic device, which understand and analyze the region of an image related to a question based on both the relationship between image regions and the question and the image features, so that questions about the image can be answered more accurately.
In a first aspect, the present invention provides an image question-answering method, including:
determining an image;
determining a question vector corresponding to a question about the image;
extracting text features from the question vector, and determining the text features corresponding to the question vector;
extracting image features from the image, and determining the image features corresponding to the image;
fusing the text features and the image features to determine fusion features, wherein the fusion features indicate the relevance of the question to different regions in the image;
and classifying based on the fusion features and the image features to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
In a second aspect, the present invention provides an image question-answering apparatus, including:
an image determining module, for determining an image;
a vector determining module, for determining a question vector corresponding to a question about the image;
a text feature extraction module, for extracting text features from the question vector and determining the text features corresponding to the question vector;
an image feature extraction module, for extracting image features from the image and determining the image features corresponding to the image;
a fusion module, for fusing the text features and the image features and determining fusion features, wherein the fusion features indicate the features of the region of the image related to the question;
and an answer prediction module, for classifying based on the fusion features and the image features to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
In a third aspect, the present invention provides a computer-readable storage medium comprising execution instructions which, when executed by a processor of an electronic device, cause the processor to perform the method of any one of the first aspects.
In a fourth aspect, the present invention provides an electronic device comprising a processor and a memory storing execution instructions; when executing the execution instructions stored in the memory, the processor performs the method of any one of the first aspects.
The invention provides a remote sensing image visual question-answering method, device, computer-readable storage medium and electronic device. Guided by the question, the method determines a question vector corresponding to the question about an image; then extracts text features from the question vector and determines the text features corresponding to it; then extracts image features from the image and determines the image features corresponding to it; then fuses the text features and the image features to determine fusion features, wherein the fusion features indicate the relationship between the question and different regions in the image; and finally classifies based on the fusion features and the image features, understands the semantics of the region of the image related to the question, and determines the answer to the question about the image. In summary, the technical scheme of the invention understands and analyzes the region of the image related to the question based on both the relationship between image regions and the question and the image features, so that questions about the image can be answered more accurately.
Further effects of the above preferred embodiments will be described below in connection with the detailed description.
Drawings
In order to more clearly illustrate the embodiments of the invention or the prior-art solutions, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of question-answer model training provided in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of processing images and question text of a question-answering model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image feature extraction module according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for image question answering according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an image question-answering apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of question-answer model training according to an embodiment of the present invention. In this embodiment, as shown in fig. 2, the question-answer model includes an image feature extraction model, a text feature extraction model, a semantic analysis model, a fusion model and an answer prediction model, and the training process includes the following steps:
Step 101, constructing an image library, a question library and an answer library.
An image dataset is produced with scene diversity and question complexity: the dataset contains images of different scenes. A question library is then created from one or more questions about each image in the dataset, and an answer library is formed from the answers to the questions in the question library. The questions are designed manually and posed directly rather than from templates; each question is answered multiple times by different people, and only the samples with consistent answers are retained. The questions and answers involve not only local targets in the images but also the understanding of the image scenes. In addition, to increase the diversity of the question and answer libraries, similar questions are phrased in different ways, and yes/no (binary) questions are avoided.
For example, a question may be a local question, i.e. one related to a target in the image, such as "how many oil storage tanks are there in the image"; or a global question, i.e. one about what the image as a whole expresses, such as "what is the scene of the image".
Step 102, determining a word vector library based on the question library and the answer library, and encoding each question in the question library into a question vector based on the word vectors in the word vector library, thereby forming a question vector library.
First, a word library is established: all the words appearing in the question library and the answer library are collected, an index is established for each word, and the code of each word is determined. Then, a word vector encoder is trained with the word codes as its inputs, and the word vector of each word in the word library is determined by the trained encoder.
As one possible implementation, the word vector encoder is a word2vec ("word to vector") model, a natural language processing model for vectorizing words. Word2vec was released as an efficient open-source tool by Google in 2013; it represents the words of natural language as dense real-valued vectors. The word2vec model is a three-layer neural network capable of word co-occurrence analysis: trained on a large corpus, it converts the processing of the words appearing in the corpus into operations on vectors in a K-dimensional vector space, where the similarity between vectors represents the semantic similarity between words and, further, between texts. Many natural language processing tasks can therefore be performed using the word vectors obtained from a trained word2vec model.
The architecture of the word2vec model may be the skip-gram architecture or the CBOW architecture; different architectures correspond to different training-data generation formats.
The word vector of each word in the word library is determined by the trained word2vec model, thereby forming a word vector library, as shown in table 1 below.
word                word vector
scene               (X11, X12, X13, X14 … X1k)
oil storage tank    (X21, X22, X23, X24 … X2k)
residential area    (X31, X32, X33, X34 … X3k)
TABLE 1
In table 1, the word vector corresponding to the word "scene" is (X11, X12, X13, X14 … X1k); the word vector corresponding to the word "oil storage tank" is (X21, X22, X23, X24 … X2k); and the word vector corresponding to the word "residential area" is (X31, X32, X33, X34 … X3k). Typically, k is set to 50 or 100. It should be understood that X11, X12, X13, X14 … X1k and so on are merely symbols used here for simplicity to denote the elements of each dimension; the elements of a word vector are actually real (floating-point) numbers. For example, X11, X12, X13, X14 and X1k may be 0.1265, 0.009635, 0.07018, -0.009701 and -0.2122, respectively.
In practical application, since questions have different lengths, the question length is uniformly set to a constant value, for example 18 words; questions shorter than 18 words are padded with zero values, which ensures that all question vectors have the same length.
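To make this step concrete, the following is a minimal sketch of the word-vector library and fixed-length question encoding, assuming the gensim word2vec tool (the patent does not name a specific implementation); the toy corpus and all function and variable names are illustrative, while k = 50 and the question length of 18 follow the values above.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus; in practice every tokenized question and answer from the
# question and answer libraries would be collected here.
corpus = [
    ["how", "many", "oil", "storage", "tanks", "are", "in", "the", "image"],
    ["what", "is", "the", "scene", "of", "the", "image"],
    ["residential", "area"],
]

K = 50        # word vector dimension (the text suggests k = 50 or 100)
MAX_LEN = 18  # fixed question length; shorter questions are zero-padded

# Train the word vector encoder (sg=1 selects the skip-gram architecture).
w2v = Word2Vec(sentences=corpus, vector_size=K, sg=1, min_count=1)

def encode_question(tokens):
    """Encode a tokenized question as a MAX_LEN x K question vector."""
    vectors = [w2v.wv[t] for t in tokens[:MAX_LEN] if t in w2v.wv]
    padding = [np.zeros(K, dtype=np.float32)] * (MAX_LEN - len(vectors))
    return np.stack(vectors + padding)

q = encode_question(["what", "is", "the", "scene", "of", "the", "image"])
print(q.shape)  # (18, 50)
```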
Step 103, for each vector in the question vector library, inputting the question vector into the text feature extraction model for text feature extraction, and determining the text features, output by the model, that correspond to the question vector.
The text features contain the semantic information of the question corresponding to the question vector.
As a possible implementation, the text feature extraction model may be a two-layer LSTM (Long Short-Term Memory) network.
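A minimal sketch of such a two-layer LSTM text feature extractor, written with Keras for illustration; the 18×50 input follows the question encoding above, the 2048-dimensional output matches the question-feature size reported in the experiments below, and the remaining layer sizes are assumptions.

```python
import tensorflow as tf

# Two stacked LSTM layers; the final hidden state is taken as the text feature.
text_encoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(1024, return_sequences=True, input_shape=(18, 50)),
    tf.keras.layers.LSTM(2048),  # one 2048-d text feature per question
])

question_vectors = tf.random.uniform((8, 18, 50))  # a batch of 8 encoded questions
text_features = text_encoder(question_vectors)
print(text_features.shape)  # (8, 2048)
```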
Step 104, for each vector in the question vector library, performing semantic analysis on the text features corresponding to the question vector, and determining the semantic weights corresponding to a plurality of scales.
Features of the image can be extracted at a plurality of scales. To better represent the relationship between these features and the question, semantic analysis is performed on the text features corresponding to the question vector to determine the semantic weight of each scale; the image features can then be obtained by giving different weights to the features of different scales based on the semantic weights, thereby associating the image features with the question.
In one example, the semantic analysis is performed on the text features through a semantic analysis model, which determines the semantic weights corresponding to the scales. The semantic weights indicate the relationship between the question corresponding to the text features and the features at different scales.
Step 105, for each image in the image library, inputting the image into the image feature extraction model for image feature extraction, and determining the features at a plurality of scales output by the model; the image features are then determined based on the features and the semantic weight corresponding to each of the plurality of scales.
The image features contain the features related to the semantics of the question.
As one possible implementation, the image features comprise multiple features of the same scale. The image features can exhibit the relationship between the features of different scales and the question, which facilitates a more accurate determination of the semantics of the region of the image related to the question.
In one example, for each of the plurality of scales, the features of that scale are weighted by the corresponding semantic weight; when the scale does not meet the preset scale, the weighted features are sampled to determine features meeting the preset scale, and when the scale already meets the preset scale, the weighted features are taken directly as the features meeting the preset scale. It should be appreciated that assigning different weights to features of different scales associates the features with the question, so that the feature fusion of the image features can be guided by the question to achieve an understanding of the semantics of the region of the image the question relates to.
In one example, as shown in FIG. 3, the image feature extraction model may be a VGG model.
The following description takes the VGG16 model and library images of size 448×448×3 pixels as an example.
In practice, a VGG16 model may be selected and pre-trained on ImageNet to obtain the trained VGG16 model. ImageNet is a large visual database for visual object recognition research: more than 14 million image URLs have been manually annotated to indicate the objects in the pictures, and bounding boxes are additionally provided for at least one million images.
As shown in fig. 2, each image in the image library is input into the trained VGG16 model, and features of a first scale (224×224×64 pixels), a second scale (112×112×256 pixels) and a third scale (56×56×512 pixels) are extracted. The features of the three scales represent different information: in general, the smaller the scale, the better it represents the global information of the image. For example, the features of the first scale better represent the local information of the image, while the features of the third scale better represent the global information, where global information describes the scene of the image. The features of the first scale comprise 64 maps of 224×224, the features of the second scale comprise 256 maps of 112×112, and the features of the third scale comprise 512 maps of 56×56.
Then, for the features of the first scale, the 64 224×224 maps are weighted by the semantic weight corresponding to the first scale, and the weighted maps are downsampled to determine the sampled features, which comprise 64 maps of 56×56. The features of the second and third scales are processed similarly, and the finally obtained image features comprise 832 maps of 56×56.
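The multi-scale weighting and downsampling just described might look as follows in Keras. This is a sketch under assumptions: the choice of VGG16 tap layers reproducing the three scales for a 448×448×3 input is one plausible reading rather than a detail fixed by this description, scalar semantic weights per scale stand in for the output of the semantic analysis model, and average pooling stands in for the unspecified downsampling operator.

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(448, 448, 3))
# One plausible choice of taps giving the three scales named in the text:
# 224x224x64, 112x112x256 and 56x56x512 for a 448x448x3 input.
taps = ["block1_pool", "block3_conv3", "block4_conv3"]
multi_scale = tf.keras.Model(base.input,
                             [base.get_layer(n).output for n in taps])

def weighted_image_features(image, semantic_weights):
    """Weight each scale by its semantic weight, downsample to the preset
    56x56 scale, and concatenate along channels -> 56x56x832."""
    pooled = []
    for f, w in zip(multi_scale(image), semantic_weights):
        f = w * f                       # scalar semantic weight for this scale
        size = int(f.shape[1])
        if size != 56:                  # downsample scales larger than 56x56
            f = tf.nn.avg_pool2d(f, ksize=size // 56, strides=size // 56,
                                 padding="VALID")
        pooled.append(f)
    return tf.concat(pooled, axis=-1)   # 64 + 256 + 512 = 832 channels

image = tf.random.uniform((1, 448, 448, 3))
feat = weighted_image_features(image, semantic_weights=[0.2, 0.3, 0.5])
print(feat.shape)  # (1, 56, 56, 832)
```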
Step 106, fusing, through a fusion model, the image features and the text features corresponding to each question vector in the question vector library, and determining the fusion features corresponding to the question vector, wherein the fusion features indicate the relevance between the question corresponding to the question vector and different regions in the image.
As a possible implementation, the text features and the image features are fused through the fusion model to determine the fusion features.
The fusion model fuses the features of the different modalities on the principle of an attention mechanism. Specifically, the attention mechanism treats the text features as a query over the key regions of the image features: the image features are stretched into a vector, this vector is concatenated with the vector representing the text features, a nonlinear activation is applied, and the activated result is restored to image form to obtain the attention result, i.e. the fusion features.
In one example, the image features are stretched into a first vector; the first vector and the text features are concatenated to determine a second vector; and the second vector is matrixed to determine the fusion features, whose scale is the same as that of the image features.
Taking image features comprising 832 maps of 56×56 as an example, the 832 56×56 maps are stretched into a first vector, which is concatenated with the text features to obtain a second vector; nonlinear activation of the second vector yields a 56×56 fusion feature.
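One possible reading of this attention-based fusion is sketched below: the image features are stretched, concatenated with the (tiled) text features at every spatial location, nonlinearly activated, and the scores are restored to a 56×56 map. The layer sizes and the softmax normalization are assumptions added for illustration, not details fixed by this description.

```python
import tensorflow as tf

class QuestionGuidedFusion(tf.keras.layers.Layer):
    """Sketch of the fusion step producing a 56x56 attention map."""

    def __init__(self, hidden=512):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(hidden)
        self.score = tf.keras.layers.Dense(1)

    def call(self, image_features, text_features):
        h, w, c = 56, 56, 832                                   # running example
        flat = tf.reshape(image_features, (-1, h * w, c))       # "stretch" step
        txt = tf.tile(tf.expand_dims(text_features, 1), (1, h * w, 1))
        joint = tf.concat([flat, txt], axis=-1)                 # vector splicing
        scores = self.score(tf.nn.tanh(self.hidden(joint)))    # nonlinear activation
        attn = tf.nn.softmax(scores, axis=1)                    # relevance per region
        return tf.reshape(attn, (-1, h, w))                     # restore image form

fusion = QuestionGuidedFusion()
fusion_map = fusion(tf.random.uniform((1, 56, 56, 832)),
                    tf.random.uniform((1, 2048)))
print(fusion_map.shape)  # (1, 56, 56)
```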
It should be noted that, because the fusion features indicate the relevance between the question corresponding to the question vector and different regions in the image, the fusion model enables the answer prediction model to focus on the region of the image related to the question.
In addition, in some possible cases, the fusion features and the images may be visualized to better observe whether the fusion model understands the relationship between the question and the content of the image.
Step 107, for each question vector in the question vector library, inputting the corresponding fusion features and image features into the answer prediction model for classification, and determining the probability, output by the answer prediction model, of each answer in the answer library; the answer corresponding to the question vector is then determined based on these probabilities.
As a possible implementation, as shown in fig. 3, the fusion features and the image features are input into the answer prediction model for classification and nonlinear activation, and a probability distribution is determined that indicates the probability of each answer in the answer library; the highest value is then selected to obtain the final answer.
It should be understood that the answer prediction model can quickly locate the key regions in the image from the relationship between the question and the regions of the image, and give the corresponding answer. For example, for a question about a local target, such as "how many oil storage tanks are in the image", the model can quickly focus on the oil storage tanks; for a question about scene understanding, such as "what scene does the picture show", the model pays more attention to the buildings and roads in the image and obtains the answer "residential area".
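A sketch of such an answer prediction head follows; the answer library size, the hidden layer and the region-weighted pooling are illustrative assumptions, with the weighting realizing the "focus on the question-related region" behavior described above.

```python
import tensorflow as tf

NUM_ANSWERS = 100  # hypothetical size of the answer library

# Classification head: softmax over every answer in the answer library.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(NUM_ANSWERS, activation="softmax"),
])

def predict_answer(image_features, fusion_map):
    """Focus the image features with the fusion (attention) map, classify,
    and return the index of the most probable answer plus the distribution."""
    attended = image_features * tf.expand_dims(fusion_map, -1)  # weight regions
    pooled = tf.reduce_sum(attended, axis=[1, 2])               # (batch, 832)
    probs = classifier(pooled)                                  # answer distribution
    return tf.argmax(probs, axis=-1), probs

idx, probs = predict_answer(tf.random.uniform((1, 56, 56, 832)),
                            tf.random.uniform((1, 56, 56)))
print(int(idx[0]))  # index into the answer library
```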
Notably, since the image features exhibit the relationship between the features and the question, the answer prediction model, by processing the image features, guides their fusion based on the question and finally achieves an accurate understanding of the semantics of the question-related region of the image.
In addition, in some possible cases, the questions and the answers corresponding to the question vectors may be visualized to better observe the prediction effect of the answer prediction model.
Step 108, training the image feature extraction model, the text feature extraction model, the fusion model and the answer prediction model based on the answer predicted for each question vector in the question vector library and the labeled answer, in the answer library, corresponding to that question vector.
A loss value is determined from the predicted answer and the labeled answer corresponding to each question vector by means of a preset loss function, and the parameters of the image feature extraction model, the text feature extraction model, the fusion model and the answer prediction model are updated based on the loss value. The training method itself is conventional and is not described in detail in this embodiment.
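A minimal joint training step under these assumptions (cross-entropy loss against the labeled answer index, and the Adam learning rate reported in the experiments below; `model` is a hypothetical wrapper returning the probability distribution over the answer library):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)  # rate from experiments

def train_step(model, images, question_vectors, answer_indices):
    """One joint update of all four sub-models wrapped by `model`."""
    with tf.GradientTape() as tape:
        probs = model([images, question_vectors], training=True)
        loss = loss_fn(answer_indices, probs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```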
According to the above technical scheme, the beneficial effects of this embodiment are as follows:
In the first aspect, the question types that can be asked are more diverse. Existing remote sensing image question-answer datasets cover only a few scenes, mainly residential street views and farmland, so question design can only revolve around those two scenes. The present dataset is labeled entirely manually, with more varied scene types and richer questions, and a question-answering model trained on it is therefore more capable.
In the second aspect, a question-guided image feature fusion mechanism is added. A question often contains semantic information related to scene understanding or to local targets of the image; adding a fusion mechanism guided by the semantics of the question lets the model better focus on the image features related to those semantics and improves question-answering accuracy.
In the third aspect, the attention mechanism both makes the answer prediction model focus on the image region relevant to the question and allows the model's degree of understanding of the image and the question to be visualized.
Further, the inventors verified the effect of the model provided by the embodiment of the invention. The specific experimental procedure is as follows.
1. Experimental conditions
The invention is implemented by programming in the Python language on a WINDOWS operating system with an Intel i5-9400F 2.9 GHz CPU, a GTX 1660 Ti 6 GB GPU and 16 GB of memory. The model also involves a deep learning framework; the framework adopted in the experiments is TensorFlow 1.14. The data used in the experiments are all manually labeled.
2. Experimental details
Model accuracy is assessed by calculating the proportion of correctly predicted answers among all answers. The corresponding model parameters were set as follows:
During training, the batch size was set to 8, 100 epochs were trained, dropout was set to 0.5 to prevent overfitting, and the Adam optimizer was selected with a learning rate of 0.0005. The input image size is 448×448×3 pixels and the input question length is set to 18 words. The question words are encoded into a vector, and then the input image and the question vector are fed into VGG16 and LSTM respectively, obtaining 56×56×832 image features and 2048-dimensional question features.
To verify the effectiveness of the invention, we compare against RSVQA, the remote sensing image visual question-answering model published in the IEEE TGRS journal (Lobry S., Marcos D., Murray J. and Tuia D., "RSVQA: Visual Question Answering for Remote Sensing Data", IEEE Transactions on Geoscience and Remote Sensing, 2020). The RSVQA model is simpler: its image feature extraction module uses the ResNet-152 network, its text feature extraction uses skip-thoughts, and the features are directly multiplied to obtain the fusion features. The model provided by the embodiment of the invention instead introduces question guidance to guide the fusion of image features of different scales, so that the extracted image features include the visual features related to the question semantics, which is more suitable for visual question answering on remote sensing images. The accuracies of the final models are compared in table 1 below:
TABLE 1 (question-answering accuracy of the model of this embodiment compared with RSVQA; the table content is not recoverable from this text)
As can be seen from table 1, the model provided by the embodiment of the invention achieves higher accuracy than RSVQA. Because question guidance is introduced to guide the fusion of image features, the visual features related to the question semantics are obtained more effectively, so the question-answering accuracy is higher and the question types are richer. The method is therefore better suited to visual question answering on remote sensing images.
As shown in FIG. 4, an embodiment of the present invention provides an image question-answering method. The method can be applied to an electronic device, in particular a server or a general-purpose computer. In this embodiment, the method includes the following steps:
step 401, determining an image.
In practical application, an image library, a question library and an answer library are designed. Correspondingly, the image is an image in the image library, i.e. an image whose questions are in the question library and whose answers are in the answer library.
As a possible implementation manner, the image may be a remote sensing image, for example, may be a residential area image, a farmland image, a forest image, or the like, which is not limited in this embodiment, and may be specifically determined in combination with actual requirements.
Step 402, determining the question vector corresponding to the question about the image.
In practical application, a word vector library is determined based on the designed question library and answer library, and the question about the image is encoded into a question vector based on the word vectors in the word vector library. The details are described above and are not repeated here.
For example, the question about the image may be a local question, typically one related to a target in the image.
Alternatively, the question may be a global question, typically one related to the content the image presents as a whole, such as "what is the scene of the image" or "what season is the vegetation in the image in".
Step 403, extracting text features from the question vector, and determining the text features corresponding to the question vector.
As a possible implementation, the text features of the question vector are extracted through the text feature extraction model, and the output of the model is taken as the text features corresponding to the question vector.
The details are described above and are not repeated here.
Step 404, extracting image features from the image, and determining the image features corresponding to the image.
As a possible implementation, image features are extracted from the image to determine a plurality of first features of different scales; semantic analysis is performed on the text features to determine the semantic weight corresponding to each first feature; and the image features are determined based on the plurality of first features and their corresponding semantic weights. In this implementation, because question-guided attention is introduced to guide the features, the image features related to the question can be obtained more effectively. The image features indicate the relationship between the features of different scales and the question: for example, the larger the semantic weight of a scale, the higher the correlation between the features at that scale and the question, which facilitates a more accurate determination of the semantics of the region of the image related to the question.
In one example, the image features include a plurality of second features satisfying a preset scale and can be determined as follows:
for each of the plurality of first features, the first feature is weighted by its corresponding semantic weight; when the scale of the first feature does not meet the preset scale, the weighted first feature is sampled to determine a second feature meeting the preset scale, and when the scale of the first feature meets the preset scale, the weighted first feature is taken as the second feature.
In practical application, the image is input into the image feature extraction model, and the image features output by the model are determined.
The details are described above and are not repeated here.
Step 405, fusing the text features and the image features to determine the fusion features, wherein the fusion features indicate the relevance of different regions in the image to the question.
In this step, by analyzing the text features and the image features, the relationship between the regions of the image and the question is found, so that the answer to the question can then be understood from the content of the related region.
As one possible implementation, the image features are stretched into a first vector; vector concatenation is performed on the first vector and the text features to determine a second vector; and the second vector is matrixed to determine the fusion features, whose scale is the same as that of the image features.
The details are described above and are not repeated here.
Step 406, classifying based on the fusion features and the image features to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
In this step, through the relationship between the regions of the image and the question, the region of the image related to the question can be analyzed and understood, so that the question can be answered more accurately.
In practical application, an answer library is established in which each answer has a different answer number.
As a possible implementation, classification is performed based on the image features to determine the semantics of different regions in the image; attention is then paid to the semantics of the different regions based on the fusion features, so as to understand the semantics of the region related to the question and determine the answer to the question about the image.
In one example, the fusion features and the image features are input into the answer prediction model for classification, the classification result is nonlinearly activated, and a probability distribution is determined that indicates the probability corresponding to each answer number in the answer library. The answer to the question about the image is then determined from the probability distribution: for example, the probabilities corresponding to the preset answer numbers are sorted to obtain the highest value, whose number is looked up in the answer library to obtain the answer to the question about the image.
According to the above technical scheme, the beneficial effect of this embodiment is as follows:
the text features and the image features can be fused to determine the relationship between different regions in the image and the question, and answer prediction is performed based on this relationship, thereby improving question-answering accuracy.
Based on the same concept as the method embodiment of the present invention, referring to fig. 5, an embodiment of the present invention further provides an image question-answering apparatus, which includes:
an image determining module 501, for determining an image;
a vector determining module 502, for determining a question vector corresponding to the question about the image;
a text feature extraction module 503, for extracting text features from the question vector and determining the text features corresponding to the question vector;
an image feature extraction module 504, for extracting image features from the image and determining the image features corresponding to the image;
a fusion module 505, for fusing the text features and the image features and determining the fusion features, wherein the fusion features indicate the features of the region of the image related to the question;
and an answer prediction module 506, for classifying based on the fusion features and the image features to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
As a possible implementation, the image feature extraction module 504 includes a first feature determining unit, a semantic analysis unit and a second feature determining unit, wherein:
the first feature determining unit is used for extracting image features from the image and determining a plurality of first features of different scales;
the semantic analysis unit is used for performing semantic analysis on the text features and determining the semantic weight corresponding to each first feature;
the second feature determining unit is used for determining the image features based on the plurality of first features and their corresponding semantic weights.
As a possible implementation, the image features include a plurality of second features satisfying a preset scale, and the second feature determining unit includes a weighted sampling subunit, wherein:
the weighted sampling subunit is configured, for each of the plurality of first features, to weight the first feature by its corresponding semantic weight; when the scale of the first feature does not meet the preset scale, the weighted first feature is sampled to determine a second feature meeting the preset scale, and when the scale of the first feature meets the preset scale, the weighted first feature is taken as the second feature.
As a possible implementation, the answer prediction module 506 includes an image understanding unit and an answer understanding unit, wherein:
the image understanding unit is used for classifying based on the image features and determining the semantics of different regions in the image;
the answer understanding unit is used for paying attention to the semantics of different regions in the image based on the fusion features, so as to understand the semantics of the region related to the question and determine the answer to the question about the image.
As a possible implementation, the fusion module 505 includes a stretching unit, a concatenation unit and a matrixing unit, wherein:
the stretching unit is used for stretching the image features into a first vector;
the concatenation unit is used for performing vector concatenation on the first vector and the text features and determining a second vector;
the matrixing unit is used for matrixing the second vector and determining the fusion features, whose scale is a second preset scale.
As a possible implementation, the question is a global question or a local question.
As a possible implementation, the local question indicates a target in the image and the global question indicates the scene of the image.
As a possible implementation, the apparatus further includes a question-answer model determining module, wherein:
the question-answer model determining module is used for determining a question-answer model which includes an image feature extraction model, a text feature extraction model, a fusion model and an answer prediction model. The image feature extraction model is used for extracting image features from the image and determining the image features corresponding to the image; the text feature extraction model is used for extracting text features from the question vector and determining the text features corresponding to the question vector; the fusion model is used for fusing the text features and the image features and determining the fusion features, wherein the fusion features indicate the features of the region of the image related to the question; and the answer prediction model is used for detecting the region related to the question in the fusion features and determining the answer to the question about the image.
As a possible implementation, the apparatus further includes a labeled-answer determining module and a training module, wherein:
the labeled-answer determining module is used for determining the labeled answer corresponding to the question about the image;
and the training module is used for training the question-answer model based on the error between the predicted answer to the question about the image and the labeled answer.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. At the hardware level, the electronic device comprises a processor 601 and a memory 602 storing execution instructions, and optionally an internal bus 603 and a network interface 604. The memory 602 may include a memory 6021, such as a Random-Access Memory (RAM), and may further include a non-volatile memory 6022, such as at least one disk memory. The processor 601, the network interface 604 and the memory 602 may be connected to each other through the internal bus 603, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The internal bus 603 may be divided into an address bus, a data bus and a control bus; for convenience of illustration it is represented by only one double-headed arrow in fig. 6, but this does not mean there is only one bus or one type of bus. Of course, the electronic device may also include the hardware required for other services. When the processor 601 executes the execution instructions stored in the memory 602, the processor 601 performs the method of any one of the embodiments of the present invention, and at least the method shown in fig. 1 or fig. 4.
In one possible implementation, the processor reads the corresponding execution instructions from the non-volatile memory into the memory and then executes them; the corresponding execution instructions may also be obtained from other devices, so as to form the image question-answering apparatus at the logic level. By executing the execution instructions stored in the memory, the processor implements the image question-answering method provided in any embodiment of the present invention.
The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps and logical blocks disclosed in the embodiments of the present invention may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The embodiment of the invention also provides a computer-readable storage medium comprising execution instructions; when the processor of the electronic device executes these instructions, the processor performs the method provided in any embodiment of the invention. The electronic device may specifically be the electronic device shown in fig. 6, and the execution instructions are the computer program corresponding to the image question-answering apparatus.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware aspects.
The embodiments of the present invention are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is merely exemplary of the present invention and is not intended to limit the present invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are to be included in the scope of the claims of the present invention.

Claims (8)

1. An image question-answering method, the method comprising:
determining an image, wherein the image is a remote sensing image;
determining a question vector corresponding to a question about the image;
extracting text features from the question vector, and determining the text features corresponding to the question vector;
extracting image features from the image: determining a plurality of first features of different scales, performing semantic analysis on the text features, determining the semantic weight corresponding to each first feature, and determining the image features corresponding to the image based on the plurality of first features and their corresponding semantic weights, wherein the first features of different scales are maps representing different information of the image, and the image features contain the features related to the semantics of the question;
fusing the text features and the image features to determine fusion features; wherein, based on the principle of an attention mechanism, the fusion features indicate the relevance of different regions in the image to the question, the fusion features being an attention result;
classifying based on the image features, and determining the semantics of different regions in the image; and paying attention to the semantics of the different regions in the image based on the fusion features, so as to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
2. The method of claim 1, wherein the image features comprise a plurality of second features satisfying a preset scale, and determining the image features based on the plurality of first features and their corresponding semantic weights comprises:
for each of the plurality of first features, weighting the first feature by its corresponding semantic weight; when the scale of the first feature does not meet the preset scale, sampling the weighted first feature to determine a second feature meeting the preset scale; and when the scale of the first feature meets the preset scale, taking the weighted first feature as the second feature.
3. The method of claim 1, wherein fusing the text features and the image features to determine the fusion features comprises:
stretching the image features into a first vector;
performing vector concatenation based on the first vector and the text features, and determining a second vector;
matrixing the second vector to determine the fusion features; the scale of the fusion features is the same as the scale of the image features.
4. The method of claim 1, wherein the question is a global question or a local question.
5. The method of claim 4, wherein the local question is a question related to a target in the image and the global question is a question related to the scene of the image.
6. The method according to claim 1, wherein the method further comprises:
determining a question-answer model, wherein the question-answer model comprises an image feature extraction model, a text feature extraction model, a fusion model and an answer prediction model; the text feature extraction model is used for extracting text features from the question vector and determining the text features corresponding to the question vector; the image feature extraction model is used for extracting image features from the image and determining the image features corresponding to the image; the fusion model is used for fusing the text features and the image features and determining the fusion features; and the answer prediction model is used for classifying based on the fusion features and the image features to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
7. The method of claim 6, wherein the method further comprises:
determining a labeled answer corresponding to the question about the image;
and training the question-answer model based on the error between the answer to the question about the image and the labeled answer.
8. An image question-answering apparatus, comprising:
an image determining module, for determining an image, wherein the image is a remote sensing image;
a vector determining module, for determining a question vector corresponding to the question about the image;
a text feature extraction module, for extracting text features from the question vector and determining the text features corresponding to the question vector;
an image feature extraction module, for extracting image features from the image and determining the image features corresponding to the image: determining a plurality of first features of different scales, performing semantic analysis on the text features, determining the semantic weight corresponding to each first feature, and determining the image features corresponding to the image based on the plurality of first features and their corresponding semantic weights, wherein the first features of different scales are maps representing different information of the image, and the image features contain the features related to the semantics of the question;
a fusion module, for fusing the text features and the image features and determining the fusion features, wherein the fusion features indicate the features of the region of the image related to the question, the fusion features being an attention result;
and an answer prediction module, for classifying based on the image features, determining the semantics of different regions in the image, paying attention to the semantics of the different regions in the image based on the fusion features so as to understand the semantics of the region of the image related to the question, and determining the answer to the question about the image.
CN202210061065.1A 2022-01-19 2022-01-19 Image question and answer method and device Active CN114417044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061065.1A CN114417044B (en) 2022-01-19 2022-01-19 Image question and answer method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061065.1A CN114417044B (en) 2022-01-19 2022-01-19 Image question and answer method and device

Publications (2)

Publication Number Publication Date
CN114417044A CN114417044A (en) 2022-04-29
CN114417044B (en) 2023-05-26

Family

ID=81274989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061065.1A Active CN114417044B (en) 2022-01-19 2022-01-19 Image question and answer method and device

Country Status (1)

Country Link
CN (1) CN114417044B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN113392288A (en) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 Visual question answering and model training method, device, equipment and storage medium thereof

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679582A (en) * 2017-10-20 2018-02-09 深圳市唯特视科技有限公司 A kind of method that visual question and answer are carried out based on multi-modal decomposition model
CN110134774B (en) * 2019-04-29 2021-02-09 华中科技大学 Image visual question-answering model, method and system based on attention decision
CN110717431B (en) * 2019-09-27 2023-03-24 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
US11288324B2 (en) * 2020-01-22 2022-03-29 Adobe Inc. Chart question answering
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
CN113420833B (en) * 2021-07-21 2023-12-26 南京大学 Visual question answering method and device based on semantic mapping of questions
CN113886626B (en) * 2021-09-14 2024-02-02 西安理工大学 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN113392288A (en) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 Visual question answering and model training method, device, equipment and storage medium thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a visual question-answering model based on composite image-text features; Qiu Nan et al.; Application Research of Computers; Vol. 38, No. 8, pp. 2293-2298 *

Also Published As

Publication number Publication date
CN114417044A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
RU2701995C2 (en) Automatic determination of set of categories for document classification
Kang et al. Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN112784066B (en) Knowledge graph-based information feedback method, device, terminal and storage medium
CN111598183A (en) Multi-feature fusion image description method
CN111428513A (en) False comment analysis method based on convolutional neural network
CN115098706A (en) Network information extraction method and device
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN113516094A (en) System and method for matching document with review experts
CN114417044B (en) Image question and answer method and device
CN114359741B (en) Regional feature-based image description model attention mechanism evaluation method and system
CN116522942A (en) Chinese nested named entity recognition method based on character pairs
Zuo et al. Automatic outlier detection using hidden Markov model for cerebellar lobule segmentation
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN114820212A (en) Discontinuous disease entity extraction method and device and electronic equipment
CN114332288A (en) Method for generating text generation image of confrontation network based on phrase driving and network
CN114067343A (en) Data set construction method, model training method and corresponding device
CN112925983A (en) Recommendation method and system for power grid information
Klaiman et al. DocReader: bounding-box free training of a document information extraction model
Chaudhuri et al. Designing deep-network based novelty assessment model in Design education
CN114238663B (en) Knowledge graph analysis method and system for material data, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant