CN111783809B - Image description generation method, device and computer-readable storage medium

Info

Publication number
CN111783809B
Authority
CN
China
Prior art keywords
target
image
feature
target frame
global
Legal status
Active
Application number
CN201910841842.2A
Other languages
Chinese (zh)
Other versions
CN111783809A
Inventor
Yingwei Pan (潘滢炜)
Ting Yao (姚霆)
Tao Mei (梅涛)
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910841842.2A
Publication of CN111783809A
Application granted
Publication of CN111783809B
Legal status: Active

Classifications

    • G06F18/253 Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image description generation method and apparatus and a computer-readable storage medium, and relates to the field of computer technology. The method of the present disclosure comprises: constructing a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, wherein the nodes of the semantic tree correspond to the targets, the target frames, and the image, respectively; performing feature fusion with a tree-structured long short-term memory (Tree-LSTM) network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, and determining the fused target frame features and the fused image global feature, where a target frame feature is a feature of the image within the target frame of a target; and determining the description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.

Description

Image description generation method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for generating an image description, and a computer readable storage medium.
Background
Automatic image description generation aims to let a machine understand an image and automatically generate description text for it.
At present, generating image descriptions with recurrent deep neural networks is a common approach in academia.
Disclosure of Invention
The inventors have realized that current image description generation methods produce inaccurate description text; for example, some targets are not described, or the relationships among targets are not reflected.
One technical problem to be solved by the present disclosure is: how to improve the accuracy of image descriptions.
According to some embodiments of the present disclosure, an image description generation method is provided, comprising: constructing a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, wherein the nodes of the semantic tree correspond to the targets, the target frames, and the image, respectively; performing feature fusion with a tree-structured long short-term memory (Tree-LSTM) network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, and determining the fused target frame features and the fused image global feature, where a target frame feature is a feature of the image within the target frame of a target; and determining the description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.
In some embodiments, building a semantic tree of an image includes: configuring a node corresponding to the image as a root node of the semantic tree; configuring nodes corresponding to each target frame as middle layer nodes of the semantic tree; configuring the nodes corresponding to the targets as leaf nodes of the semantic tree; the leaf nodes corresponding to the targets are configured as child nodes of the target frames corresponding to the targets.
In some embodiments, configuring the nodes corresponding to the target frames as middle-layer nodes of the semantic tree includes: arranging the target frames in descending order of area; sequentially taking the nodes corresponding to the target frames as nodes to be added according to the arrangement order; determining the overlap area between the target frame corresponding to the node to be added and the target frame corresponding to each added node; and, if the overlap area of the target frame corresponding to the node to be added exceeds a threshold, configuring the node to be added as a child node of that added node, otherwise configuring the node to be added as a child node of the root node.
In some embodiments, feature fusion using a tree long and short time memory network comprises: starting from a layer where leaf nodes of the semantic tree are located, inputting features corresponding to all child nodes belonging to the same father node and features corresponding to the father node into a tree-shaped long-short-time memory network to obtain fused features corresponding to the output father node, and updating the features corresponding to the father node into the fused features; sequentially updating the corresponding characteristics of the nodes of each layer according to the sequence from bottom to top; according to the updated characteristics corresponding to each node, determining the characteristics of each target frame after fusion and the global characteristics of the image after fusion; when the root node is used as a father node, carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; and weighting the first global target feature and the first global target frame feature to obtain the feature corresponding to the input root node.
In some embodiments, generating a model using image description based on each target feature, each fused target frame feature, and the fused image global feature, determining description text of the image includes: combining the target feature with the corresponding target frame feature and the corresponding fused target frame feature aiming at each target feature to obtain a first target local feature; carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; combining the first global target feature, the first global target frame feature and the fused image global feature to obtain a combined first global image expression feature; and inputting the first global image expression features and each first target local feature into an image description generation model to obtain the description text of the output image.
In some embodiments, inputting the first global image expression feature and each first target local feature into the image description generation model to obtain the output description text of the image comprises: merging the feature of the description word at the current moment, the first global image expression feature, and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model, and inputting the merged feature into the first-layer long short-term memory network of the image description generation model; inputting the feature output by the first-layer long short-term memory network and each first target local feature into an attention mechanism module; and merging the feature output by the attention mechanism module with the feature output by the first-layer long short-term memory network and inputting the merged feature into the second-layer long short-term memory network to obtain the output description word at the next moment.
In some embodiments, generating a model using image description based on each target feature, each fused target frame feature, and the fused image global feature, determining description text of the image includes: inputting each target feature into a graph convolution network to obtain each target feature after being output and updated; inputting the fused target frame characteristics into a graph convolution network to obtain output updated fused target frame characteristics; inputting each target frame characteristic into a graph convolution network to obtain each output updated target frame characteristic; combining the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature aiming at each updated target feature to obtain a second target local feature; carrying out average pooling operation on each updated target feature to obtain a second global target feature; carrying out average pooling operation on each updated target frame feature to obtain a second global target frame feature; carrying out average pooling operation on each updated fused target frame characteristic to obtain a third global target frame characteristic; combining the second global target feature, the second global target frame feature and the third global target frame feature to obtain a combined second global image expression feature; and inputting the second global image expression features and each second target local feature into an image description generation model to obtain the description text of the output image.
In some embodiments, inputting the second global image representation feature and each second target local feature into the image description generation model, obtaining the description text of the output image comprises: inputting the characteristics of the description words at the current moment, the second global image expression characteristics and the characteristics output at the moment on a second layer long-short-time memory network of the image description generation model into the first layer long-short-time memory network of the image description generation model; the first layer of long-short-term memory network output characteristics and each second target local characteristic are input into an attention mechanism module; and combining the characteristics output by the attention mechanism module with the characteristics output by the first layer of long-short-time memory network, and inputting the combined characteristics into the second layer of long-short-time memory network to obtain the description word of the next moment of output.
In some embodiments, the method further comprises: performing target detection on the image to obtain each target frame and each segmentation area in the image; extracting features from the images in each target frame to obtain target frame features of each target frame in the output image; and extracting features from the images in each partitioned area to obtain target features of each target in the output image.
In some embodiments, extracting features from the images in each of the segmented regions, resulting in target features for each target in the output image includes: for the image in each divided area, the image in the divided area is set to white, and the image of the other part is set to black as a binarized image; superposing the binarized image with the original image to obtain an image with the background removed; and inputting the image with the background removed into an object detector to obtain target characteristics of targets in the output sub-regions.
In some embodiments, the method further comprises: obtaining a training sample, wherein the training sample comprises: a sample image and a description text corresponding to the sample image; acquiring target characteristics of each target in the sample image, and target frame characteristics of each target frame; according to each target in the sample image, the target frame of each target and the relation of the sample image, constructing a semantic tree of the sample image; according to the semantic tree, the target characteristics of each target in the sample image, the target frame characteristics of each target frame, the tree-shaped long-short time memory network to be trained and the image description generation model to be trained are trained.
In some embodiments, training the tree long and short time memory network to be trained and the image description generation model to be trained comprises: according to the relation of each node in the semantic tree of the sample image, the target characteristics of each target corresponding to the node and the target frame characteristics of each target frame corresponding to the node, performing characteristic fusion by utilizing a tree-shaped long-short time memory network to be trained, and determining each target frame characteristic after sample image fusion and the image global characteristic after fusion; according to each target feature of the sample image, each fused target frame feature and each fused image global feature, generating a model by utilizing image description to be trained, and determining a description text of the output sample image; and adjusting parameters of the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the description text of the output sample image and the description text of the marked sample image until the preset convergence condition is met, so that training of each model is completed.
According to other embodiments of the present disclosure, an image description generation apparatus is provided, comprising: a semantic tree construction module configured to construct a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, wherein the nodes of the semantic tree correspond to the targets, the target frames, and the image, respectively; a feature fusion module configured to perform feature fusion with a tree-structured long short-term memory network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, and to determine the fused target frame features and the fused image global feature, where a target frame feature is a feature of the image within the target frame of a target; and a description generation module configured to determine the description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.
In some embodiments, the semantic tree construction module is configured to configure a node corresponding to the image as a root node of the semantic tree; configuring nodes corresponding to each target frame as middle layer nodes of the semantic tree; configuring the nodes corresponding to the targets as leaf nodes of the semantic tree; the leaf nodes corresponding to the targets are configured as child nodes of the target frames corresponding to the targets.
In some embodiments, the semantic tree construction module is configured to arrange the target frames in order of area from large to small; sequentially taking the nodes corresponding to the targets as nodes to be added according to the arrangement sequence; determining the area of an overlapping area of a target frame corresponding to the node to be added and a target frame corresponding to each added node; and under the condition that the area of the overlapping area of the target frame corresponding to the node to be added exceeds a threshold value, configuring the node to be added as a child node of the added node, otherwise, configuring the node to be added as a child node of the root node.
In some embodiments, the feature fusion module is configured to, starting from a layer where a leaf node of the semantic tree is located, input features corresponding to all child nodes belonging to the same parent node and features corresponding to the parent node into the tree-shaped long-short-term memory network to obtain fused features corresponding to the output parent node, and update the features corresponding to the parent node to the fused features; sequentially updating the corresponding characteristics of the nodes of each layer according to the sequence from bottom to top; according to the updated characteristics corresponding to each node, determining the characteristics of each target frame after fusion and the global characteristics of the image after fusion; when the root node is used as a father node, carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; and weighting the first global target feature and the first global target frame feature to obtain the feature corresponding to the input root node.
In some embodiments, the description generating module is configured to combine, for each target feature, the target feature with a corresponding target frame feature and a corresponding fused target frame feature to obtain a first target local feature; carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; combining the first global target feature, the first global target frame feature and the fused image global feature to obtain a combined first global image expression feature; and inputting the first global image expression features and each first target local feature into an image description generation model to obtain the description text of the output image.
In some embodiments, the description generation module is configured to merge the feature of the description word at the current moment, the first global image expression feature, and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model, and to input the merged feature into the first-layer long short-term memory network of the image description generation model; to input the feature output by the first-layer long short-term memory network and each first target local feature into an attention mechanism module; and to merge the feature output by the attention mechanism module with the feature output by the first-layer long short-term memory network and input the merged feature into the second-layer long short-term memory network to obtain the output description word at the next moment.
In some embodiments, the description generation module is configured to input each target feature into the graph convolution network to obtain each target feature after the updating of the output; inputting the fused target frame characteristics into a graph convolution network to obtain output updated fused target frame characteristics; inputting each target frame characteristic into a graph convolution network to obtain each output updated target frame characteristic; combining the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature aiming at each updated target feature to obtain a second target local feature; carrying out average pooling operation on each updated target feature to obtain a second global target feature; carrying out average pooling operation on each updated target frame feature to obtain a second global target frame feature; carrying out average pooling operation on each updated fused target frame characteristic to obtain a third global target frame characteristic; combining the second global target feature, the second global target frame feature and the third global target frame feature to obtain a combined second global image expression feature; and inputting the second global image expression features and each second target local feature into an image description generation model to obtain the description text of the output image.
In some embodiments, the description generating module is configured to input the feature of the description word at the current moment, the second global image expression feature, and the feature output at a moment on the second layer long-short-time memory network of the image description generating model, into the first layer long-short-time memory network of the image description generating model; the first layer of long-short-term memory network output characteristics and each second target local characteristic are input into an attention mechanism module; and combining the characteristics output by the attention mechanism module with the characteristics output by the first layer of long-short-time memory network, and inputting the combined characteristics into the second layer of long-short-time memory network to obtain the description word of the next moment of output.
In some embodiments, the apparatus further comprises: the feature extraction module is used for carrying out target detection on the image to obtain each target frame and each segmentation area in the image; extracting features from the images in each target frame to obtain target frame features of each target frame in the output image; and extracting features from the images in each partitioned area to obtain target features of each target in the output image.
In some embodiments, the feature extraction module is configured to set, for each image in the divided areas, the image in the divided areas to white and the image of the other part to black as the binarized image; superposing the binarized image with the original image to obtain an image with the background removed; and inputting the image with the background removed into an object detector to obtain target characteristics of targets in the output sub-regions.
In some embodiments, the apparatus further comprises: the training module is used for obtaining training samples, and the training samples comprise: a sample image and a description text corresponding to the sample image; acquiring target characteristics of each target in the sample image, and target frame characteristics of each target frame; according to each target in the sample image, the target frame of each target and the relation of the sample image, constructing a semantic tree of the sample image; according to the semantic tree, the target characteristics of each target in the sample image, the target frame characteristics of each target frame, the tree-shaped long-short time memory network to be trained and the image description generation model to be trained are trained.
In some embodiments, the training module is configured to perform feature fusion by using a tree-shaped long-short time memory network to be trained according to a relationship of each node in a semantic tree of the sample image, the target feature of each target corresponding to the node, and the target frame feature of each target frame corresponding to the node, and determine each target frame feature after sample image fusion and a global feature of the image after fusion; according to each target feature of the sample image, each fused target frame feature and each fused image global feature, generating a model by utilizing image description to be trained, and determining a description text of the output sample image; and adjusting parameters of the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the description text of the output sample image and the description text of the marked sample image until the preset convergence condition is met, so that training of each model is completed.
According to still further embodiments of the present disclosure, there is provided an image description generating apparatus including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the method of generating an image description of any of the embodiments described above.
According to still further embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of generating an image description of any of the foregoing embodiments.
The method and the device divide the image into different layers of semantic information such as targets, target frames and images, construct a semantic tree of the image according to the relation among the targets, the target frames of the targets and the images, and embody the semantic information of the different layers. Further, feature fusion is carried out by utilizing a tree long-short time memory network according to the semantic tree, so that each fused target frame feature and the fused image global feature are obtained, and the fused features reflect the relation of semantic information of different layers. And finally, generating a model by utilizing image description, and obtaining a description text of the image by utilizing each target feature, each fused target frame feature and the fused image global feature. The scheme of the present disclosure utilizes the semantic tree to mine and embody rich and multi-level semantic information of the image, and further obtains the relation between the features of different levels based on the semantic tree, so that the image description generation model can understand the multi-level semantic information of the image, and the generated description text information is richer and more accurate.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 shows a flow diagram of a method of generating an image description of some embodiments of the present disclosure.
Fig. 2A illustrates a schematic diagram of a semantic tree of an image according to some embodiments of the present disclosure.
Fig. 2B illustrates a structural schematic diagram of an image description generation model of some embodiments of the present disclosure.
FIG. 2C illustrates a schematic diagram of an image description generation model of further embodiments of the present disclosure.
Fig. 3 shows a flow diagram of a method of generating an image description of further embodiments of the present disclosure.
Fig. 4 shows a flow diagram of a method of generating an image description of further embodiments of the present disclosure.
Fig. 5 illustrates a schematic structural diagram of an image description generating apparatus of some embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of an image description generating apparatus of other embodiments of the present disclosure.
Fig. 7 shows a schematic structural diagram of an image description generating apparatus of further embodiments of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
To address the problem that existing image description generation methods are inaccurate, the present disclosure provides the following scheme. Some embodiments of the image description generation method of the present disclosure are described below in conjunction with Fig. 1.
FIG. 1 is a flow chart of some embodiments of a method of generating an image description of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S106.
In step S102, a semantic tree of the image is constructed according to the relationships among the targets in the image, the target frames of the targets, and the image.
The targets and target frames may be obtained by a target detection method, which will be described in detail in the following embodiments. A target frame is the bounding box obtained when a target is detected, and can be understood as the image region within that bounding box. Each target corresponds to one target frame, and the image in the target frame is part of the global image.
Each node of the semantic tree corresponds to each target, each target frame and image respectively. In some embodiments, the node to which the image corresponds is configured as the root node of the semantic tree. And configuring the nodes corresponding to the target frames as middle layer nodes of the semantic tree. And configuring the nodes corresponding to the targets as leaf nodes of the semantic tree. Each leaf node corresponding to the target is configured as a child node of the target frame corresponding to the target. The image (global image or a part of global image) corresponding to the parent node contains the image (a part of global image) corresponding to the child node. The image in the target frame includes the target and background images.
In some embodiments, the target frames are arranged in descending order of area. The nodes corresponding to the target frames are sequentially taken as nodes to be added according to this order. The overlap area between the target frame corresponding to the node to be added and the target frame corresponding to each added node is determined. If the overlap area exceeds a threshold, the node to be added is configured as a child node of that added node; otherwise, it is configured as a child node of the root node.
As shown in Fig. 2A, which illustrates a semantic tree of an image, the global image serves as the root node of the tree, and the targets include: a tree, a person, and the hat and glasses worn by the person; these targets serve as leaf nodes. The target frames of the person, the tree, the hat and the glasses serve as middle-layer nodes. With the target frames arranged in descending order of area, the node corresponding to the target frame of the tree is first taken as the node to be added and is added as a child node of the root node. Next, the node corresponding to the target frame of the person is added as a child node of the root node. The node corresponding to the target frame of the hat is then taken as the node to be added; since the overlap area between the target frame of the hat and the target frame of the person exceeds the threshold, it is added as a child node of the node corresponding to the target frame of the person. Similarly, the node corresponding to the target frame of the glasses is added as a child node of the node corresponding to the target frame of the person. Finally, the node corresponding to each target is added as a child node of its corresponding target frame. It can be seen that the image corresponding to a parent node and the image corresponding to a child node are in a containing/contained relationship, and the semantic tree thus divides the image into semantic information of different levels.
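The construction procedure just described can be sketched as follows. This is a minimal illustration, assuming each detected target frame exposes an area, its associated target, and a pairwise overlap-area computation; the names (Node, build_semantic_tree, overlap_area) are illustrative and not part of the disclosure.

```python
class Node:
    def __init__(self, payload):
        self.payload = payload          # the image, a target frame, or a target
        self.children = []

def build_semantic_tree(image, frames, overlap_threshold):
    """frames: detected target frames, each assumed to have .area, .target
    and an overlap_area(other) method returning the overlap area."""
    root = Node(image)                  # root node corresponds to the global image
    frame_nodes = []
    for frame in sorted(frames, key=lambda f: f.area, reverse=True):   # descending area
        node = Node(frame)
        # attach under an already-added frame whose overlap with this frame
        # exceeds the threshold; otherwise attach directly under the root
        parent = next((n for n in frame_nodes
                       if frame.overlap_area(n.payload) > overlap_threshold), root)
        parent.children.append(node)
        frame_nodes.append(node)
    for node in frame_nodes:            # each target is a leaf under its own frame node
        node.children.append(Node(node.payload.target))
    return root
```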
In step S104, according to the relation of each node in the semantic tree, the target features of each target corresponding to the node and the target frame features of each target frame corresponding to the node, feature fusion is performed by using the tree long-short time memory network, and each fused target frame feature and the fused image global feature are determined.
The object frame features are features of the in-frame image of each object. The target features of the respective targets, the target frame features of the respective target frames may be extracted by a model such as an object detector, and the like, which will be described in the following embodiments.
Feature fusion is carried out by utilizing a Tree long and short time memory network (Tree-LSTM), namely, the features corresponding to all nodes in the semantic Tree are encoded, the features corresponding to all the nodes are formed into a Tree sequence according to the structure of the semantic Tree, the Tree long and short time memory network is input, and the relation of all the nodes is combined in the encoding process, so that the fused features comprise the relation of different nodes.
In some embodiments, starting from a layer where leaf nodes of the semantic tree are located, inputting features corresponding to all child nodes belonging to the same parent node and features corresponding to the parent node into a tree-shaped long-short-time memory network to obtain fused features corresponding to the output parent node, and updating the features corresponding to the parent node into the fused features. And repeating the process according to the sequence from bottom to top, and sequentially updating the corresponding characteristics of the nodes of each layer. And determining the characteristics of each target frame after fusion and the global characteristics of the image after fusion according to the characteristics corresponding to each updated node. For the nodes corresponding to each target, the characteristics corresponding to the nodes represent target characteristics, for the nodes corresponding to each target frame, the characteristics corresponding to the nodes represent target frame characteristics, and for the nodes corresponding to the image, the characteristics corresponding to the nodes represent image global characteristics.
Similar to an LSTM, the Tree-LSTM comprises a memory cell c_j, a hidden state h_j, an input gate i_j and an output gate o_j, where j is a positive integer indexing a node in the semantic tree. Unlike the LSTM, which updates its memory cell based only on the previous hidden state, the memory cell corresponding to a parent node in the Tree-LSTM is updated based on the hidden states of all child nodes of that parent node. Each child node in the Tree-LSTM also has a forget gate f_jk, where k is a positive integer indexing a child node under the same parent node. For node j in the semantic tree, x_j and h_j denote the input feature and the output fused feature, respectively, and may be represented as feature vectors; the set of child nodes of node j is denoted C(j). W denotes the input weight matrices, U denotes the recurrent weight matrices, and b denotes the biases. The sigmoid function σ and the hyperbolic tangent function tanh are element-wise nonlinear activation functions, and ⊙ denotes the element-wise product of two vectors. The Tree-LSTM performs feature fusion, i.e., the update process, according to the following formulas:

h̃_j = Σ_{k∈C(j)} h_k (child hidden-state sum) (1)

i_j = σ(W_i x_j + U_i h̃_j + b_i) (input gate) (2)

o_j = σ(W_o x_j + U_o h̃_j + b_o) (output gate) (3)

f_jk = σ(W_f x_j + U_f h_k + b_f) (forget gate) (4)

u_j = tanh(W_u x_j + U_u h̃_j + b_u) (cell input) (5)

c_j = u_j ⊙ i_j + Σ_{k∈C(j)} c_k ⊙ f_jk (cell state) (6)

h_j = o_j ⊙ tanh(c_j) (hidden state) (7)

For the node corresponding to each target and the node corresponding to each target frame, the target feature and the target frame feature are respectively used as the input x_j of the node, where i is a positive integer indexing the target. An average pooling operation is performed on the target features to obtain a first global target feature, and an average pooling operation is performed on the target frame features to obtain a first global target frame feature; the first global target feature and the first global target frame feature are weighted and summed to obtain the input feature corresponding to the root node.
The features corresponding to the nodes of each layer are updated from bottom to top through the Tree-LSTM, whereby the target frame features are enhanced with the target features and/or the contextual cues mined from the finer-grained features, yielding the fused target frame features, and the fused image global feature I_h is endowed with multi-level information about the targets, the target frames and the whole image.
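For concreteness, the child-sum Tree-LSTM update described by formulas (1) to (7) can be sketched in code. The snippet below is an illustrative NumPy implementation under the notation above; the per-gate parameterization (dicts W, U, b keyed by gate) is an assumption for illustration, not the specific network disclosed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(x_j, children, W, U, b):
    """One child-sum Tree-LSTM update for node j.

    x_j      : input feature vector of node j
    children : list of (h_k, c_k) pairs for the child nodes C(j)
    W, U, b  : dicts of input weights, recurrent weights and biases
               for gates 'i', 'o', 'u' and 'f' (assumed parameterization)
    """
    h_sum = sum(h for h, _ in children) if children else np.zeros_like(b['i'])
    i_j = sigmoid(W['i'] @ x_j + U['i'] @ h_sum + b['i'])      # input gate  (2)
    o_j = sigmoid(W['o'] @ x_j + U['o'] @ h_sum + b['o'])      # output gate (3)
    u_j = np.tanh(W['u'] @ x_j + U['u'] @ h_sum + b['u'])      # cell input  (5)
    # one forget gate per child, then aggregate the child cell states
    c_j = u_j * i_j
    for h_k, c_k in children:
        f_jk = sigmoid(W['f'] @ x_j + U['f'] @ h_k + b['f'])   # forget gate (4)
        c_j = c_j + c_k * f_jk                                  # cell state  (6)
    h_j = o_j * np.tanh(c_j)                                    # fused feature of node j (7)
    return h_j, c_j
```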
In step S106, according to each target feature, each fused target frame feature and each fused image global feature, a model is generated by using image description, and the description text of the image is determined.
The image description generation model may be an existing model, such as an attention-based top-down long short-term memory decoding model (the attention-based LSTM decoder in Up-Down), or GCN-LSTM (graph convolutional network plus long short-term memory network), etc.; the examples given here are not limiting.
The input of the image description generation model is improved. In some embodiments, for each target feature, the target feature is merged with the corresponding target frame feature (the target features correspond one-to-one to the target frame features of the target frames in which the targets are located) and the corresponding fused target frame feature to obtain a first target local feature; a set of first target local features can thus be constructed. An average pooling operation is performed on the target features to obtain a first global target feature, and an average pooling operation is performed on the target frame features to obtain a first global target frame feature. The first global target feature, the first global target frame feature and the fused image global feature I_h are merged to obtain the merged first global image expression feature. The first global image expression feature and each first target local feature are input into the image description generation model to obtain the output description text of the image.
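A minimal sketch of how these decoder inputs could be assembled is given below; the concatenation order and the function name are assumptions for illustration only.

```python
import numpy as np

def build_decoder_inputs(target_feats, frame_feats, fused_frame_feats, fused_global):
    """Sketch of assembling the first target local features and the merged
    first global image expression feature; names and layout are assumptions."""
    # one first target local feature per target: target feature + its frame
    # feature + the corresponding fused frame feature
    local_feats = [np.concatenate([t, b, fb])
                   for t, b, fb in zip(target_feats, frame_feats, fused_frame_feats)]
    global_target = np.mean(target_feats, axis=0)       # first global target feature
    global_frame = np.mean(frame_feats, axis=0)         # first global target frame feature
    # merged first global image expression feature
    global_expr = np.concatenate([global_target, global_frame, fused_global])
    return np.stack(local_feats), global_expr
```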
As shown in Fig. 2B, in some embodiments, the image description generation model comprises a two-layer LSTM and an attention mechanism module, and the attention mechanism module is connected to the first-layer LSTM and the second-layer LSTM, respectively. The feature of the description word w_t at the current moment (i.e. the last generated description word), the first global image expression feature, and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model are merged and input into the first-layer long short-term memory network of the image description generation model. The feature output by the first-layer long short-term memory network and each first target local feature are input into the attention mechanism module. The feature output by the attention mechanism module and the feature output by the first-layer long short-term memory network are merged and input into the second-layer long short-term memory network to obtain the output description word w_{t+1} at the next moment.
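The data flow of Fig. 2B can be sketched as a two-layer LSTM decoder with an attention module, for example as below. This is an illustrative PyTorch sketch under assumed feature sizes and layer names; it follows the merge/attention/merge order described above rather than reproducing the exact disclosed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerAttnDecoder(nn.Module):
    """Illustrative two-layer LSTM decoder with an attention module."""

    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # first-layer LSTM sees: word embedding + global image expression + h2(t-1)
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)
        # second-layer LSTM sees: attended local feature + h1(t)
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_t, global_feat, local_feats, state1, state2):
        # merge the current word, the global image expression feature and h2(t-1)
        x1 = torch.cat([self.embed(word_t), global_feat, state2[0]], dim=-1)
        h1, c1 = self.lstm1(x1, state1)
        # attention over the target local features, driven by h1
        scores = self.att_out(torch.tanh(
            self.att_feat(local_feats) + self.att_h(h1).unsqueeze(1)))
        attended = (F.softmax(scores, dim=1) * local_feats).sum(dim=1)
        # merge the attended feature with h1 and feed the second-layer LSTM
        h2, c2 = self.lstm2(torch.cat([attended, h1], dim=-1), state2)
        return self.logits(h2), (h1, c1), (h2, c2)   # logits predict the next word
```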
As shown in Fig. 2C, in some embodiments, the image description generation model comprises a two-layer LSTM, an attention mechanism module, and a graph convolutional network. In some embodiments, each target feature is input into the graph convolutional network to obtain the output updated target features; each fused target frame feature is input into the graph convolutional network to obtain the output updated fused target frame features; and each target frame feature is input into the graph convolutional network to obtain the output updated target frame features. The graph convolutional network can update the corresponding features according to the relationships among the nodes, so that the features reflect the relationships among different targets, the generated description text can represent the relationships among the targets, and the description accuracy is further improved.
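A minimal sketch of such a graph-convolution update over target or target-frame features is shown below; the adjacency construction, normalization and sizes are assumptions and not the specific graph convolutional network of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    """Minimal graph-convolution sketch for updating target / target-frame features."""

    def __init__(self, feat_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, adj):
        # feats: (N, feat_dim) node features (targets or target frames)
        # adj:   (N, N) adjacency matrix encoding relations between nodes (assumed given)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)   # simple degree normalization
        neighborhood = (adj @ feats) / deg                  # aggregate related features
        return F.relu(self.linear(neighborhood) + feats)    # residual update
```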
For each updated target feature, the updated target feature is merged with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature, and a set of second target local features is thus constructed. An average pooling operation is performed on the updated target features to obtain a second global target feature; an average pooling operation is performed on the updated target frame features to obtain a second global target frame feature; and an average pooling operation is performed on the updated fused target frame features to obtain a third global target frame feature. The second global target feature, the second global target frame feature and the third global target frame feature are merged to obtain the merged second global image expression feature. The second global image expression feature and each second target local feature are input into the image description generation model to obtain the output description text of the image.
The feature of the description word w_t at the current moment (i.e. the last generated description word), the second global image expression feature, and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model are merged and input into the first-layer long short-term memory network of the image description generation model. The feature output by the first-layer long short-term memory network and each second target local feature are input into the attention mechanism module. The feature output by the attention mechanism module and the feature output by the first-layer long short-term memory network are merged and input into the second-layer long short-term memory network to obtain the output description word w_{t+1} at the next moment.
Experiments show that the scheme of the above embodiments, in which the semantic tree of the image is constructed, features are fused according to the semantic tree with the tree-structured long short-term memory network, and the description text is then generated with the image description generation model, achieves a clear improvement in accuracy over the description text generated by prior-art methods. For example, for a picture showing a giraffe and two zebras near a tree, the description text generated by a prior-art method may be "a group of zebras beside a giraffe", whereas the scheme of the present disclosure can accurately describe it as "a giraffe and two zebras near a tree".
The method of the embodiment divides the image into different layers of semantic information such as targets, target frames and images, constructs a semantic tree of the image according to the relation among each target, the target frames of each target and the image, and reflects the semantic information of different layers. Further, feature fusion is carried out by utilizing a tree long-short time memory network according to the semantic tree, so that each fused target frame feature and the fused image global feature are obtained, and the fused features reflect the relation of semantic information of different layers. And finally, generating a model by utilizing image description, and obtaining a description text of the image by utilizing each target feature, each fused target frame feature and the fused image global feature. The scheme of the embodiment utilizes the semantic tree to mine and embody rich and multi-level semantic information of the image, and further obtains the relation among the features of different levels based on the semantic tree, so that the image description generation model can understand the multi-level semantic information of the image, and the generated description text information is richer and more accurate.
Further embodiments of the image description generation method of the present disclosure are described below in conjunction with fig. 3.
Fig. 3 is a flow chart of further embodiments of the image description generation method of the present disclosure. As shown in Fig. 3, before steps S102 to S106 the method further includes steps S302 to S306.
In step S302, object detection is performed on the image, and each object frame and each divided area in the image are obtained.
Object detection may be performed with an existing model, for example Mask R-CNN (Mask Region-based Convolutional Neural Network), to obtain each target frame and each segmented region in the image; the example given here is not limiting. Mask R-CNN is a pixel-level semantic segmentation model that can determine the category of each pixel, thereby realizing semantic segmentation of the image. A segmented region is the region enclosed by the edge contour of a target.
In step S304, features are extracted from the images in the respective target frames, and target frame features of the respective target frames in the output image are obtained.
Features can be extracted from the images in the target frames with an existing model, for example the object detector Faster R-CNN (Faster Region-based Convolutional Neural Network).
In step S306, features are extracted for the images in the respective divided regions, and target features of the respective targets in the output image are obtained.
Steps S304 and S306 may be performed in parallel. To improve the accuracy of the target features, in some embodiments, for the image in each segmented region, the image in the segmented region is set to white and the rest of the image is set to black, forming a binarized image. The binarized image is overlaid on the original image to obtain an image with the background removed. The background-removed image is input into an object detector to obtain the target features of the targets in the segmented regions. The object detector is, for example, Faster R-CNN.
According to the method, through removing the background outside the target, the characteristics of the target can be extracted more accurately, and the image description text generated later is more accurate.
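The background-removal step can be sketched as follows, assuming the segmented region is available as a boolean pixel mask; the function name and data layout are illustrative.

```python
import numpy as np

def remove_background(image, region_mask):
    """Illustrative background removal: pixels inside the segmented region are
    kept (mask rendered white), everything else is blacked out.

    image       : H x W x 3 uint8 original image
    region_mask : H x W boolean array, True inside the segmented region
    """
    binarized = np.where(region_mask[..., None], 255, 0).astype(np.uint8)   # white / black
    # overlay the binarized image on the original: keep original pixels where white
    no_background = np.where(binarized == 255, image, 0).astype(np.uint8)
    return no_background   # fed to an object detector (e.g. Faster R-CNN) for target features
```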
The training method of the whole model is described below with reference to fig. 4.
Fig. 4 is a flow chart of yet other embodiments of the image description generation method of the present disclosure. As shown in Fig. 4, before steps S102 to S106 the method further includes training steps S402 to S408.
In step S402, a training sample is acquired, the training sample including: sample image, and descriptive text corresponding to the sample image.
In step S404, target features of each target in the sample image, target frame features of each target frame, are acquired.
The target feature and the target frame feature may be obtained according to the method of the foregoing embodiment, the target detection, and the extraction of the model of the target feature and the target frame feature may be performed in advance.
In step S406, a semantic tree of the sample image is constructed according to each target in the sample image, the target frame of each target, and the relationship of the sample image.
The method of constructing a semantic tree refers to the previous embodiments.
In step S408, training is performed on the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the semantic tree, the target features of each target in the sample image, the target frame features of each target frame.
In some embodiments, according to the relation of each node in the semantic tree of the sample image, the node corresponds to the target feature of each target and the node corresponds to the target frame feature of each target frame, the feature fusion is performed by using the tree-shaped long-short time memory network to be trained, and each target frame feature after the sample image fusion and the image global feature after the fusion are determined. And generating a model according to each target feature of the sample image, each fused target frame feature and each fused image global feature by utilizing the image description to be trained, and determining the description text of the output sample image. And adjusting parameters of the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the description text of the output sample image and the description text of the marked sample image until the preset convergence condition is met, so that training of each model is completed.
In actual application, the target features and target frame features of the sample images are obtained and processed in the same way as described in the foregoing embodiments.
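An illustrative joint training step, assuming standard word-level cross-entropy against the annotated description text and hypothetical model interfaces, might look like the following sketch.

```python
import torch
import torch.nn.functional as F

def train_step(tree_lstm, caption_model, optimizer, sample):
    """Illustrative joint training step for the Tree-LSTM and the image
    description generation model; the data layout, model call signatures and
    loss are assumptions rather than the exact disclosed procedure."""
    target_feats, frame_feats, semantic_tree, caption = sample

    # feature fusion over the semantic tree of the sample image
    fused_frame_feats, fused_global_feat = tree_lstm(semantic_tree,
                                                     target_feats, frame_feats)
    # predict word logits for the annotated description (teacher forcing)
    logits = caption_model(target_feats, fused_frame_feats, fused_global_feat,
                           caption[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption[:, 1:].reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # adjust parameters of both models until convergence
    return loss.item()
```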
The present disclosure also provides an image description generating apparatus, described below in connection with fig. 5.
Fig. 5 is a block diagram of some embodiments of a generating device of the image description of the present disclosure. As shown in fig. 5, the apparatus 50 of this embodiment includes: the semantic tree construction module 502, the feature fusion module 504, and the description generation module 506.
The semantic tree construction module 502 is configured to construct a semantic tree of the image according to the relationships among the targets in the image, the target frames of the targets, and the image; each node of the semantic tree corresponds to a target, a target frame, or the image, respectively.
In some embodiments, the semantic tree construction module 502 is configured to configure a node corresponding to the image as a root node of the semantic tree; configuring nodes corresponding to each target frame as middle layer nodes of the semantic tree; configuring the nodes corresponding to the targets as leaf nodes of the semantic tree; the leaf nodes corresponding to the targets are configured as child nodes of the target frames corresponding to the targets.
In some embodiments, the semantic tree construction module 502 is configured to arrange the target frames in order of area from large to small; sequentially taking the nodes corresponding to the targets as nodes to be added according to the arrangement sequence; determining the area of an overlapping area of a target frame corresponding to the node to be added and a target frame corresponding to each added node; and under the condition that the area of the overlapping area of the target frame corresponding to the node to be added exceeds a threshold value, configuring the node to be added as a child node of the added node, otherwise, configuring the node to be added as a child node of the root node.
The feature fusion module 504 is configured to perform feature fusion according to the relationship between each node in the semantic tree, the target feature of each target corresponding to the node, and the target frame feature of each target frame corresponding to the node, and determine the feature of each target frame after fusion and the global feature of the image after fusion by using the tree long-short time memory network; the object frame features are features of the in-frame image of each object.
In some embodiments, the feature fusion module 504 is configured to, starting from a layer where a leaf node of the semantic tree is located, input features corresponding to all child nodes belonging to the same parent node and features corresponding to the parent node into the tree-shaped long-short-term memory network to obtain fused features corresponding to the output parent node, and update the features corresponding to the parent node to the fused features; sequentially updating the corresponding characteristics of the nodes of each layer according to the sequence from bottom to top; according to the updated characteristics corresponding to each node, determining the characteristics of each target frame after fusion and the global characteristics of the image after fusion; when the root node is used as a father node, carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; and weighting the first global target feature and the first global target frame feature to obtain the feature corresponding to the input root node.
The description generating module 506 is configured to determine a description text of the image according to each target feature, each fused target frame feature, and the fused image global feature by using the image description generating model.
In some embodiments, the description generation module 506 is configured to combine, for each target feature, the target feature with a corresponding target frame feature and a corresponding fused target frame feature to obtain a first target local feature; carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; combining the first global target feature, the first global target frame feature and the fused image global feature to obtain a combined first global image expression feature; and inputting the first global image expression features and each first target local feature into an image description generation model to obtain the description text of the output image.
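Treating "combine" as concatenation, which is an assumption since the embodiment does not name the operation, the assembly of the decoder inputs might look as follows; all names and shapes are illustrative.

```python
import torch

def build_decoder_inputs(target_feats, frame_feats, fused_frame_feats, fused_global):
    # target_feats, frame_feats, fused_frame_feats: (num_targets, dim); fused_global: (dim,)
    # First target local features: per-target concatenation of the three features.
    local = torch.cat([target_feats, frame_feats, fused_frame_feats], dim=1)
    # First global target / target frame features via average pooling.
    global_target = target_feats.mean(dim=0)
    global_frame = frame_feats.mean(dim=0)
    # First global image expression feature: combination with the fused global feature.
    global_expr = torch.cat([global_target, global_frame, fused_global])
    return global_expr, local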
In some embodiments, the description generation module 506 is configured to combine the feature of the description word at the current moment, the first global image expression feature, and the feature output by the second-layer long-short-term memory network of the image description generation model at the previous moment, and input the combined feature into the first-layer long-short-term memory network of the image description generation model; input the features output by the first-layer long-short-term memory network and each first target local feature into an attention mechanism module; and combine the features output by the attention mechanism module with the features output by the first-layer long-short-term memory network, and input the combined features into the second-layer long-short-term memory network to obtain the output description word of the next moment.
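One step of this two-layer decoder with attention could be sketched as below. The hidden sizes, the additive attention form, and the vocabulary projection are illustrative assumptions rather than the exact configuration of the image description generation model.

```python
import torch
import torch.nn as nn

class TwoLayerAttnDecoderStep(nn.Module):
    def __init__(self, word_dim, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.lstm1 = nn.LSTMCell(word_dim + feat_dim + hidden_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(feat_dim, hidden_dim)    # projects local features
        self.attn_h = nn.Linear(hidden_dim, hidden_dim)  # projects the query h1
        self.attn_out = nn.Linear(hidden_dim, 1)
        self.vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, global_expr, local_feats, state1, state2):
        # word_emb: (B, word_dim); global_expr: (B, feat_dim);
        # local_feats: (B, N, feat_dim); state1 / state2: (h, c) of each LSTM layer.
        x1 = torch.cat([word_emb, global_expr, state2[0]], dim=-1)
        h1, c1 = self.lstm1(x1, state1)
        # Additive attention over the first target local features, queried by h1.
        scores = self.attn_out(torch.tanh(self.attn_v(local_feats)
                                          + self.attn_h(h1).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)           # (B, N, 1)
        attended = (weights * local_feats).sum(dim=1)    # (B, feat_dim)
        # The attended feature is combined with the first-layer output and fed to
        # the second layer, which scores the description word of the next moment.
        h2, c2 = self.lstm2(torch.cat([attended, h1], dim=-1), state2)
        return self.vocab(h2), (h1, c1), (h2, c2)
```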
In some embodiments, the description generation module 506 is configured to input each target feature into a graph convolution network to obtain each output updated target feature; input the fused target frame features into a graph convolution network to obtain the output updated fused target frame features; input each target frame feature into a graph convolution network to obtain each output updated target frame feature; combine, for each updated target feature, the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature; perform an average pooling operation on each updated target feature to obtain a second global target feature; perform an average pooling operation on each updated target frame feature to obtain a second global target frame feature; perform an average pooling operation on each updated fused target frame feature to obtain a third global target frame feature; combine the second global target feature, the second global target frame feature, and the third global target frame feature to obtain a combined second global image expression feature; and input the second global image expression feature and each second target local feature into the image description generation model to obtain the output description text of the image.
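A minimal graph-convolution update of such node features could look like the layer below; deriving the adjacency from the semantic-tree edges plus self-loops and using mean aggregation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One mean-aggregating graph-convolution layer over node features."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (num_nodes, dim); adj: (num_nodes, num_nodes) with self-loops included.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear(adj @ feats / deg))
```

The same kind of layer would be applied separately to the target features, the target frame features, and the fused target frame features before they are combined into the second target local features.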
In some embodiments, the description generation module 506 is configured to combine the feature of the description word at the current moment, the second global image expression feature, and the feature output by the second-layer long-short-term memory network of the image description generation model at the previous moment, and input the combined feature into the first-layer long-short-term memory network of the image description generation model; input the features output by the first-layer long-short-term memory network and each second target local feature into an attention mechanism module; and combine the features output by the attention mechanism module with the features output by the first-layer long-short-term memory network, and input the combined features into the second-layer long-short-term memory network to obtain the output description word of the next moment.
In some embodiments, the apparatus 50 further comprises: the feature extraction module 508 is configured to perform target detection on the image to obtain each target frame and each segmentation region in the image; extracting features from the images in each target frame to obtain target frame features of each target frame in the output image; and extracting features from the images in each partitioned area to obtain target features of each target in the output image.
In some embodiments, the feature extraction module 508 is configured to, for each segmentation region, set the image within the region to white and the remaining portion to black to obtain a binarized image; superimpose the binarized image on the original image to obtain a background-removed image; and input the background-removed image into an object detector to obtain the output target feature of the target in that segmentation region.
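A sketch of this mask-based background removal is given below; representing each segmentation region as a binary mask and the detector interface are assumptions.

```python
import numpy as np

def remove_background(image, region_mask):
    """image: (H, W, 3) uint8 array; region_mask: (H, W) bool array that is
    True inside one segmentation region."""
    # Binarized image: white inside the segmentation region, black elsewhere.
    binarized = np.where(region_mask, 255, 0).astype(np.uint8)
    # Superimposing it on the original image keeps only the in-region pixels.
    background_removed = np.where(binarized[..., None] == 255, image, 0)
    return background_removed.astype(np.uint8)
```

The background-removed image would then be passed to the object detector to extract the target feature for that region.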
In some embodiments, the apparatus 50 further comprises: training module 510 is configured to obtain training samples, where the training samples include: a sample image and a description text corresponding to the sample image; acquiring target characteristics of each target in the sample image, and target frame characteristics of each target frame; according to each target in the sample image, the target frame of each target and the relation of the sample image, constructing a semantic tree of the sample image; according to the semantic tree, the target characteristics of each target in the sample image, the target frame characteristics of each target frame, the tree-shaped long-short time memory network to be trained and the image description generation model to be trained are trained.
In some embodiments, the training module 510 is configured to perform feature fusion, using the tree-shaped long-short-term memory network to be trained, according to the relationship between the nodes in the semantic tree of the sample image, the target feature of the target corresponding to each node, and the target frame feature of the target frame corresponding to each node, and determine the fused target frame features and the fused global image feature of the sample image; determine the output description text of the sample image according to the target features of the sample image, the fused target frame features, and the fused global image feature, using the image description generation model to be trained; and adjust the parameters of the tree-shaped long-short-term memory network to be trained and the image description generation model to be trained according to the output description text of the sample image and the labeled description text of the sample image, until a preset convergence condition is met, thereby completing the training of each model.
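This joint training can be condensed into one step as sketched below. Word-level cross entropy and the optimizer are assumptions, and `tree_lstm` and `caption_model` are hypothetical callables standing in for the tree-shaped long-short-term memory network and the image description generation model described above.

```python
import torch
import torch.nn as nn

def train_step(sample_image_feats, semantic_tree, target_word_ids,
               tree_lstm, caption_model, optimizer):
    """One optimization step over a single (sample image, description text) pair."""
    optimizer.zero_grad()
    # 1. Fuse features over the semantic tree of the sample image.
    fused_frame_feats, fused_global = tree_lstm(semantic_tree, sample_image_feats)
    # 2. Score the words of the labeled description text with the caption model.
    logits = caption_model(sample_image_feats, fused_frame_feats, fused_global,
                           target_word_ids)             # (seq_len, vocab_size)
    # 3. Word-level cross entropy against the labeled description.
    loss = nn.functional.cross_entropy(logits, target_word_ids)
    loss.backward()
    optimizer.step()        # adjusts the parameters of both models jointly
    return loss.item()
```

Repeating this step until a preset convergence condition is met completes the training of both models.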
The image description generating apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, described below in conjunction with fig. 6 and fig. 7.
Fig. 6 is a block diagram of some embodiments of an image description generating apparatus of the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the method of generating an image description in any of the embodiments of the present disclosure based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, boot Loader (Boot Loader), database, and other programs.
Fig. 7 is a block diagram of further embodiments of the image description generating apparatus of the present disclosure. As shown in fig. 7, the apparatus 70 of this embodiment includes: a memory 710 and a processor 720, which are similar to the memory 610 and the processor 620, respectively. An input/output interface 730, a network interface 740, a storage interface 750, and the like may also be included. These interfaces 730, 740, 750, as well as the memory 710 and the processor 720, may be connected by a bus 760, for example. The input/output interface 730 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 740 provides a connection interface for various networking devices; for example, it may be connected to a database server or a cloud storage server. The storage interface 750 provides a connection interface for external storage devices such as SD cards and USB flash drives.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is a description of preferred embodiments of the present disclosure and is not intended to limit the disclosure; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (14)

1. A method of generating an image description, comprising:
Constructing a semantic tree of an image according to each target in the image, a target frame of each target, and their relationship with the image; wherein the nodes of the semantic tree correspond to the targets, the target frames, and the image, respectively;
according to the relation of all nodes in the semantic tree, the target characteristics of all targets corresponding to the nodes and the target frame characteristics of all target frames corresponding to the nodes, performing characteristic fusion by utilizing a tree-shaped long-short time memory network, and determining the fused target frame characteristics and the fused image global characteristics; the target frame features are features of images in the target frames of the targets;
according to the target features, the fused target frame features and the fused image global features, using image description generation models to determine the description text of the image,
wherein said constructing a semantic tree of said image comprises:
configuring a node corresponding to the image as a root node of the semantic tree;
configuring nodes corresponding to the target frames as middle layer nodes of the semantic tree;
configuring the nodes corresponding to the targets as leaf nodes of the semantic tree;
The leaf nodes corresponding to the targets are configured as child nodes of the target frames corresponding to the targets.
2. The method of claim 1, wherein,
the configuring the node corresponding to each target frame as the middle layer node of the semantic tree comprises the following steps:
arranging the target frames in descending order of area;
sequentially taking the nodes corresponding to the target frames as nodes to be added according to the arrangement sequence;
determining the area of an overlapping area of a target frame corresponding to the node to be added and a target frame corresponding to each added node;
and under the condition that the area of the overlapping area of the target frame corresponding to the node to be added exceeds a threshold value, configuring the node to be added as a child node of the added node, otherwise, configuring the node to be added as a child node of the root node.
3. The method of claim 1, wherein,
the feature fusion by using the tree-shaped long-short time memory network comprises the following steps:
starting from a layer where leaf nodes of the semantic tree are located, inputting features corresponding to all child nodes belonging to the same father node and features corresponding to the father node into the tree-shaped long-short-time memory network to obtain the output fused features corresponding to the father node, and updating the features corresponding to the father node into the fused features;
Sequentially updating the corresponding characteristics of the nodes of each layer according to the sequence from bottom to top;
according to the updated characteristics corresponding to each node, determining the characteristics of each target frame after fusion and the global characteristics of the image after fusion;
when the root node is used as a father node, carrying out average pooling operation on each target feature to obtain a first global target feature; carrying out average pooling operation on each target frame feature to obtain a first global target frame feature; and weighting the first global target feature and the first global target frame feature to obtain the input feature corresponding to the root node.
4. The method of claim 1, wherein,
and generating a model according to the target features, the fused target frame features and the fused image global features by using image description, wherein determining the description text of the image comprises the following steps:
combining the target feature with the corresponding target frame feature and the corresponding fused target frame feature aiming at each target feature to obtain a first target local feature;
carrying out average pooling operation on each target feature to obtain a first global target feature;
carrying out average pooling operation on each target frame feature to obtain a first global target frame feature;
Combining the first global target feature, the first global target frame feature and the fused image global feature to obtain a combined first global image expression feature;
and inputting the first global image expression features and each first target local feature into the image description generation model to obtain the output description text of the image.
5. The method of claim 4, wherein,
the step of inputting the first global image expression features and each first target local feature into the image description generation model to obtain the output description text of the image comprises the following steps:
combining the characteristics of the description words at the current moment, the first global image expression characteristics and the characteristics output at the moment on the second layer long-short-time memory network of the image description generation model, and inputting the characteristics into the first layer long-short-time memory network of the image description generation model;
inputting the characteristics output by the first layer long-short time memory network and each first target local characteristic into an attention mechanism module;
and combining the characteristics output by the attention mechanism module with the characteristics output by the first layer long-short time memory network, and inputting the characteristics into the second layer long-short time memory network to obtain the description word of the next moment of output.
6. The method of claim 1, wherein,
and generating a model according to the target features, the fused target frame features and the fused image global features by using image description, wherein determining the description text of the image comprises the following steps:
inputting each target feature into a graph convolution network to obtain each output updated target feature; inputting the fused target frame features into a graph convolution network to obtain output updated fused target frame features; inputting each target frame characteristic into a graph convolution network to obtain each output updated target frame characteristic;
combining the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature aiming at each updated target feature to obtain a second target local feature;
carrying out average pooling operation on each updated target feature to obtain a second global target feature;
carrying out average pooling operation on each updated target frame feature to obtain a second global target frame feature;
carrying out average pooling operation on each updated fused target frame characteristic to obtain a third global target frame characteristic;
Combining the second global target feature, the second global target frame feature and the third global target frame feature to obtain a combined second global image expression feature;
and inputting the second global image expression features and each second target local feature into the image description generation model to obtain the output description text of the image.
7. The method of claim 6, wherein,
the step of inputting the second global image expression features and each second target local feature into the image description generation model to obtain the output description text of the image comprises the following steps:
inputting the characteristics of the description words at the current moment, the second global image expression characteristics and the characteristics output at the moment on a second layer long-short-time memory network of the image description generation model into a first layer long-short-time memory network of the image description generation model;
inputting the characteristics output by the first layer long-short time memory network and each second target local characteristic into the attention mechanism module;
and combining the characteristics output by the attention mechanism module with the characteristics output by the first layer long-short time memory network, and inputting the characteristics into the second layer long-short time memory network to obtain the description word of the next moment of output.
8. The method of claim 1, further comprising:
performing target detection on the image to obtain each target frame and each segmentation area in the image;
extracting features from the images in each target frame to obtain target frame features of each target frame in the output images;
and extracting features from the images in each partitioned area to obtain target features of each target in the output images.
9. The method of claim 8, wherein,
extracting features from the images in each partitioned area to obtain target features of each target in the output image, wherein the target features comprise:
for the image in each divided area, setting the image in the divided area to white and the image of the other part to black as the binarized image;
superposing the binarized image with the original image to obtain an image with the background removed;
and inputting the image with the background removed into an object detector to obtain the target characteristics of the target in the output segmentation area.
10. The method of claim 1, further comprising:
obtaining a training sample, the training sample comprising: a sample image and a description text corresponding to the sample image;
Acquiring target characteristics of each target in the sample image, and target frame characteristics of each target frame;
constructing a semantic tree of a sample image according to each target in the sample image, a target frame of each target and the relation of the sample image;
training the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the semantic tree, the target characteristics of each target in the sample image, and the target frame characteristics of each target frame.
11. The method of claim 10, wherein,
the training of the tree-shaped long-short time memory network to be trained and the image description generation model to be trained comprises the following steps:
according to the relation of all nodes in the semantic tree of the sample image, the target characteristics of all the targets corresponding to the nodes and the target frame characteristics of all the target frames corresponding to the nodes, performing characteristic fusion by utilizing a tree-shaped long-short time memory network to be trained, and determining all the target frame characteristics after the sample image fusion and the image global characteristics after the fusion;
according to each target feature of the sample image, each fused target frame feature and each fused image global feature, generating a model by utilizing image description to be trained, and determining the description text of the output sample image;
And adjusting parameters of the tree-shaped long-short time memory network to be trained and the image description generation model to be trained according to the output description text of the sample image and the labeled description text of the sample image until preset convergence conditions are met, so that training of each model is completed.
12. An image description generation apparatus, comprising:
the semantic tree construction module is used for constructing a semantic tree of the image according to each target in the image, the target frame of each target and the relation between the images; wherein each node of the semantic tree corresponds to each target, each target frame and the image respectively;
the feature fusion module is used for carrying out feature fusion by utilizing a tree-shaped long-short time memory network according to the relation of each node in the semantic tree, the target features of each target corresponding to the node and the target frame features of each target frame corresponding to the node, and determining the features of each target frame after fusion and the global features of the image after fusion; the target frame features are features of images in the target frames of the targets;
a description generation module, configured to determine a description text of the image by using an image description generation model according to the target features, the fused target frame features and the fused image global features,
The semantic tree construction module is configured to configure a node corresponding to the image as a root node of the semantic tree, configure nodes corresponding to the target frames as intermediate layer nodes of the semantic tree, and configure nodes corresponding to the targets as leaf nodes of the semantic tree, wherein the leaf nodes corresponding to the targets are configured as child nodes of the target frames corresponding to the targets.
13. An image description generation apparatus, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the method of generating an image description according to any one of claims 1-11.
14. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the method of any of claims 1-11.
CN201910841842.2A 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium Active CN111783809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910841842.2A CN111783809B (en) 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111783809A CN111783809A (en) 2020-10-16
CN111783809B true CN111783809B (en) 2024-03-05

Family

ID=72755705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910841842.2A Active CN111783809B (en) 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111783809B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8250481B2 (en) * 2008-10-16 2012-08-21 The Curators Of The University Of Missouri Visualizing geographic-area change detected from high-resolution, remotely sensed imagery

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324954A (en) * 2013-05-31 2013-09-25 中国科学院计算技术研究所 Image classification method based on tree structure and system using same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Target detection based on a depth-first random forest classifier; Ma Juanjuan; Pan Quan; Liang Yan; Hu Jinwen; Zhao Chunhui; Wang Huaxia; Journal of Chinese Inertial Technology (No. 04); full text *

Also Published As

Publication number Publication date
CN111783809A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
Mou et al. RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images
EP3951713A1 (en) Method and apparatus for training image semantic segmentation network, device, and storage medium
CN113254648B (en) Text emotion analysis method based on multilevel graph pooling
Mnih et al. Learning to label aerial images from noisy data
US20220058426A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
US10699170B2 (en) Apparatuses and methods for semantic image labeling
CN111767228B (en) Interface testing method, device, equipment and medium based on artificial intelligence
CN110880019B (en) Method for adaptively training target domain classification model through unsupervised domain
CN111353555A (en) Label detection method and device and computer readable storage medium
CN111401521B (en) Neural network model training method and device, and image recognition method and device
US10832036B2 (en) Meta-learning for facial recognition
CN113221743A (en) Table analysis method and device, electronic equipment and storage medium
EP4080408A1 (en) Model generation method and apparatus, object detection method and apparatus, device, and storage medium
JP2022082493A (en) Pedestrian re-identification method for random shielding recovery based on noise channel
KR20230171966A (en) Image processing method and device and computer-readable storage medium
JP6561605B2 (en) Target analysis method and target analysis system
CN113591016A (en) Landslide labeling contour generation method based on multi-user cooperation
CN111783809B (en) Image description generation method, device and computer readable storage medium
CN117036834A (en) Data classification method and device based on artificial intelligence and electronic equipment
CN110457155B (en) Sample class label correction method and device and electronic equipment
CN111310611A (en) Method for detecting cell visual field map and storage medium
CN113514053B (en) Method and device for generating sample image pair and method for updating high-precision map
CN114648650A (en) Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium
CN111179284A (en) Interactive image segmentation method, system and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant