WO2023134073A1 - Artificial intelligence-based image description generation method and apparatus, device, and medium - Google Patents

Artificial intelligence-based image description generation method and apparatus, device, and medium

Info

Publication number
WO2023134073A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
feature
target
analyzed
Prior art date
Application number
PCT/CN2022/090158
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023134073A1 publication Critical patent/WO2023134073A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the technical field of knowledge representation and reasoning in artificial intelligence, and in particular to an artificial intelligence-based image description generation method, apparatus, device, and medium.
  • Image description is the task of automatically generating a piece of descriptive text for an image.
  • An image description model must not only detect the objects in the image, recognize the text in the image, and understand the relationships between objects, but also describe the image information accurately in language.
  • The inventor realized that the text in an image is crucial for humans to understand the image's information.
  • When a model tries to understand an image that contains text, it must go beyond target detection: it must also recognize the image's text and interpret that text in the context of its surroundings.
  • For example, suppose an image shows a red ring on a brick wall, with a text area across the ring's diameter containing the words "Mornington Crescent". A current image description model can recognize "there is a sign on the brick wall", but cannot reach the understanding "Mornington Crescent is written in a red circle on the wall". Because current image description models cannot understand the text in an image and relate it to its surroundings, they cannot describe the rich information in the image in full.
  • The main purpose of this application is to provide an artificial intelligence-based image description generation method, apparatus, device, and medium, aiming to solve the technical problem in the prior art that, during image description, the image description model cannot understand the text in an image and relate it to its surroundings, and therefore cannot describe the rich information in the picture in detail.
  • The present application proposes an artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
  • an image acquisition module, configured to acquire an image to be described;
  • a text region detection module, configured to perform text region detection according to the image to be described;
  • a text recognition module, configured to perform text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • a target feature extraction module, configured to extract target features according to the image to be described;
  • an image description generation module, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In the artificial intelligence-based image description generation method, apparatus, device, and medium of the present application, the method first performs text region detection according to the image to be described and performs text recognition on each of the text regions according to the image to be described to obtain the texts to be analyzed, then performs target feature extraction according to the image to be described, and finally generates an image description based on a multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • FIG. 1 is a schematic flowchart of an artificial intelligence-based image description generation method according to an embodiment of the present application;
  • FIG. 2 is a schematic block diagram of an artificial intelligence-based image description generation apparatus according to an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • Referring to FIG. 1, an embodiment of the present application provides an artificial intelligence-based image description generation method, the method comprising:
  • S3: performing text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • S5: based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • For S1, the image to be described may be input by the user, retrieved from a database, or obtained from a third-party application system.
  • The image to be described is an image that needs to be described in language.
  • A text region is the region corresponding to the text box of a piece of text in the image to be described.
  • The image to be described corresponds to one or more text regions.
  • The text to be analyzed includes characters and/or symbols.
  • For S4, a model obtained based on the MASK-RCNN (a semantic segmentation algorithm) network is used to extract the target features of each target from the image to be described.
  • Targets include objects and the background.
  • The target features include a target appearance feature, target position information, and a target mask map.
  • The target appearance feature is the appearance feature of the target, that is, the image feature extracted from the image block corresponding to the target.
  • The target position information is the position information of the target in the image to be described.
  • The target mask map is a mask map (that is, a MASK map) generated according to the target appearance feature.
  • For S5, based on the multi-modal feature fusion method, the various features corresponding to each of the texts to be analyzed and the various features corresponding to each of the target features are first mapped into a learned common vector embedding space. The mapped features are then input into a model obtained based on a multi-layer Transformer network to predict words within the range of a predefined code table. Finally, either a recognized token (a word in the text to be analyzed) or a word predicted within the range of the predefined code table is selected as the current output; the output words are combined into a sentence in output order, and the sentence is taken as the image description result. During cyclic decoding, the model obtained based on the multi-layer Transformer network works auto-regressively, taking the previously output word as part of the current input, as sketched below. That is to say, the image description result is a sentence that describes the image to be described in language.
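  • As an illustration of this auto-regressive loop, a minimal sketch follows; `decode_step` is a hypothetical stand-in for one pass of the multi-layer Transformer over the fused multimodal features, not the patent's actual model.

```python
def generate_description(decode_step, max_len=30, bos="<bos>", eos="<eos>"):
    # Auto-regressive decoding: the previously output word becomes part of
    # the current input; decoding stops at the end token or at max length.
    words, prev = [], bos
    while len(words) < max_len:
        word = decode_step(prev)  # one decoding pass (stand-in for the model)
        if word == eos:
            break
        words.append(word)
        prev = word
    return " ".join(words)

# Toy stand-in for the model: emits a fixed sentence, then the end token.
toy = iter("there is a red ring reading Mornington Crescent".split() + ["<eos>"])
print(generate_description(lambda prev: next(toy)))
```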
  • In one embodiment, the above step of performing text region detection according to the image to be described includes:
  • S24: performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
  • S25: performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
  • S26: performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map;
  • The existing DB algorithm directly predicts the probability map of the text region and then judges whether each pixel is text or background according to a manually set threshold. This processing is too coarse: boundaries are often hard to detect, and a single threshold makes the boundaries insufficiently accurate. Moreover, natural scenes contain a large amount of dense text, and insufficiently accurate boundaries make dense text detection perform poorly.
  • To solve the above problems, this embodiment performs down-sampling, up-sampling, and cascade processing on the image to be described, then performs text probability map prediction and dynamic threshold map prediction on the feature layer obtained by the cascade processing, and finally performs a differentiable binarization calculation according to the text probability map and the dynamic threshold map and generates the text regions according to the calculation result, so that more attention is paid to the boundary regions.
  • Differentiable binarization moves the threshold setting into the decision layer of the model, letting the model itself judge the threshold for each pixel. This improves the accuracy of boundary recognition, improves dense text detection, and improves the accuracy of image description.
  • For S21, a preset image segmentation model is used to perform down-sampling on the image to be described, and the features obtained by down-sampling are taken as the down-sampled features.
  • Optionally, the preset image segmentation model is a model trained based on the FCN (Fully Convolutional Network) network. It can be understood that the preset image segmentation model may also be a model trained based on another neural network, which is not limited here.
  • For S22, the preset image segmentation model is used to perform up-sampling on the down-sampled features, and the features obtained by up-sampling are taken as the up-sampled features.
  • By first down-sampling and then up-sampling with the preset image segmentation model, high-level features and low-level features can both be exploited, which facilitates detecting text regions of various scales: the high-level receptive field is large and is used to detect large text regions, while the low levels are used to detect small text regions. The resulting up-sampled features thus contain the rich semantic information of the high levels and the rich representation information of the low levels, making the feature expression more complete.
  • For S23, the individual up-sampled features are cascaded into one feature layer, and the feature layer obtained by the cascade processing is taken as the feature layer to be analyzed.
  • For S24, a preset text probability map prediction model is used to predict the text probability map according to the feature layer to be analyzed, and the predicted text probability map is taken as the target text probability map.
  • The text probability map is a map formed by calculating the probability that each pixel belongs to text.
  • The preset text probability map prediction model is a model obtained by training a neural network with a plurality of first training samples.
  • Each first training sample includes a first image sample and a first calibration value for each pixel.
  • The first image sample is an image containing text.
  • The first calibration value of each pixel is a calibration value of whether that pixel in the first image sample belongs to a text region.
  • The first calibration value of each pixel takes one of two values: 0 (not a text region) and 1 (a text region).
  • For S25, a preset dynamic threshold map prediction model is used to predict the dynamic threshold map according to the feature layer to be analyzed, and the predicted dynamic threshold map is taken as the target dynamic threshold map.
  • The dynamic threshold map is a map formed from the dynamic threshold of each pixel.
  • The preset dynamic threshold map prediction model is a model obtained by training a neural network with a plurality of second training samples.
  • Each second training sample includes a second image sample and a second calibration value for each pixel.
  • The second image sample is an image containing text.
  • The second calibration value of each pixel is a calibration value of whether that pixel in the second image sample lies on the boundary of a text region.
  • The calibration value of each pixel takes one of two values: 0 (not a text region boundary) and 1 (a text region boundary).
  • For S26, the differentiable binarization calculation formula for the pixel in row i, column j of the differentiable binarization map is:
  • B_ij = 1 / (1 + e^(-k * (P_ij - T_ij)))
  • where P_ij is the probability of the pixel in row i, column j of the target text probability map; T_ij is the threshold of the pixel in row i, column j of the target dynamic threshold map; k is an amplification factor and a constant; and e is the base of the natural logarithm.
  • For S27, pixels whose value is greater than a preset threshold are taken as text region pixels, and pixels whose value is less than or equal to the preset threshold are taken as non-text pixels; adjacent text region pixels are then connected into graphic blocks, and each graphic block is taken as one text region, as sketched below.
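  • The following is a minimal NumPy sketch of steps S26 and S27 under stated assumptions: k = 50 and the 0.5 cut-off are illustrative values not taken from the patent, and `scipy.ndimage.label` stands in for connecting adjacent text pixels into graphic blocks.

```python
import numpy as np
from scipy.ndimage import label

def differentiable_binarization(P, T, k=50.0):
    """B_ij = 1 / (1 + exp(-k * (P_ij - T_ij))); P is the target text
    probability map and T is the target dynamic threshold map."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

def text_regions(P, T, preset_threshold=0.5):
    B = differentiable_binarization(P, T)
    mask = B > preset_threshold          # text vs. non-text pixels
    blocks, n = label(mask)              # connect adjacent text pixels
    return [np.argwhere(blocks == i + 1) for i in range(n)]  # one region per block
```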
  • In one embodiment, the above step of performing text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed, includes:
  • S34: inputting each of the time-series feature atlases into a model obtained based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each of the text regions, where each preset label in a preset label dictionary is used as a predicted label of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, and the preset labels include text and a placeholder.
  • In this embodiment, feature maps of a preset height are extracted from the text image block corresponding to each text region and sorted by position to generate a time-series feature atlas, and each time-series feature atlas is then input into the model obtained based on the recurrent neural network for text recognition.
  • Because feature extraction reduces the width of the feature maps, model training is made easier. The width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right: the more segments, the less likely a character is missed, but too many segments easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated, which affects the accuracy of text recognition. In this embodiment the preset labels include text and a placeholder, so the placeholder is used as a preset label and identical adjacent characters are separated by the placeholder; during decoding, if there is no placeholder code between the codes of identical characters, only one of the identical characters is kept, thereby further improving the accuracy of text recognition.
  • A model obtained based on a convolutional neural network is used to extract a feature map of the preset height from each of the text image blocks; that is to say, the height of the feature map equals the preset height, and the extracted feature maps are taken as a feature map set.
  • CNN: convolutional neural network.
  • The feature maps in each feature map set are sorted by position, so as to obtain a time-series feature atlas whose time order is the position order.
  • Each of the time-series feature atlases is input into the model obtained based on the recurrent neural network for text recognition; because each preset label in the preset label dictionary is used as a predicted label of the output dimension of the embedding layer of that model, the range of the text recognition result is the preset labels in the preset label dictionary.
  • Since the width of the image after feature extraction by the convolutional-neural-network-based model equals the number of times the image is predicted from left to right, more segments make missed characters less likely, but too many segments easily cause one character to be recognized multiple times, making it impossible to judge whether characters are genuinely repeated, which affects the accuracy of text recognition.
  • Therefore, the placeholder is used as a preset label and identical adjacent characters are separated by the placeholder; during decoding, when there is no placeholder code between the codes of identical characters, only one of them is kept, thereby further improving the accuracy of text recognition, as sketched below.
  • The preset label dictionary includes preset labels and codes, with a one-to-one correspondence between preset labels and codes.
  • Optionally, the model obtained based on the recurrent neural network is a model obtained based on the bidirectional LSTM (long short-term memory) network.
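  • The placeholder-based decoding can be pictured with a short sketch: identical neighbouring predictions are merged, then placeholders are dropped, so a genuinely doubled character survives only when the model emits a placeholder between its two occurrences. The frame labels below are a made-up example.

```python
PLACEHOLDER = "-"

def decode(frame_labels):
    # Merge identical neighbouring predictions, then drop placeholders.
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev:
            out.append(lab)
        prev = lab
    return "".join(c for c in out if c != PLACEHOLDER)

# The "oo" survives because a placeholder separates the two o's;
# plain repeats like "bb" and "kk" collapse to one character.
assert decode(list("bb-oo-o-kk")) == "book"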
  • In one embodiment, the above step of extracting target features according to the image to be described includes:
  • S44: performing classification prediction according to the region image features to obtain a classification prediction result, where the classification labels of the classification prediction include a plurality of object labels and one background label;
  • S45: performing position regression processing according to each target appearance feature to obtain target position information;
  • In this embodiment, each target feature is extracted using a model obtained based on ResNet-FPN, the ROI Align method, and a model obtained based on a fully convolutional network. The target features include the target appearance feature, the target position information, and the target mask map, which provide the basis for the subsequent understanding of the relationships between targets and for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • FPN: Feature Pyramid Network.
  • VGG: a deep convolutional neural network.
  • ResNet: residual network.
  • A model obtained based on a region proposal network is used to extract target candidate regions from the image to be described; the model adopts frames of preset sizes (also called a priori frames, or anchors) to predict the candidate regions belonging to targets, and each predicted candidate region is taken as a target candidate region.
  • Optionally, the model obtained based on the region proposal network is a model obtained based on the RPN (Region Proposal Network).
  • An image feature map corresponding to each of the target candidate regions is extracted from the image feature map to be analyzed, and the image features corresponding to each of the target candidate regions are taken as target appearance features.
  • ROI Align is a regional feature aggregation method.
  • The ROI Align method first maps the RoI (the solid-line part) onto the feature map (the dotted-line part); it then divides the mapped region into 2*2 units, with each unit assumed to contain four sampling points. Because the sampling points do not have integer coordinates, bilinear interpolation is needed to estimate the value at each sampling point; finally, a pooling calculation is performed in each unit, each unit yields one value, and a 2*2 feature map is obtained, as sketched below.
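  • A small NumPy sketch of this 2*2 ROI Align follows; the average pooling over the four sampling points and the single-channel feature map are simplifying assumptions for illustration.

```python
import numpy as np

def bilinear(fmap, y, x):
    # Bilinearly interpolate a single-channel feature map at float (y, x).
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align_2x2(fmap, roi):
    # roi = (y_min, x_min, y_max, x_max) mapped into feature-map coordinates.
    y_min, x_min, y_max, x_max = roi
    uh, uw = (y_max - y_min) / 2.0, (x_max - x_min) / 2.0  # 2*2 units
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            # four regularly spaced, non-integer sampling points per unit
            pts = [(y_min + (i + fy) * uh, x_min + (j + fx) * uw)
                   for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
            out[i, j] = np.mean([bilinear(fmap, y, x) for y, x in pts])
    return out

fm = np.arange(36, dtype=float).reshape(6, 6)
print(roi_align_2x2(fm, (1.0, 1.0, 4.2, 4.2)))  # a 2*2 pooled feature map
```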
  • For S44, the classification labels of the classification prediction include a plurality of object labels and one background label; classification prediction is performed according to each of the region image features so as to predict whether the image corresponding to the region image features is background or an object. It can be understood that predicting an object here means predicting what kind of object it is.
  • FC layer: fully connected layer.
  • softmax: a regression/classification activation function.
  • For S45, target position regression processing is performed according to each target appearance feature, and the predicted position information is taken as the target position information.
  • The target position information is the position information of the detection frame corresponding to the target.
  • A model obtained based on the fully convolutional network is used to perform pixel-by-pixel mask prediction according to each of the target appearance features, and the target mask map is obtained from the mask prediction result.
  • The target appearance feature, target position information, and target mask map corresponding to the same target candidate region are taken together as one target feature; that is, the target appearance feature, target position information, and target mask map in one target feature all correspond to the same target. It can be understood that there may be one or more target features.
  • In one embodiment, the above step of generating an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain the image description result, includes:
  • S51: performing feature fusion according to each of the target features to obtain first fusion features;
  • S54: using a model obtained based on a dynamic pointer network to generate an image description according to the word prediction result and each text to be analyzed, to obtain the image description result.
  • In this embodiment, feature fusion is first performed on the target features, then feature fusion is performed on the texts to be analyzed; word prediction is then performed on the two fusion results using a model obtained based on an iterative Transformer, and finally image description generation is performed according to the word prediction result and the texts to be analyzed. This realizes the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description, thereby expressing the rich information of the image in detail and in full and improving the accuracy of the image description.
  • The input of the model obtained based on the iterative Transformer includes the first fusion features, the second fusion features, and the word predicted at the previous moment. For example, when the third word is predicted, the predicted second word is input into the model as the word predicted at the previous moment.
  • The Transformer structure is composed of an encoder and a decoder, and the Transformer uses the attention mechanism to solve natural-language translation problems.
  • The word prediction result includes one or more words.
  • Optionally, the model obtained based on the dynamic pointer network is a model obtained based on the DPN (Dynamic Pointer Network).
  • The words selected by the model obtained based on the dynamic pointer network are used as the words of the image description; it can be understood that the selected words are spliced into a sentence, and the spliced sentence is taken as the image description result, as sketched below.
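  • A sketch of the word selection is given below; the linear scoring of vocabulary words and the dot-product scoring of OCR-token embeddings are illustrative assumptions rather than the patent's exact dynamic pointer network formulation.

```python
import numpy as np

def pick_word(dec_state, W_vocab, ocr_embs, vocab, ocr_tokens):
    # Choose the current output: either a word from the predefined code
    # table or a copied OCR token (the dynamic-pointer idea, sketched).
    vocab_scores = W_vocab @ dec_state  # score every code-table word
    copy_scores = ocr_embs @ dec_state  # score every recognized OCR token
    best = int(np.argmax(np.concatenate([vocab_scores, copy_scores])))
    return vocab[best] if best < len(vocab) else ocr_tokens[best - len(vocab)]

rng = np.random.default_rng(0)
d = 8
vocab = ["a", "sign", "on", "the", "wall"]
ocr_tokens = ["Mornington", "Crescent"]
print(pick_word(rng.normal(size=d), rng.normal(size=(len(vocab), d)),
                rng.normal(size=(len(ocr_tokens), d)), vocab, ocr_tokens))
```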
  • In one embodiment, the above step of performing feature fusion according to each of the target features to obtain the first fusion features includes:
  • S512: encoding the target position information in the target feature to be processed, to obtain a target position code;
  • S513: linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, to obtain the first fusion feature corresponding to the target feature to be processed;
  • S514: repeating the step of acquiring one target feature as the target feature to be processed until the acquisition of the target features is completed.
  • In this embodiment, the various features of each target feature are linearly transformed to map into the vector embedding space of the first preset dimension, which provides the basis for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • For S511, any one of the target features is acquired, and the acquired target feature is taken as the target feature to be processed.
  • For S512, the target position code is expressed as [x_min / W_im, y_min / H_im, x_max / W_im, y_max / H_im], where x_min is the minimum horizontal pixel position, y_min is the minimum vertical pixel position, x_max is the maximum horizontal pixel position, y_max is the maximum vertical pixel position, W_im is the image width, and H_im is the image height.
  • For S513, the first fusion feature corresponding to the target feature to be processed is computed as LN(W1 * x_app) + LN(W2 * x_pos), where LN denotes linear normalization processing, W1 and W2 are preset constants, x_app is the target appearance feature, and x_pos is the target position code.
  • Steps S511 to S514 are executed repeatedly until the acquisition of the target features is completed.
  • When the acquisition of the target features is completed, the feature fusion of each target feature is complete, as sketched below.
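  • A minimal sketch of this first fusion follows; the 2048-dimensional appearance feature, the 768-dimensional embedding space, the random projection matrices, and the zero-mean/unit-variance form of LN are all illustrative assumptions, not values from the patent.

```python
import numpy as np

D = 768  # first preset dimension (illustrative choice)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(D, 2048)) * 0.01  # projects the target appearance feature
W2 = rng.normal(size=(D, 4)) * 0.01     # projects the target position code

def ln(v):
    # Stand-in for the patent's LN (linear normalization) step.
    return (v - v.mean()) / (v.std() + 1e-6)

def fuse_target(appearance, box, im_w, im_h):
    x_min, y_min, x_max, y_max = box
    pos = np.array([x_min / im_w, y_min / im_h, x_max / im_w, y_max / im_h])
    return ln(W1 @ appearance) + ln(W2 @ pos)  # first fusion feature

print(fuse_target(rng.normal(size=2048), (40, 60, 200, 180), 640, 480).shape)
```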
  • In one embodiment, the above step of performing feature fusion on each of the texts to be analyzed according to the image to be described, to obtain the second fusion features, includes:
  • S522: performing word vector encoding of a second preset dimension according to the text to be processed, to obtain a text block word vector;
  • S523: performing image feature extraction from the image to be described according to the text to be processed, to obtain a text block image feature;
  • S524: performing text encoding of a third preset dimension on the text to be processed, to obtain a text block encoding feature;
  • S525: determining the position information of the text to be processed according to the image to be described, to obtain text position information;
  • S526: encoding the text position information to obtain a text position code;
  • S527: linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, to obtain the second fusion feature corresponding to the text to be processed;
  • In this embodiment, the various features of each text to be analyzed are linearly transformed to map into the vector embedding space of the fourth preset dimension, which provides the basis for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • The second preset dimension is set to 300.
  • A FastText word vector is a word vector generated by FastText.
  • FastText is a word vector calculation and text classification tool.
  • The PHOC (Pyramidal Histogram of Characters) encoding method is used to perform the text encoding of the third preset dimension on the text to be processed, and the encoded data is taken as the text block encoding feature.
  • The third preset dimension is set to 604 dimensions.
  • The text position information is the position information, in the image to be described, of the text region corresponding to the text to be processed.
  • For S527, the second fusion feature corresponding to the text to be processed is computed as LN(W3 * x_w + W4 * x_img + W5 * x_phoc) + LN(W6 * x_pos), where LN denotes linear normalization processing; W3, W4, W5, and W6 are preset constants; x_w is the text block word vector; x_img is the text block image feature; x_phoc is the text block encoding feature; and x_pos is the text position code.
  • Steps S521 to S528 are executed repeatedly until the acquisition of the texts to be analyzed is completed.
  • When the acquisition of the texts to be analyzed is completed, the feature fusion of each text to be analyzed is complete, as sketched below.
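  • The text-side fusion can be sketched the same way; the 300-dimensional FastText vector and the 604-dimensional PHOC encoding follow the embodiment above, while the image-feature dimension, the target dimension, and the normalization form are illustrative assumptions.

```python
import numpy as np

D = 768  # fourth preset dimension (illustrative choice)
rng = np.random.default_rng(1)
W3 = rng.normal(size=(D, 300)) * 0.01   # text block word vector (FastText, 300-d)
W4 = rng.normal(size=(D, 2048)) * 0.01  # text block image feature (dim assumed)
W5 = rng.normal(size=(D, 604)) * 0.01   # text block encoding feature (PHOC, 604-d)
W6 = rng.normal(size=(D, 4)) * 0.01     # text position code

def ln(v):
    # Stand-in for the patent's LN (linear normalization) step.
    return (v - v.mean()) / (v.std() + 1e-6)

def fuse_text(word_vec, img_feat, phoc, pos):
    # Second fusion feature for one text block.
    return ln(W3 @ word_vec + W4 @ img_feat + W5 @ phoc) + ln(W6 @ pos)

print(fuse_text(rng.normal(size=300), rng.normal(size=2048),
                rng.normal(size=604), rng.normal(size=4)).shape)
```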
  • Referring to FIG. 2, the present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
  • an image acquisition module 100, configured to acquire an image to be described;
  • a text region detection module 200, configured to perform text region detection according to the image to be described;
  • a text recognition module 300, configured to perform text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • a target feature extraction module 400, configured to extract target features according to the image to be described;
  • an image description generation module 500, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each text to be analyzed, and each target feature, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • An embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer programs, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium.
  • The database of the computer device is used to store data for the artificial intelligence-based image description generation method.
  • The network interface of the computer device is used to communicate with an external terminal via a network connection. When the computer program is executed by the processor, an artificial intelligence-based image description generation method is implemented, the method comprising:
  • based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial intelligence-based image description generation method is implemented, the method comprising:
  • based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In the artificial intelligence-based image description generation method executed above, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, an image description is generated based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of knowledge representation and reasoning in artificial intelligence and discloses an artificial intelligence-based image description generation method and apparatus, a device, and a medium. The method comprises: obtaining an image to be described; performing text region detection according to the image to be described; performing text recognition on each text region according to the image to be described to obtain texts to be analyzed; performing target feature extraction according to the image to be described; and performing image description generation on the basis of a multi-modal feature fusion method according to the image to be described, each text to be analyzed, and each target feature, to obtain an image description result. By understanding the recognized text of an image in the context of its surroundings on the basis of multi-modal feature fusion, an image description is generated, so that the abundant information in the image is completely expressed in language and the accuracy of image description is improved.

Description

Artificial intelligence-based image description generation method, apparatus, device, and medium
This application claims priority to the Chinese patent application No. 202210028089.7, filed with the China Patent Office on January 11, 2022 and entitled "Artificial intelligence-based image description generation method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of knowledge representation and reasoning in artificial intelligence, and in particular to an artificial intelligence-based image description generation method, apparatus, device, and medium.
Background
Image description is the task of automatically generating a piece of descriptive text for an image. An image description model must not only detect the objects in the image, recognize the text in the image, and understand the relationships between objects, but also describe the image information accurately in language. The inventor realized that the text in an image is crucial for humans to understand the image's information: when a model tries to understand an image that contains text, it must go beyond target detection and also understand the image's text in the context of its surroundings. For example, suppose an image shows a red ring on a brick wall, with a text area across the ring's diameter containing the words "Mornington Crescent". A current image description model can recognize "there is a sign on the brick wall", but cannot reach the understanding "Mornington Crescent is written in a red circle on the wall". Because current image description models cannot understand the text in an image and relate it to its surroundings, they cannot fully describe the rich information in the image.
Technical Problem
The main purpose of this application is to provide an artificial intelligence-based image description generation method, apparatus, device, and medium, aiming to solve the technical problem in the prior art that, during image description, the image description model cannot understand the text in an image and relate it to its surroundings, and therefore cannot describe the rich information in the picture in detail.
Technical Solution
The present application proposes an artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image to be described;
a text region detection module, configured to perform text region detection according to the image to be described;
a text recognition module, configured to perform text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
a target feature extraction module, configured to extract target features according to the image to be described;
an image description generation module, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
Beneficial Effects
In the artificial intelligence-based image description generation method, apparatus, device, and medium of the present application, the method first performs text region detection according to the image to be described and performs text recognition on each of the text regions according to the image to be described to obtain the texts to be analyzed, then performs target feature extraction according to the image to be described, and finally generates an image description based on a multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result. By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an artificial intelligence-based image description generation method according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of an artificial intelligence-based image description generation apparatus according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The realization of the objectives, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to FIG. 1, an embodiment of the present application provides an artificial intelligence-based image description generation method, the method comprising:
S1: acquiring an image to be described;
S2: performing text region detection according to the image to be described;
S3: performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
S4: performing target feature extraction according to the image to be described;
S5: based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained. By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
For S1, the image to be described may be input by the user, retrieved from a database, or obtained from a third-party application system.
The image to be described is an image that needs to be described in language.
For S2, a model obtained based on the DB (Differentiable Binarization) algorithm is used to perform text region detection on the image to be described.
A text region is the region corresponding to the text box of a piece of text in the image to be described.
It can be understood that the image to be described corresponds to one or more text regions.
For S3, a model obtained based on the CRNN (An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition) network is used to perform text recognition on each of the text regions according to the image to be described, and the text corresponding to each text region is taken as one text to be analyzed.
That is to say, the texts to be analyzed correspond one-to-one with the text regions.
The text to be analyzed includes characters and/or symbols.
For S4, a model obtained based on the MASK-RCNN (a semantic segmentation algorithm) network is used to extract the target features of each target from the image to be described.
Targets include objects and the background.
The target features include a target appearance feature, target position information, and a target mask map. The target appearance feature is the appearance feature of the target, that is, the image feature extracted from the image block corresponding to the target. The target position information is the position information of the target in the image to be described. The target mask map is a mask map (that is, a MASK map) generated according to the target appearance feature.
For S5, based on the multi-modal feature fusion method, the various features corresponding to each of the texts to be analyzed and the various features corresponding to each of the target features are first mapped into a learned common vector embedding space. The mapped features are then input into a model obtained based on a multi-layer Transformer network to predict words within the range of a predefined code table. Finally, either a recognized token (a word in the text to be analyzed) or a word predicted within the range of the predefined code table is selected as the current output; the output words are combined into a sentence in output order, and the sentence is taken as the image description result. During cyclic decoding, the model obtained based on the multi-layer Transformer network works auto-regressively, taking the previously output word as part of the current input. That is to say, the image description result is a sentence that describes the image to be described in language.
In one embodiment, the above step of performing text region detection according to the image to be described includes:
S21: performing down-sampling on the image to be described, to obtain down-sampled features;
S22: performing up-sampling on the down-sampled features, to obtain up-sampled features;
S23: performing cascade processing on the up-sampled features, to obtain a feature layer to be analyzed;
S24: performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
S25: performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
S26: performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map;
S27: generating the text regions according to the differentiable binarization map.
现有的DB算法是直接预测文本区域的概率图,然后根据人为设定的阈值判断每个像素是文字还是背景,这种处理方式过于粗暴,对于边界的检测往往比较难,单一的阈值会导致边界不够准确;并且自然场景中存在大量密集文本,不够准确的边界会让密集文本检测效果很差。为了解决上述问题,本实施例实现了对待描述图像分别进行下采样、上采样和级联处理,然后对级联处理的得到的特征层分别进行文本概率图预测和动态阈值图预测,最后根据文本概率图和动态阈值图进行可微分二值化计算,根据计算结果生成文本区域,实现了更关注边界区域,可微分二值化可以将阈值的设定放在模型的判定层中,让模型自行判断不同像素点设定的阈值大小,提高了边界识别的准确性,提高了密集文本检测效果,提高了图像描述的准确性。The existing DB algorithm directly predicts the probability map of the text area, and then judges whether each pixel is text or background according to the artificially set threshold. This processing method is too rough, and it is often difficult to detect the boundary. A single threshold will lead to The boundary is not accurate enough; and there are a lot of dense text in natural scenes, the boundary that is not accurate enough will make the dense text detection effect poor. In order to solve the above problems, this embodiment implements down-sampling, up-sampling and cascade processing on the image to be described, and then performs text probability map prediction and dynamic threshold map prediction on the feature layers obtained by the cascade processing, and finally according to the text Differentiable binarization calculations are performed on the probability map and dynamic threshold map, and the text area is generated according to the calculation results, so that more attention is paid to the boundary area. Differentiable binarization can set the threshold value in the decision layer of the model, allowing the model to automatically Judging the threshold value set by different pixel points improves the accuracy of boundary recognition, improves the effect of dense text detection, and improves the accuracy of image description.
For S21, a preset image segmentation model is used to perform downsampling processing on the image to be described, and the features obtained by downsampling are used as the downsampled features.
Optionally, the preset image segmentation model is a model trained on an FCN (Fully Convolutional Network). It can be understood that the preset image segmentation model may also be a model trained on another neural network, which is not limited here.
For S22, the preset image segmentation model is used to perform upsampling processing on the downsampled features, and the features obtained by upsampling are used as the upsampled features. By first downsampling and then upsampling with the preset image segmentation model, both high-level and low-level features can be exploited, which makes it easy to detect text regions at various scales: the high levels, with their large receptive fields, detect large text regions, while the low levels detect small ones. The resulting upsampled features therefore contain the rich semantic information of the high levels and the rich representational information of the low levels, giving a more complete feature expression.
For S23, the features in the upsampled features are cascaded into one feature layer, and the feature layer obtained by the cascade processing is used as the feature layer to be analyzed.
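As an illustration, a minimal PyTorch sketch of S21–S23 follows. It assumes a backbone has already produced four feature levels c2–c5 (the names, channel counts, and the use of bilinear upsampling are assumptions, not details from this disclosure): every level is brought to a common resolution and then cascaded into the feature layer to be analyzed.

```python
import torch
import torch.nn.functional as F

def build_feature_layer(c2, c3, c4, c5):
    # Upsample every level to the spatial size of the finest map (c2),
    # then cascade (concatenate) them along the channel dimension.
    size = c2.shape[-2:]
    levels = [c2] + [
        F.interpolate(c, size=size, mode="bilinear", align_corners=False)
        for c in (c3, c4, c5)
    ]
    return torch.cat(levels, dim=1)  # the feature layer to be analyzed

# Toy multi-scale features (batch 1, 64 channels, strides 4/8/16/32 of a 256px image)
c2, c3, c4, c5 = (torch.randn(1, 64, s, s) for s in (64, 32, 16, 8))
print(build_feature_layer(c2, c3, c4, c5).shape)  # torch.Size([1, 256, 64, 64])
```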
For S24, a preset text probability map prediction model is used to perform text probability map prediction according to the feature layer to be analyzed, and the predicted text probability map is used as the target text probability map.
The text probability map is a map formed by computing, for each pixel, the probability that it belongs to text.
The preset text probability map prediction model is a model obtained by training a neural network with a plurality of first training samples. A first training sample includes a first image sample and a first calibration value for each pixel. The first image sample is an image containing text. The first calibration value of a pixel indicates whether that pixel of the first image sample belongs to a text region, and takes one of two values: 0 (not a text region) or 1 (a text region).
For S25, a preset dynamic threshold map prediction model is used to perform dynamic threshold map prediction according to the feature layer to be analyzed, and the predicted dynamic threshold map is used as the target dynamic threshold map.
The dynamic threshold map is a map formed from the dynamic threshold of each pixel.
The preset dynamic threshold map prediction model is a model obtained by training a neural network with a plurality of second training samples. A second training sample includes a second image sample and a second calibration value for each pixel. The second image sample is an image containing text. The second calibration value of a pixel indicates whether that pixel of the second image sample belongs to a text region boundary, and takes one of two values: 0 (not a text region boundary) or 1 (a text region boundary).
For S26, the differentiable binarization calculation for the pixel in row i, column j of the differentiable binarization map is:

$$B_{ij} = \frac{1}{1 + e^{-k\,(P_{ij} - T_{ij})}}$$

where $P_{ij}$ is the probability of the pixel in row i, column j of the target text probability map, $T_{ij}$ is the threshold of the pixel in row i, column j of the target dynamic threshold map, k is an amplification factor (a constant), and e is the natural constant.
For S27, in the differentiable binarization map, pixels whose value is greater than a preset threshold are taken as text-region pixels, and pixels whose value is less than or equal to the preset threshold are taken as non-text-region pixels; adjacent text-region pixels are connected into graphic blocks, and each graphic block is taken as one text region.
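A small NumPy sketch of S26–S27 under the formula above (the value k = 50 is an assumption taken from common DB implementations, not from this disclosure):

```python
import numpy as np

def differentiable_binarization(P, T, k=50.0):
    # B_ij = 1 / (1 + exp(-k * (P_ij - T_ij)))
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

# Toy 2x2 probability and dynamic-threshold maps
P = np.array([[0.9, 0.2],
              [0.8, 0.1]])
T = np.full((2, 2), 0.5)
B = differentiable_binarization(P, T)
text_pixels = B > 0.5  # pixels above the preset threshold belong to text regions
print(text_pixels)     # [[ True False] [ True False]]
# S27 would then connect adjacent True pixels into blocks, one region per block.
```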
In one embodiment, the above step of performing text recognition on each text region according to the image to be described to obtain the text to be analyzed includes:
S31: extracting, according to each text region, an image block from the image to be described to obtain a text image block;
S32: performing, with a model obtained based on a convolutional neural network, feature map extraction at a preset height on each text image block to obtain a feature atlas;
S33: sorting the feature maps in each feature atlas by position to obtain a time-series feature atlas;
S34: inputting each time-series feature atlas into a model obtained based on a recurrent neural network for text recognition to obtain the text to be analyzed corresponding to each text region, where the preset labels in a preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, the preset labels including text and a placeholder.
In this embodiment, feature maps of a preset height are extracted from the text image block corresponding to each text region, the feature maps are sorted by position to generate a time-series feature atlas, and each time-series feature atlas is then input into a model obtained based on a recurrent neural network for text recognition. Sorting the feature maps by position to generate the time-series feature atlas reduces the width of the feature maps, which benefits model training. Because the width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right, more divisions make it less likely that characters are missed; but too many divisions easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated, which harms recognition accuracy. By making the preset labels include text and a placeholder, this embodiment uses the placeholder as a preset label so that genuinely repeated characters are separated by it: if, during decoding, there is no placeholder code between the codes of identical characters, only one of those characters is kept, further improving the accuracy of text recognition.
For S31, the image block corresponding to each text region is extracted from the image to be described, and each extracted image block is used as a text image block.
For S32, a model obtained based on a convolutional neural network (CNN) is used to extract feature maps of a preset height from each text image block; that is, the height of each feature map equals the preset height, and the extracted feature maps form a feature atlas.
For S33, the feature maps in each feature atlas are sorted by position, yielding a time-series feature atlas in which the order of positions serves as the time order.
For S34, each time-series feature atlas is input into the model obtained based on a recurrent neural network for text recognition. Because the preset labels in the preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of that model, the range of the text recognition result is the preset labels in the preset label dictionary.
Because the width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right, more divisions make it less likely that characters are missed, but too many divisions easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated and harming recognition accuracy. By also using the placeholder as a preset label, genuinely repeated characters are separated by it: if, during decoding, there is no placeholder code between the codes of identical characters, only one of those characters needs to be kept, further improving the accuracy of text recognition. The preset label dictionary includes preset labels and codes, set in one-to-one correspondence.
Optionally, the code corresponding to the placeholder is set to 0.
For example, using the 26 letters as the text preset labels and setting the placeholder code to 0: when the codes of the predicted labels are [2, 2, 2, 0, 15, 15, 0, 15, 15, 11], the 0 (the placeholder code) inside "15, 15, 0, 15, 15" establishes two repeated characters, while "2, 2, 2" contains no 0 and therefore expresses a single character, so the text to be analyzed is determined to be "book".
As another example, again using the 26 letters as the text preset labels and setting the placeholder code to 0: when the codes of the predicted labels are [0, 0, 2, 15, 15, 15, 15, 0, 11, 11], there is no 0 inside "15, 15, 15, 15", so it expresses a single character, and the text to be analyzed is determined to be "bok". These examples are illustrative and not limiting.
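The placeholder-based decoding described above can be sketched as a CTC-style greedy collapse; the letter-to-code mapping a = 1 … z = 26 with placeholder 0 follows the two examples:

```python
def decode_with_placeholder(codes, blank=0):
    """Collapse repeats not separated by the placeholder code,
    then drop the placeholders."""
    out, prev = [], None
    for c in codes:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

alphabet = {i + 1: ch for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
print("".join(alphabet[c] for c in decode_with_placeholder(
    [2, 2, 2, 0, 15, 15, 0, 15, 15, 11])))  # -> "book"
print("".join(alphabet[c] for c in decode_with_placeholder(
    [0, 0, 2, 15, 15, 15, 15, 0, 11, 11])))  # -> "bok"
```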
The model obtained based on a recurrent neural network is a model obtained based on a bidirectional LSTM (long short-term memory) network.
In one embodiment, the above step of performing target feature extraction according to the image to be described includes:
S41: performing image feature map extraction on the image to be described to obtain an image feature map to be analyzed;
S42: performing target candidate region extraction according to the image to be described with a model obtained based on a region proposal network;
S43: extracting, according to each target candidate region, image features from the image feature map to be analyzed to obtain target appearance features;
S44: performing classification prediction according to each regional image feature to obtain a classification prediction result, where the classification labels of the classification prediction include a plurality of object labels and one background label;
S45: performing position regression processing according to each target appearance feature to obtain target position information;
S46: generating a mask map according to each target appearance feature with a model obtained based on a fully convolutional network to obtain a target mask map;
S47: taking the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
In this embodiment, each target feature is extracted with a model obtained based on ResNet-FPN, the ROI Align method, and a model obtained based on a fully convolutional network. Each target feature includes the target appearance feature, the target position information, and the target mask map, which provides the basis for subsequently understanding the relationships between targets and for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S41, a model obtained based on ResNet-FPN is used to extract image features from the image to be described, and the extracted image features are used as the image feature map to be analyzed.
FPN (Feature Pyramid Network) is a general-purpose architecture that can be combined with various backbone networks, such as VGG (a deep convolutional neural network) or ResNet (a residual network).
For S42, a model obtained based on a region proposal network is used to extract target candidate regions from the image to be described. The model uses boxes of preset sizes (also called prior boxes, or anchors) to predict candidate regions belonging to targets, and each predicted candidate region is used as one target candidate region.
The model obtained based on the region proposal network is a model obtained based on an RPN (Region Proposal Network).
For S43, based on the ROI Align method, the image feature map corresponding to each target candidate region is extracted from the image feature map to be analyzed, and the image features corresponding to each target candidate region are used as the target appearance features.
ROI Align is a regional feature aggregation method.
The ROI Align method first maps the RoI onto the feature map; it then divides the mapped region evenly into 2×2 units, each containing, say, four sampling points. Since the sampling points do not fall on integer coordinates, bilinear interpolation is used to estimate the value at each sampling point; finally, pooling is performed within each unit so that each unit yields one value, producing a 2×2 feature map.
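For reference, torchvision ships an implementation of this operation; a usage sketch under assumed sizes (a stride-16 feature map, and a 7×7 output rather than the 2×2 of the illustration) might look like:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)  # backbone feature map (stride 16 assumed)
# One RoI: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0.0, 48.0, 48.0, 320.0, 320.0]])
pooled = roi_align(features, rois, output_size=(7, 7),
                   spatial_scale=1 / 16, sampling_ratio=4)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```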
For S44, because the classification labels of the classification prediction include a plurality of object labels and one background label, classification prediction is performed according to each regional image feature to predict whether the image corresponding to that feature is background or an object. It can be understood that predicting an object here means predicting which specific object it is.
Optionally, a model obtained based on an FC layer (fully connected layer) and a softmax activation function is used to perform classification prediction according to each regional image feature to obtain the classification prediction result.
For S45, position regression processing is performed on the target according to each target appearance feature, and the predicted position information is used as the target position information.
Optionally, a model obtained based on an FC layer and a bbox regressor is used to perform the position regression processing according to each target appearance feature.
A bbox regressor is a bounding-box regressor.
The target position information is the position information of the detection box corresponding to the target.
For S46, a model obtained based on a fully convolutional network is used to perform pixel-by-pixel mask prediction according to each target appearance feature, and the target mask map is obtained from the mask prediction results.
For S47, the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region are taken as one target feature; that is, the target appearance feature, target position information, and target mask map within one target feature all correspond to the same target. It can be understood that there may be one target feature or several.
In one embodiment, the above step of generating an image description according to the image to be described, each text to be analyzed, and each target feature based on the method of multimodal feature fusion to obtain the image description result includes:
S51: performing feature fusion according to each target feature to obtain a first fusion feature;
S52: performing feature fusion on each text to be analyzed according to the image to be described to obtain a second fusion feature;
S53: performing word prediction according to each first fusion feature and each second fusion feature with a model obtained based on an iterative Transformer to obtain a word prediction result;
S54: performing image description generation according to the word prediction result and each text to be analyzed with a model obtained based on a dynamic pointer network to obtain the image description result.
In this embodiment, feature fusion is first performed on the target features and then on the texts to be analyzed; a model obtained based on an iterative Transformer then performs word prediction from the two fusion results, and finally the image description is generated from the word prediction result and the texts to be analyzed. This realizes the multimodal-feature-fusion-based approach of understanding the recognized text of the image in connection with its environment to generate the image description, so that the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
For S51, the various features corresponding to each target feature are mapped into the general vector embedding space obtained through learning, and the mapped features are used as the first fusion feature.
For S52, according to the image to be described, the various features corresponding to each text to be analyzed are mapped into the general vector embedding space obtained through learning, and the mapped features are used as the second fusion feature.
For S53, the input of the model obtained based on the iterative Transformer includes the first fusion features, the second fusion features, and the word predicted at the previous moment. For example, when predicting the third word, the predicted second word is fed into the model obtained based on the iterative Transformer as the word predicted at the previous moment.
The Transformer is likewise composed of an encoder and a decoder, and it uses the attention mechanism to solve natural-language translation problems.
The first fusion features, the second fusion features, and the historically predicted words are input into the model obtained based on the iterative Transformer, giving a matrix of shape (M + N + P, d), where d is a preset dimension, M corresponds to the historically predicted words, N to the words predicted at the current moment, and P is the time step at the current moment, identifying the number of words already predicted; the initial value of P is 1.
The word prediction result includes one or more words.
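The autoregressive loop can be sketched as follows (`step_model`, `BOS`, and `EOS` are hypothetical names; the disclosure does not specify the interface):

```python
def generate(step_model, first_fusion, second_fusion,
             BOS="<s>", EOS="</s>", max_len=20):
    # At each time step the words decoded so far are fed back in,
    # and the model predicts the next word.
    words = [BOS]
    for _ in range(max_len):
        next_word = step_model(first_fusion, second_fusion, words)
        if next_word == EOS:
            break
        words.append(next_word)
    return words[1:]  # the predicted words, in output order
```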
For S54, the model obtained based on the dynamic pointer network is a model obtained based on a DPN (Dynamic Pointer Network).
The model obtained based on the dynamic pointer network models a probability distribution over the word prediction result and each text to be analyzed, which is used to decide from which of the two the current word is selected as a word of the image description. It can be understood that the words selected by the model obtained based on the dynamic pointer network are spliced into a sentence, and the spliced sentence is used as the image description result.
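A toy sketch of the selection idea, choosing each output word from whichever source scores higher (the scoring itself is what the dynamic pointer network models; the names and numbers below are made up for illustration, not taken from this disclosure):

```python
import numpy as np

def pointer_select(vocab_scores, ocr_scores, vocab, ocr_tokens):
    # Pick the current output word from either the fixed vocabulary
    # (the predefined code table) or the recognized OCR tokens.
    all_scores = np.concatenate([vocab_scores, ocr_scores])
    idx = int(all_scores.argmax())
    return vocab[idx] if idx < len(vocab) else ocr_tokens[idx - len(vocab)]

vocab = ["a", "sign", "on", "the", "wall", "red"]
ocr_tokens = ["Mornington", "Crescent"]
print(pointer_select(np.array([0.1, 0.3, 0.2, 0.1, 0.1, 0.1]),
                     np.array([0.6, 0.4]), vocab, ocr_tokens))  # Mornington
```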
In one embodiment, the above step of performing feature fusion according to each target feature to obtain the first fusion feature includes:
S511: acquiring one target feature as the target feature to be processed;
S512: encoding the target position information in the target feature to be processed to obtain a target position code;
S513: linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, obtaining the first fusion feature corresponding to the target feature to be processed;
S514: repeating the step of acquiring one target feature as the target feature to be processed until the acquisition of the target features is completed.
In this embodiment, the various features of each target feature are linearly transformed and mapped into the vector embedding space of the first preset dimension, providing the basis for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S511, any one target feature is acquired from the target features, and the acquired target feature is used as the target feature to be processed.
For S512, the target position information in the target feature to be processed is encoded to obtain the target position code.
The target position code can be expressed as:

$$x^{loc} = \left[\frac{x_{min}}{W_{im}},\ \frac{y_{min}}{H_{im}},\ \frac{x_{max}}{W_{im}},\ \frac{y_{max}}{H_{im}}\right]$$

where $x_{min}$ is the minimum horizontal pixel position, $y_{min}$ the minimum vertical pixel position, $x_{max}$ the maximum horizontal pixel position, $y_{max}$ the maximum vertical pixel position, $W_{im}$ the width, and $H_{im}$ the height.
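A direct transcription of the position code reconstructed above (the pairing of each coordinate with the image width or height follows that reconstruction):

```python
def encode_position(x_min, y_min, x_max, y_max, w_im, h_im):
    # 4-dimensional relative bounding-box encoding
    return [x_min / w_im, y_min / h_im, x_max / w_im, y_max / h_im]

print(encode_position(48, 120, 320, 360, w_im=640, h_im=480))
# [0.075, 0.25, 0.5, 0.75]
```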
For S513, the target appearance feature and the target position code in the target feature to be processed are linearly transformed so as to map them into the vector embedding space of the first preset dimension, expressed as:

$$x^{obj} = \mathrm{LN}(W_1\, x^{fr}) + \mathrm{LN}(W_2\, x^{loc})$$

where LN is Linear Normalization, i.e. a linear normalization operation; $W_1$ and $W_2$ are preset constants; $x^{fr}$ is the target appearance feature in the target feature to be processed; and $x^{loc}$ is the target position code. The result of this formula is the first fusion feature corresponding to the target feature to be processed.
For S514, steps S511 to S514 are repeated until the acquisition of the target features is completed. When the acquisition of the target features is completed, the feature fusion of every target feature has been completed.
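A minimal PyTorch sketch of S513 under the formula above. The feature dimensions (a 2048-d appearance feature, a 768-d embedding space) are assumptions, and LN is realized here as a normalization layer with W1 and W2 as linear maps:

```python
import torch
import torch.nn as nn

d = 768                                   # first preset dimension (assumed)
W1 = nn.Linear(2048, d, bias=False)       # maps the appearance feature
W2 = nn.Linear(4, d, bias=False)          # maps the 4-d position code
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

x_fr = torch.randn(1, 2048)               # target appearance feature
x_loc = torch.rand(1, 4)                  # target position code
x_obj = ln1(W1(x_fr)) + ln2(W2(x_loc))    # first fusion feature
print(x_obj.shape)                        # torch.Size([1, 768])
```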
In one embodiment, the above step of performing feature fusion on each text to be analyzed according to the image to be described to obtain the second fusion feature includes:
S521: acquiring any one text to be analyzed as the text to be processed;
S522: performing word vector encoding of a second preset dimension according to the text to be processed to obtain a text block word vector;
S523: performing image feature extraction from the image to be described according to the text to be processed to obtain a text block image feature;
S524: performing text encoding of a third preset dimension on the text to be processed to obtain a text block encoding feature;
S525: determining position information of the text to be processed according to the image to be described to obtain text position information;
S526: encoding the text position information to obtain a text position code;
S527: linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, obtaining the second fusion feature corresponding to the text to be processed;
S528: repeating the step of acquiring any one text to be analyzed as the text to be processed until the acquisition of the texts to be analyzed is completed.
In this embodiment, the various features of each text to be analyzed are linearly transformed and mapped into the vector embedding space of the fourth preset dimension, providing the basis for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S521, any one text to be analyzed is acquired from the texts to be analyzed, and the acquired text is used as the text to be processed.
For S522, FastText word vector encoding of the second preset dimension is performed according to the text to be processed, and the encoded FastText word vector is used as the text block word vector.
Optionally, the second preset dimension is set to 300.
A FastText word vector is a word vector generated with FastText (a shallow network), which is a tool for word vector computation and text classification.
For S523, a model obtained based on Faster RCNN (a target detection algorithm) is used to extract, from the image to be described, the image features corresponding to the text to be processed as the text block image feature.
For S524, the PHOC (Pyramidal Histogram of Characters) encoding method is used to perform text encoding of the third preset dimension on the text to be processed, and the encoded data is used as the text block encoding feature.
Optionally, the third preset dimension is set to 604.
For S525, according to the image to be described, position information is determined for the text region corresponding to the text to be processed, and the determined position information is used as the text position information.
The text position information is the position information, in the image to be described, of the text region corresponding to the text to be processed.
For S526, the text position information is encoded, and the encoded data is used as the text position code.
For S527, the text block word vector, the text block image feature, the text block encoding feature, and the text position code are linearly transformed so as to map them into the vector embedding space of the fourth preset dimension, expressed as:

$$x^{txt} = \mathrm{LN}(W_3\, x^{ft} + W_4\, x^{fr} + W_5\, x^{ph}) + \mathrm{LN}(W_6\, x^{pos})$$

where LN is Linear Normalization, i.e. a linear normalization operation; $W_3$ through $W_6$ are preset constants; $x^{ft}$ is the text block word vector; $x^{fr}$ is the text block image feature; $x^{ph}$ is the text block encoding feature; and $x^{pos}$ is the text position code. The result of this formula is the second fusion feature corresponding to the text to be processed.
For S528, steps S521 to S528 are repeated until the acquisition of the texts to be analyzed is completed. When the acquisition of the texts to be analyzed is completed, the feature fusion of every text to be analyzed has been completed.
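By analogy with the sketch for S513, S527 can be sketched as follows; the 300-d and 604-d inputs come from the optional dimensions above, while the 2048-d image feature and the 768-d embedding space are assumptions:

```python
import torch
import torch.nn as nn

d = 768                             # fourth preset dimension (assumed)
W3 = nn.Linear(300, d, bias=False)  # text block word vector (FastText, 300-d)
W4 = nn.Linear(2048, d, bias=False) # text block image feature (assumed 2048-d)
W5 = nn.Linear(604, d, bias=False)  # text block encoding feature (PHOC, 604-d)
W6 = nn.Linear(4, d, bias=False)    # text position code
ln_a, ln_b = nn.LayerNorm(d), nn.LayerNorm(d)

x_ft, x_fr = torch.randn(1, 300), torch.randn(1, 2048)
x_ph, x_pos = torch.randn(1, 604), torch.rand(1, 4)
x_txt = ln_a(W3(x_ft) + W4(x_fr) + W5(x_ph)) + ln_b(W6(x_pos))
print(x_txt.shape)                  # torch.Size([1, 768]), the second fusion feature
```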
Referring to Fig. 2, the present application further provides an artificial intelligence-based image description generation apparatus, the apparatus including:
an image acquisition module 100, configured to acquire an image to be described;
a text region detection module 200, configured to perform text region detection according to the image to be described;
a text recognition module 300, configured to perform text recognition on each text region according to the image to be described to obtain text to be analyzed;
a target feature extraction module 400, configured to perform target feature extraction according to the image to be described;
an image description generation module 500, configured to perform image description generation according to the image to be described, each text to be analyzed, and each target feature based on the method of multimodal feature fusion to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
Referring to Fig. 3, an embodiment of the present application further provides a computer device, which may be a server whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as that used by the artificial intelligence-based image description generation method. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements an artificial intelligence-based image description generation method, where the method includes the steps of:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
performing target feature extraction according to the image to be described;
performing, based on the method of multimodal feature fusion, image description generation according to the image to be described, each text to be analyzed, and each target feature to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements an artificial intelligence-based image description generation method, where the method includes the steps of:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
performing target feature extraction according to the image to be described;
performing, based on the method of multimodal feature fusion, image description generation according to the image to be described, each text to be analyzed, and each target feature to obtain an image description result.
In the artificial intelligence-based image description generation method executed above, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium or in a volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprise", "include", or any of their variants are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
The above are only preferred embodiments of the present application and do not therefore limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

1. An artificial intelligence-based image description generation method, wherein the method comprises:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
    performing target feature extraction according to the image to be described;
    performing, based on a method of multimodal feature fusion, image description generation according to the image to be described, each said text to be analyzed, and each said target feature to obtain an image description result.
2. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling processing on the image to be described to obtain downsampled features;
    performing upsampling processing on the downsampled features to obtain upsampled features;
    performing cascade processing on the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map to obtain a differentiable binarization map;
    performing text region generation according to the differentiable binarization map.
3. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing text recognition on each said text region according to the image to be described to obtain the text to be analyzed comprises:
    extracting, according to each said text region, an image block from the image to be described to obtain a text image block;
    performing, with a model obtained based on a convolutional neural network, feature map extraction at a preset height on each said text image block to obtain a feature atlas;
    sorting the feature maps in each said feature atlas by position to obtain a time-series feature atlas;
    inputting each said time-series feature atlas into a model obtained based on a recurrent neural network for text recognition to obtain the text to be analyzed corresponding to each said text region, wherein the preset labels in a preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, and the preset labels include text and a placeholder.
4. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described to obtain an image feature map to be analyzed;
    performing target candidate region extraction according to the image to be described with a model obtained based on a region proposal network;
    extracting, according to each said target candidate region, image features from the image feature map to be analyzed to obtain target appearance features;
    performing classification prediction according to each said regional image feature to obtain a classification prediction result, wherein the classification labels of the classification prediction include a plurality of object labels and one background label;
    performing position regression processing according to each said target appearance feature to obtain target position information;
    generating a mask map according to each said target appearance feature with a model obtained based on a fully convolutional network to obtain a target mask map;
    taking the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one said target feature.
5. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing image description generation according to the image to be described, each said text to be analyzed, and each said target feature based on the method of multimodal feature fusion to obtain the image description result comprises:
    performing feature fusion according to each said target feature to obtain a first fusion feature;
    performing feature fusion on each said text to be analyzed according to the image to be described to obtain a second fusion feature;
    performing word prediction according to each said first fusion feature and each said second fusion feature with a model obtained based on an iterative Transformer to obtain a word prediction result;
    performing image description generation according to the word prediction result and each said text to be analyzed with a model obtained based on a dynamic pointer network to obtain the image description result.
6. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing feature fusion according to each said target feature to obtain the first fusion feature comprises:
    acquiring one said target feature as a target feature to be processed;
    encoding the target position information in the target feature to be processed to obtain a target position code;
    linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, obtaining the first fusion feature corresponding to the target feature to be processed;
    repeating the step of acquiring one said target feature as the target feature to be processed until the acquisition of the target features is completed.
7. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing feature fusion on each said text to be analyzed according to the image to be described to obtain the second fusion feature comprises:
    acquiring any one said text to be analyzed as a text to be processed;
    performing word vector encoding of a second preset dimension according to the text to be processed to obtain a text block word vector;
    performing image feature extraction from the image to be described according to the text to be processed to obtain a text block image feature;
    performing text encoding of a third preset dimension on the text to be processed to obtain a text block encoding feature;
    determining position information of the text to be processed according to the image to be described to obtain text position information;
    encoding the text position information to obtain a text position code;
    linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, obtaining the second fusion feature corresponding to the text to be processed;
    repeating the step of acquiring any one said text to be analyzed as the text to be processed until the acquisition of the texts to be analyzed is completed.
8. An artificial intelligence-based image description generation apparatus, wherein the apparatus comprises:
    an image acquisition module, configured to acquire an image to be described;
    a text region detection module, configured to perform text region detection according to the image to be described;
    a text recognition module, configured to perform text recognition on each said text region according to the image to be described to obtain text to be analyzed;
    a target feature extraction module, configured to perform target feature extraction according to the image to be described;
    an image description generation module, configured to perform image description generation according to the image to be described, each said text to be analyzed, and each said target feature based on a method of multimodal feature fusion to obtain an image description result.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an artificial intelligence-based image description generation method;
    wherein the artificial intelligence-based image description generation method comprises the steps of:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each said text region according to the image to be described to obtain text to be analyzed;
    performing target feature extraction according to the image to be described;
    performing, based on a method of multimodal feature fusion, image description generation according to the image to be described, each said text to be analyzed, and each said target feature to obtain an image description result.
  10. The computer device according to claim 9, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling on the image to be described to obtain downsampled features;
    performing upsampling on the downsampled features to obtain upsampled features;
    concatenating the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map; and
    generating the text regions according to the differentiable binarization map.
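[Editorial note] The final computation of claim 10 matches the differentiable binarization used in DBNet (CN110781967B, cited under Family Cites below): B = 1 / (1 + e^(-k(P - T))). A minimal sketch; the amplification factor k = 50 follows the DBNet paper, since the application itself fixes no value:

```python
import numpy as np

def differentiable_binarization(prob_map: np.ndarray,
                                thresh_map: np.ndarray,
                                k: float = 50.0) -> np.ndarray:
    """Approximate step function B = 1 / (1 + exp(-k * (P - T))).
    Pixels whose text probability exceeds the learned per-pixel threshold
    approach 1, others approach 0, while the map stays differentiable."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```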
  11. The computer device according to claim 9, wherein the step of performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed, comprises:
    extracting an image patch from the image to be described according to each text region, to obtain text image patches;
    extracting feature maps of a preset height from each text image patch by using a model based on a convolutional neural network, to obtain feature map sets;
    sorting the feature maps in each feature map set by position, to obtain time-series feature map sets; and
    inputting each time-series feature map set into a model based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each text region, wherein each preset label in a preset label dictionary serves as a prediction label for the output dimension of the embedding layer of the recurrent-neural-network-based model, the preset labels comprising text characters and a placeholder.
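[Editorial note] Claim 11's arrangement, CNN column features read in positional order by a recurrent network and classified over a label dictionary that ends in a placeholder, is the familiar CRNN/CTC pattern. A sketch under that reading; the feature width, hidden size, and character count are assumptions, and a plain linear classifier stands in for the claim's "embedding layer":

```python
import torch
import torch.nn as nn

class SequenceRecognizer(nn.Module):
    """Sketch: read CNN column features in order with a bidirectional LSTM
    and score every position over the label dictionary plus a placeholder."""

    def __init__(self, feat_dim=512, hidden=256, num_chars=6623):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # +1 output for the placeholder label, as the claim requires.
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, column_features):  # (batch, seq_len, feat_dim)
        seq, _ = self.rnn(column_features)
        return self.classifier(seq)      # per-position label scores
```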
  12. The computer device according to claim 9, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described, to obtain an image feature map to be analyzed;
    extracting target candidate regions according to the image to be described, by using a model based on a region proposal network;
    extracting image features from the image feature map to be analyzed according to each target candidate region, to obtain target appearance features;
    performing classification prediction according to each of the extracted region image features, to obtain a classification prediction result, wherein the classification labels of the classification prediction comprise a plurality of object labels and one background label;
    performing position regression according to each target appearance feature, to obtain target position information;
    generating a mask map according to each target appearance feature, by using a model based on a fully convolutional network, to obtain target mask maps; and
    using the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
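[Editorial note] Claim 12's combination of a region proposal network, per-region appearance features, box regression, object-plus-background classification, and a fully convolutional mask head mirrors a Mask R-CNN-style detector. A sketch using the torchvision reference model as a stand-in; this substitution is ours, not the application's network, and `weights=None` yields random weights (pass pretrained weights in practice):

```python
import torch
import torchvision

# Stand-in detector producing, per candidate region, the three components
# the claim bundles into one "target feature": position (regressed box),
# class label, and a mask map.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
model.eval()

image = torch.rand(3, 480, 640)  # placeholder for the image to be described
with torch.no_grad():
    pred = model([image])[0]

target_features = [
    {"position": box, "label": label, "mask": mask}
    for box, label, mask in zip(pred["boxes"], pred["labels"], pred["masks"])
]
```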
  13. The computer device according to claim 9, wherein the step of performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features based on the multimodal feature fusion method, to obtain the image description result, comprises:
    performing feature fusion according to each target feature, to obtain first fusion features;
    performing feature fusion on each text to be analyzed according to the image to be described, to obtain second fusion features;
    performing word prediction according to each of the first fusion features and each of the second fusion features, by using a model based on an iterative Transformer, to obtain a word prediction result; and
    performing image description generation according to the word prediction result and each text to be analyzed, by using a model based on a dynamic pointer network, to obtain the image description result.
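[Editorial note] Claim 13 pairs an iterative Transformer with a dynamic pointer network so the decoder can either emit a fixed-vocabulary word or copy one of the recognized scene texts verbatim. A sketch of the pointer step under that reading; the bilinear scoring form and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DynamicPointer(nn.Module):
    """Sketch: score the decoder hidden state both against a fixed
    vocabulary and against the embeddings of the OCR'd texts, so a
    scene-text token can be copied into the caption."""

    def __init__(self, d_model=768, vocab_size=30000):
        super().__init__()
        self.vocab_head = nn.Linear(d_model, vocab_size)
        self.ptr_query = nn.Linear(d_model, d_model)
        self.ptr_key = nn.Linear(d_model, d_model)

    def forward(self, hidden, ocr_embeddings):
        # hidden: (batch, d_model); ocr_embeddings: (batch, n_ocr, d_model)
        vocab_scores = self.vocab_head(hidden)           # fixed-vocabulary words
        q = self.ptr_query(hidden).unsqueeze(1)          # (batch, 1, d_model)
        k = self.ptr_key(ocr_embeddings)                 # (batch, n_ocr, d_model)
        copy_scores = (q * k).sum(-1)                    # (batch, n_ocr)
        # Predicted token = argmax over vocabulary and OCR tokens jointly.
        return torch.cat([vocab_scores, copy_scores], dim=-1)
```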
  14. The computer device according to claim 9, wherein the step of performing feature fusion according to each target feature, to obtain the first fusion features, comprises:
    acquiring one of the target features as a target feature to be processed;
    encoding the target position information in the target feature to be processed, to obtain a target position code;
    applying a linear transformation to the target appearance feature in the target feature to be processed and the target position code so as to map them into a vector embedding space of a first preset dimension, to obtain the first fusion feature corresponding to the target feature to be processed; and
    repeating the step of acquiring one of the target features as the target feature to be processed, until acquisition of all the target features is completed.
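[Editorial note] Claim 14 reduces to a single learned projection: concatenate each target's appearance feature with its encoded position and map the result into the first preset-dimension embedding space. A minimal sketch, assuming a normalized 4-value box code and illustrative sizes:

```python
import torch
import torch.nn as nn

class TargetFeatureFusion(nn.Module):
    """Sketch: project appearance feature + position code into the (first)
    preset-dimension embedding space. All sizes are assumptions."""

    def __init__(self, d_appearance=2048, d_pos=4, d_embed=768):
        super().__init__()
        self.proj = nn.Linear(d_appearance + d_pos, d_embed)

    def forward(self, appearance, box):
        # box: normalized (x1, y1, x2, y2) serving as the target position code
        return self.proj(torch.cat([appearance, box], dim=-1))
```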
  15. The computer device according to claim 9, wherein the step of performing feature fusion on each text to be analyzed according to the image to be described, to obtain the second fusion features, comprises:
    acquiring any one of the texts to be analyzed as a text to be processed;
    performing word vector encoding of a second preset dimension according to the text to be processed, to obtain a text block word vector;
    extracting image features from the image to be described according to the text to be processed, to obtain a text block image feature;
    performing text encoding of a third preset dimension on the text to be processed, to obtain a text block encoding feature;
    determining position information of the text to be processed according to the image to be described, to obtain text position information;
    encoding the text position information to obtain a text position code;
    applying a linear transformation to the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, to obtain the second fusion feature corresponding to the text to be processed; and
    repeating the step of acquiring any one of the texts to be analyzed as the text to be processed, until acquisition of all the texts to be analyzed is completed.
  16. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements an artificial intelligence-based image description generation method;
    wherein the artificial intelligence-based image description generation method comprises the steps of:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed;
    performing target feature extraction according to the image to be described; and
    performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features, based on a multimodal feature fusion method, to obtain an image description result.
  17. The computer-readable storage medium according to claim 16, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling on the image to be described to obtain downsampled features;
    performing upsampling on the downsampled features to obtain upsampled features;
    concatenating the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map; and
    generating the text regions according to the differentiable binarization map.
  18. The computer-readable storage medium according to claim 16, wherein the step of performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed, comprises:
    extracting an image patch from the image to be described according to each text region, to obtain text image patches;
    extracting feature maps of a preset height from each text image patch by using a model based on a convolutional neural network, to obtain feature map sets;
    sorting the feature maps in each feature map set by position, to obtain time-series feature map sets; and
    inputting each time-series feature map set into a model based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each text region, wherein each preset label in a preset label dictionary serves as a prediction label for the output dimension of the embedding layer of the recurrent-neural-network-based model, the preset labels comprising text characters and a placeholder.
  19. The computer-readable storage medium according to claim 16, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described, to obtain an image feature map to be analyzed;
    extracting target candidate regions according to the image to be described, by using a model based on a region proposal network;
    extracting image features from the image feature map to be analyzed according to each target candidate region, to obtain target appearance features;
    performing classification prediction according to each of the extracted region image features, to obtain a classification prediction result, wherein the classification labels of the classification prediction comprise a plurality of object labels and one background label;
    performing position regression according to each target appearance feature, to obtain target position information;
    generating a mask map according to each target appearance feature, by using a model based on a fully convolutional network, to obtain target mask maps; and
    using the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
  20. The computer-readable storage medium according to claim 16, wherein the step of performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features based on the multimodal feature fusion method, to obtain the image description result, comprises:
    performing feature fusion according to each target feature, to obtain first fusion features;
    performing feature fusion on each text to be analyzed according to the image to be described, to obtain second fusion features;
    performing word prediction according to each of the first fusion features and each of the second fusion features, by using a model based on an iterative Transformer, to obtain a word prediction result; and
    performing image description generation according to the word prediction result and each text to be analyzed, by using a model based on a dynamic pointer network, to obtain the image description result.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210028089.7 2022-01-11
CN202210028089.7A CN114387430B (en) 2022-01-11 2022-01-11 Image description generation method, device, equipment and medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
WO2023134073A1 (en)

Family

ID=81201154

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090158 WO2023134073A1 (en) 2022-01-11 2022-04-29 Artificial intelligence-based image description generation method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN114387430B (en)
WO (1) WO2023134073A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387430B (en) * 2022-01-11 2024-05-28 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence
CN114627353B (en) * 2022-03-21 2023-12-12 北京有竹居网络技术有限公司 Image description generation method, device, equipment, medium and product
CN114821271B (en) * 2022-05-19 2022-09-16 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium
CN116051811B (en) * 2023-03-31 2023-07-04 深圳思谋信息科技有限公司 Region identification method, device, computer equipment and computer readable storage medium
CN116912851A (en) * 2023-07-25 2023-10-20 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
CN110033008A * 2019-04-29 2019-07-19 同济大学 A kind of image description generation method concluded based on modal transformation and text
CN111368118A (en) * 2020-02-13 2020-07-03 中山大学 Image description generation method, system, device and storage medium
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN114387430A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102905056B (en) * 2012-10-18 2015-09-02 利亚德光电股份有限公司 Method of video image processing and device
KR20230129195A (en) * 2017-04-25 2023-09-06 더 보드 어브 트러스티스 어브 더 리랜드 스탠포드 주니어 유니버시티 Dose reduction for medical imaging using deep convolutional neural networks
CN110781967B (en) * 2019-10-29 2022-08-19 华中科技大学 Real-time text detection method based on differentiable binarization
CN111581510B (en) * 2020-05-07 2024-02-09 腾讯科技(深圳)有限公司 Shared content processing method, device, computer equipment and storage medium
CN113806587A (en) * 2021-08-24 2021-12-17 西安理工大学 Multi-mode feature fusion video description text generation method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630465A (en) * 2023-07-24 2023-08-22 海信集团控股股份有限公司 Model training and image generating method and device
CN116630465B (en) * 2023-07-24 2023-10-24 海信集团控股股份有限公司 Model training and image generating method and device
CN116843030A (en) * 2023-09-01 2023-10-03 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116912629B (en) * 2023-09-04 2023-12-29 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116912629A (en) * 2023-09-04 2023-10-20 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116935418B (en) * 2023-09-15 2023-12-05 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN117593392A (en) * 2023-09-27 2024-02-23 书行科技(北京)有限公司 Image generation method, device, computer equipment and computer readable storage medium
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117557883A (en) * 2024-01-12 2024-02-13 中国科学技术大学 Medical multi-mode content analysis and generation method based on pathology alignment diffusion network
CN117829098A (en) * 2024-03-06 2024-04-05 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment
CN117829098B (en) * 2024-03-06 2024-05-28 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment

Also Published As

Publication number Publication date
CN114387430A (en) 2022-04-22
CN114387430B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
WO2023134073A1 (en) Artificial intelligence-based image description generation method and apparatus, device, and medium
RU2691214C1 (en) Text recognition using artificial intelligence
Kang et al. Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition
CN109524006B (en) Chinese mandarin lip language identification method based on deep learning
US10354168B2 (en) Systems and methods for recognizing characters in digitized documents
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
US20240013005A1 (en) Method and system for identifying citations within regulatory content
CN112541355B (en) Entity boundary type decoupling few-sample named entity recognition method and system
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN111160348A (en) Text recognition method for natural scene, storage device and computer equipment
Feng et al. Focal CTC loss for chinese optical character recognition on unbalanced datasets.
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN113486669A (en) Semantic recognition method for emergency rescue input voice
CN111914654A (en) Text layout analysis method, device, equipment and medium
CN115546506A (en) Image identification method and system based on double-pooling channel attention and cavity convolution
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN116229482A (en) Visual multi-mode character detection recognition and error correction method in network public opinion analysis
CN114445808A (en) Swin transform-based handwritten character recognition method and system
CN113159053A (en) Image recognition method and device and computing equipment
Huang et al. Attention after attention: Reading text in the wild with cross attention
CN115186670B (en) Method and system for identifying domain named entities based on active learning
Hoxha et al. Retrieving images with generated textual descriptions
Li et al. Deep neural network with attention model for scene text recognition
CN115019319A (en) Structured picture content identification method based on dynamic feature extraction

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22919714

Country of ref document: EP

Kind code of ref document: A1