CN114743018A - Image description generation method, device, equipment and medium - Google Patents

Image description generation method, device, equipment and medium Download PDF

Info

Publication number
CN114743018A
Authority
CN
China
Prior art keywords
image
preset
attention
detected
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210423256.8A
Other languages
Chinese (zh)
Other versions
CN114743018B (en)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210423256.8A priority Critical patent/CN114743018B/en
Publication of CN114743018A publication Critical patent/CN114743018A/en
Application granted granted Critical
Publication of CN114743018B publication Critical patent/CN114743018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and provides an image description generation method, device, equipment and medium. The method comprises the following steps: inputting an image to be detected into a preset target detection model for identification, and outputting the region characteristics of the image to be detected; inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected; inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting the output value of the encoder; and inputting the output value and the category embedding into a decoder of the preset transformation model for processing to generate a description text of the image to be detected. The invention also relates to the technical field of blockchains, and the region characteristics and the category embedding can be stored in a node of a blockchain.

Description

Image description generation method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an image description generation method, device, equipment and medium.
Background
Image description (Image Captioning) is a comprehensive emerging discipline that merges computer vision technology, natural language processing technology, and machine learning technology. The purpose of image description is to automatically generate a piece of descriptive text according to the picture content.
With the popularity of the Transformer model in the NLP field, many Transformer-based image description methods have been developed and have demonstrated better performance than most conventional methods. Compared with the Transformer used for natural language processing, these methods improve the input position coding and the attention-mechanism module in the encoder portion so that the model better adapts to images as input.
However, the current method cannot integrate abstract features such as the relationship between image objects and the mapping relationship between the objects and corresponding labels into an attention mechanism, and the obtained description information is not accurate and rich enough.
Disclosure of Invention
In view of the above, the present invention provides an image description generation method, apparatus, device and medium, which aims to solve the technical problem in the prior art that the description information generated for an image is not accurate and rich enough.
In order to achieve the above object, the present invention provides an image description generating method, including:
inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected;
inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
and inputting the output value and the category embedding into a decoder of the preset transformation model for processing, to generate a description text of the image to be detected.
Preferably, the inputting the image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected includes:
according to a preset geometric relation calculation formula, carrying out frame recognition on a target contained in the image to be detected to obtain frames of the target and a target category of each frame;
and adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
Preferably, the preset geometric relationship calculation formula includes:

ξ(a, b) = (log(|x_a − x_b| / w_a), log(|y_a − y_b| / h_a), log(w_b / w_a), log(h_b / h_a))^T

where ξ(a, b) is the regional relation characteristic of the image to be detected, (x_a, y_a) are the center-point coordinates of the a-th frame of the image to be detected, (x_b, y_b) are the center-point coordinates of the b-th frame of the image to be detected, (w_a, h_a) are the width and height of the a-th frame, and (w_b, h_b) are the width and height of the b-th frame.
Preferably, the inputting the region feature into a preset tag attention model for weight calculation and outputting the category embedding of the image to be detected includes:
matching the target category of the image to be detected with preset words of a preset multi-dimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
and coding and embedding the predictive words according to a preset first attention formula to obtain the category embedding of the image to be detected.
Preferably, the preset tag attention model includes a plurality of attention modules, each of which includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
and A3, repeating A1-A2 to obtain the weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
Preferably, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-head self-attention sublayer and a position feedforward sublayer, the multi-head self-attention sublayer includes a plurality of parallel head modules, the inputting the region feature into the encoder of the preset transformation model for processing, and outputting the output value of the encoder includes:
b1, inputting the geometric features of the region features into a matrix of a first parallel head module in a first coding layer for weight calculation according to a preset second attention calculation formula, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
b3, repeating B1-B2 to obtain the result values of all the parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result values into the position feedforward sub-layer block for nonlinear transformation, and inputting the transformed result values into a second coding layer of the coder;
b4, repeating B1-B3 to obtain the output values of all the encoding layers.
Preferably, the decoder includes a plurality of identical decoding layers, each decoding layer includes a masked multi-head self-attention sublayer, a multi-head cross-attention sublayer and a position forward sublayer, and the inputting of the output value and the category embedding into the decoder of the preset transformation model for processing to generate the description text of the image to be detected includes:
position embedding is carried out on the output value of the last encoding layer to serve as the input of the masked multi-head self-attention sublayer, and an input word vector is obtained;
embedding and inputting each output value, the input word vector and the category of the target into the multi-head cross attention sublayer for cross attention calculation to obtain a weight matrix;
and inputting the weight matrix into the position forward sublayer to calculate to generate a plurality of keywords, and splicing all the keywords to generate a description text of the image to be detected.
To achieve the above object, the present invention also provides an image description generating apparatus, comprising:
an identification module: used for inputting an image to be detected into a preset target detection model for identification, and outputting the region characteristics of the image to be detected;
a calculation module: used for inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected;
an output module: used for inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
a generation module: used for inputting the output value and the category embedding into a decoder of the preset transformation model for processing, to generate a description text of the image to be detected.
In order to achieve the above object, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the image description generation method of any one of claims 1 to 7.
To achieve the above object, the present invention further provides a computer-readable medium storing an image description generation program which, when executed by a processor, implements the steps of the image description generation method according to any one of claims 1 to 7.
The invention is composed of a preset target detection model, a preset label attention model and a Transformer model (a preset transformation model). And identifying and classifying the image to be detected according to a preset target detection model, establishing a geometric relationship and a position relationship between any two targets by combining an identification frame in the identification process, and outputting the category and the region characteristics of the target of the image to be detected.
According to the preset label attention model and the preset multi-dimensional dictionary, important category embeddings are assigned to targets that frequently appear in the region characteristics to serve as new labels, and the category embedding and the output values of the encoder of the preset transformation model are input into the decoding stage of the preset transformation model to generate the description information of the image to be detected. This improves the correctness of the relations between targets in the image description information, so that the description content is richer.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a preferred embodiment of the image description generation method of the present invention;
FIG. 2 is a block diagram of an image description generating apparatus according to a preferred embodiment of the present invention;
FIG. 3 is a diagram of an electronic device according to a preferred embodiment of the present invention;
the objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The invention provides an image description generation method. Referring to fig. 1, a method flow diagram of an embodiment of the image description generation method of the present invention is shown. The method may be performed by an electronic device, which may be implemented by software and/or hardware. The image description generation method includes the following steps S10-S40:
step S10: and inputting the image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected.
The specific step S10 includes:
according to a preset geometric relation calculation formula, carrying out frame recognition on a target contained in the image to be detected to obtain frames of the target and a target category of each frame;
and adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
In this embodiment, the preset target detection model includes, but is not limited to, the Faster R-CNN target detection model, which integrates feature extraction, bounding-box regression, and classification. The MSCOCO data set is taken as the classification database of the preset target detection model, and the preset target detection model carries out target detection on the input image to be detected as follows: first, the convolutional layers (Conv layers) of the preset target detection model extract image features from the image; the image features are shared by the RPN layer and the fully connected layer for calculation, and frame identification is carried out according to the preset geometric relation calculation formula, obtaining the frames of the targets and the target category of each frame, wherein the target categories comprise foreground information and background information (for example, the foreground information is a target object of the image).
In the process of frame identification, in order to better cover the characteristics of the image to be detected, the frame is coded, and four coordinate parameters

(x, y, w, h)

represent the position information of the anchor point and the real frame; the four coordinate parameters respectively represent the center-point coordinates, the width and the height of the target frame. Through linear regression learning of these four scalars, the anchor point is continuously brought closer to the real frame, so that the frame of the target in the image to be detected is accurately obtained.
To facilitate description text generation, a bilinear interpolation method is applied to the image feature mapping region corresponding to the target, and the size of the frame is adjusted to a preset range (for example, to the edge pixels of the target in the image), finally obtaining the region characteristics of the input image. In order to take the geometric relationship and the position relationship between targets into account when generating the description text, a geometric relationship and a position relationship can be established between any two targets based on the frames obtained in target detection, where the geometric relationship represents the relationship between different targets in the image to be detected, and the position relationship represents the position of a target in the image to be detected.
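As an illustrative sketch of this pooling step (not the patent's implementation), torchvision's roi_align resizes each detected frame to a fixed output size via bilinear interpolation; the feature-map shape, the 1/16 stride and the 7×7 output size below are assumptions:

```python
import torch
from torchvision.ops import roi_align

# Toy backbone feature map for one image and two detected frames;
# each frame row is (batch_index, x1, y1, x2, y2) in image coordinates.
feature_map = torch.randn(1, 256, 38, 50)
frames = torch.tensor([[0., 45., 60., 210., 300.],
                       [0., 10., 20., 120., 150.]])

region_features = roi_align(
    feature_map, frames,
    output_size=(7, 7),      # every frame is resized to the same preset range
    spatial_scale=1.0 / 16,  # assumed stride mapping image coords to the map
    aligned=True,            # bilinear interpolation without coordinate offset
)
print(region_features.shape)  # torch.Size([2, 256, 7, 7])
```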
In one embodiment, the preset geometric relationship calculation formula includes:

ξ(a, b) = (log(|x_a − x_b| / w_a), log(|y_a − y_b| / h_a), log(w_b / w_a), log(h_b / h_a))^T

where ξ(a, b) is the regional relation characteristic of the image to be detected, (x_a, y_a) are the center-point coordinates of the a-th frame of the image to be detected, (x_b, y_b) are the center-point coordinates of the b-th frame of the image to be detected, (w_a, h_a) are the width and height of the a-th frame, and (w_b, h_b) are the width and height of the b-th frame.

The geometric relationship and the position relationship between two targets of the image to be detected can be obtained through ξ(a, b), and the geometric feature η^G_(a,b) between targets in different regions can be obtained through transformation. The geometric feature η^G_(a,b) is calculated by formula (2):

η^G_(a,b) = ReLU(Emb(ξ(a, b)) · w_G)    (2)

where Emb embeds the geometric features between targets, mapping the relation vector ξ(a, b) between targets to a higher dimension, and w_G is a learnable vector that projects the embedded vector to a scalar.
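A minimal Python sketch of this step, assuming the log-ratio form of ξ(a, b) given above and a 64-dimensional embedding (both assumptions):

```python
import torch
import torch.nn as nn

def xi(box_a, box_b):
    # Pairwise relation xi(a, b); each box is (x_center, y_center, w, h).
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return torch.stack([
        torch.log((xa - xb).abs().clamp(min=1e-3) / wa),
        torch.log((ya - yb).abs().clamp(min=1e-3) / ha),
        torch.log(wb / wa),
        torch.log(hb / ha),
    ])

emb = nn.Linear(4, 64)              # Emb: map the relation vector to 64 dims
w_g = nn.Linear(64, 1, bias=False)  # w_G: project the embedding to a scalar

box_a = torch.tensor([100.0, 80.0, 50.0, 40.0])
box_b = torch.tensor([160.0, 90.0, 30.0, 60.0])
eta_g = torch.relu(w_g(emb(xi(box_a, box_b))))  # scalar geometric feature
print(eta_g.item())
```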
Step S20: and inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected.
The specific step S20 includes:
matching the target category of the image to be detected with preset words of a preset multi-dimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
and coding and embedding the predictive words according to a preset first attention formula to obtain the category embedding of the image to be detected.
In an embodiment, the preset tag attention model includes a plurality of attention modules, each of which includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
and A3, repeating A1-A2 to obtain the weight values of all attention modules, splicing all weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
In one embodiment, the preset matching formula includes:

L_i = Emb(D(w_j)), when C_i == D(w_j)

where L_i is the i-th target label of the image to be detected, C_i is the i-th target category of the image to be detected, and C_i == D(w_j) means that the i-th target category of the image to be detected corresponds to the j-th preset word of the preset multi-dimensional dictionary.
In one embodiment, in order to give more weight to more important and more frequently occurring target categories, the ranking of the i-th target label among all detected targets is calculated on the basis of L_i, specifically: R_i = L_i * Pr(C_i), where Pr(C_i) is the probability of the target category corresponding to the i-th target label among all categories.
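A toy sketch of the matching and ranking steps, with a hypothetical dictionary and made-up class probabilities (all values are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical dictionary D and toy detections; every value is illustrative.
dictionary = {"person": 0, "dog": 1, "frisbee": 2}  # preset multi-dimensional dictionary
emb = nn.Embedding(len(dictionary), 64)             # Emb in L_i = Emb(D(w_j))

detections = [("dog", 0.82), ("person", 0.91)]      # (target category C_i, Pr(C_i))
ranked_labels = []
for category, prob in detections:
    if category in dictionary:                        # C_i == D(w_j)
        l_i = emb(torch.tensor(dictionary[category])) # target label L_i
        ranked_labels.append(l_i * prob)              # R_i = L_i * Pr(C_i)
print(torch.stack(ranked_labels).shape)               # torch.Size([2, 64])
```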
In one embodiment, the preset first attention calculation formula includes:

L_Att = σ(MHA(L, R_i, L))

where L_Att is the category embedding of the image to be detected, σ is the sigmoid activation function, L is the region characteristic, and R_i is the ranking of the i-th target label among all detected targets.
In one embodiment, the calculation formulas of the scaled dot-product attention functions include:

Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d) V_i

head_i = Attention(Q_i, K_i, V_i)

Q_i = W_q Q, V_i = W_v V, K_i = W_k K

MHA(Q, K, V) = Concat(head_1, …, head_h) W^O

where d is the dimension of the low-dimensional vectors input from the image to be detected, Q, K and V are respectively the query, key and value matrices of the preset label attention model, Concat is the serial splicing function, head_1, …, head_h are the h attention modules of the preset label attention model, W^O is the output weight matrix, and Attention is the attention function used to calculate the weight values.
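The following sketch implements the standard scaled dot-product attention and multi-head concatenation assumed above, applied as L_Att = σ(MHA(L, R, L)); the head count and dimensions are assumptions, not the patent's configuration:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = q.size(-1)
    weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return weights @ v

class TagAttention(nn.Module):
    # Minimal sketch of L_Att = sigmoid(MHA(L, R, L)).
    def __init__(self, d_model=64, heads=4):
        super().__init__()
        self.heads, self.d_head = heads, d_model // heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):
        n = q.size(0)
        split = lambda x: x.view(n, self.heads, self.d_head).transpose(0, 1)
        heads = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        concat = heads.transpose(0, 1).reshape(n, -1)  # Concat(head_1..head_h)
        return torch.sigmoid(self.w_o(concat))          # sigma(MHA(...))

L = torch.randn(10, 64)           # region characteristics of 10 detected frames
R = torch.randn(10, 64)           # ranked target labels R_i
l_att = TagAttention()(L, R, L)   # category embedding L_Att
print(l_att.shape)                # torch.Size([10, 64])
```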
In one embodiment, before the step S20, the method further includes:
acquiring a plurality of preset words prestored corresponding to different images in a preset corpus;
and calculating word frequency values of all preset words appearing in the preset corpus, and constructing the preset multi-dimensional dictionary according to the preset words with the word frequency values larger than a preset value.
The corpus of the MSCOCO data set is made up of a large number of image descriptions (explanatory texts) corresponding to images, where each image may correspond to multiple image descriptions. The image descriptions are composed of preset words; the preset words whose number of occurrences across all image descriptions is greater than a preset value (for example, 5) are used to construct the preset multi-dimensional dictionary, which serves as the reference for generating target labels. Building the dictionary from preset words occurring more than 5 times makes the generated image descriptions read more naturally.
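A minimal sketch of this dictionary construction, using toy captions in place of the MSCOCO corpus:

```python
from collections import Counter

# Toy captions standing in for the MSCOCO image descriptions.
captions = [
    "a dog catches a frisbee in the park",
    "a brown dog jumps for a frisbee",
    # ... one entry per image description in the corpus
]
frequency = Counter(word for caption in captions for word in caption.split())

PRESET_VALUE = 5   # keep only words occurring more than the preset value
dictionary = {word: idx for idx, (word, count) in enumerate(frequency.items())
              if count > PRESET_VALUE}
```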
Step S30: and inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder.
In step S30, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-head self-attention sublayer and a position feedforward sublayer, and the multi-head self-attention sublayer includes a plurality of parallel head modules. Step S30 includes:
b1, inputting the geometric features of the region features into a matrix of a first parallel head module in a first coding layer for weight calculation according to a preset second attention calculation formula, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
b3, repeating B1-B2 to obtain the result values of all the parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result values into the position feedforward sub-layer block for nonlinear transformation, and inputting the transformed result values into a second coding layer of the coder;
b4, repeating B1-B3 to obtain the output values of all the encoding layers.
In one embodiment, the inputting of the geometric features of the region characteristics into a matrix of the first parallel head module in the first encoding layer for weight calculation and outputting a first result value of the first parallel head module includes:
activating a scaling dot product attention function corresponding to the first parallel head module, and mapping the geometric characteristics of the region characteristics to a matrix of the first parallel head module for characteristic embedding;
and embedding the relation vector between the targets into different sub-modules of the multi-head self-attention sublayer for fusion by adjusting the weight parameters, and outputting a first result value of the first parallel head module.
In one embodiment, the preset second attention calculation formula includes:

h_i(Q, K, V, η) = Attention(Q, K, V, η) = softmax(η_i) V_i, i ∈ [1, N]

where η is the geometric feature to be fused into the image to be detected, h_i is the i-th parallel head module of the multi-head self-attention sublayer, Attention is the attention function, and Q, K and V are the query, key and value matrices of the multi-head self-attention sublayer.

The η in each h_i is calculated as:

η_(a,b) = (η^G_(a,b) · exp(η^A_(a,b))) / Σ_l (η^G_(a,l) · exp(η^A_(a,l)))

where η^G_(a,b) is the geometric relation between different targets of the image to be detected (formula (2)), η^A_(a,b) is the appearance-based attention weight between frames a and b, and η_(a,b) is the attention weight of the image to be detected after the geometric relation is fused in.
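A sketch of one such geometry-aware head, assuming the fusion rule above (the exponential-rescaling form used in object-relation transformers; whether the patent uses exactly this form is an assumption):

```python
import math
import torch

def geometry_aware_attention(q, k, v, eta_g):
    # One parallel head h_i = softmax(eta_i) V_i, where the geometric
    # feature eta^G rescales the appearance weights before normalization.
    d = q.size(-1)
    eta_a = q @ k.transpose(-2, -1) / math.sqrt(d)   # appearance weights
    weights = eta_g * torch.exp(eta_a)               # fold in the geometry
    eta = weights / weights.sum(-1, keepdim=True).clamp(min=1e-6)
    return eta @ v

n, d = 10, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
eta_g = torch.relu(torch.randn(n, n))   # pairwise geometric features eta^G
print(geometry_aware_attention(q, k, v, eta_g).shape)  # torch.Size([10, 64])
```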
In one embodiment, the preset splicing formula includes:

MHA(Ω) = Concat(h_1, …, h_h) W^O

where Ω is the initial value of the image to be detected, Concat is the splicing function, h_1, …, h_h are the h parallel head modules of the multi-head self-attention sublayer, and W^O is the weight matrix.
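As a sketch of steps B1-B4, one encoding layer can be assembled from a multi-head attention sublayer (which computes the parallel heads and applies W^O) and a position feedforward sublayer, then stacked; geometry fusion is omitted for brevity and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    # One encoding layer: parallel heads plus splicing inside
    # MultiheadAttention, followed by the position feedforward sublayer.
    def __init__(self, d_model=64, heads=4, d_ff=256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn, _ = self.mha(x, x, x)            # B1-B3: heads, then splicing
        x = self.norm1(x + attn)
        return self.norm2(x + self.ffn(x))     # nonlinear transformation

encoder = nn.Sequential(*[EncodingLayer() for _ in range(6)])  # identical layers
out = encoder(torch.randn(1, 10, 64))   # region characteristics -> output value
print(out.shape)  # torch.Size([1, 10, 64])
```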
Step S40: inputting the output value and the category embedding into a decoder of the preset transformation model for processing, to generate a description text of the image to be detected.
In step S40, the decoder includes a plurality of identical decoding layers, and each decoding layer includes a masked multi-head self-attention sublayer, a multi-head cross-attention sublayer and a position forward sublayer. Step S40 includes:
position embedding is carried out on the output value of the last encoding layer to serve as the input of the masked multi-head self-attention sublayer, and an input word vector is obtained;
embedding and inputting each output value, the input word vector and the category of the target into the multi-head cross attention sublayer for cross attention calculation to obtain a weight matrix;
and inputting the weight matrix into the position forward sublayer to calculate to generate a plurality of keywords, and splicing all the keywords to generate a description text of the image to be detected.
In this embodiment, the predicted words of the target categories are first position-coded, and the coded predicted words are input into the masked multi-head self-attention sublayer to obtain the word vectors of the weighted sentence; these word vectors serve as the V vector of the first multi-head cross-attention sublayer. The output value of the last encoder layer is converted into the Q and K vectors through two linear conversion layers, and multi-head attention is then computed with the V vector to obtain a V vector (i.e., the input word vector) fused with similarity information.
After the operations of all 6 decoder layers, and according to the vocabulary of the preset transformation model, which gathers the word information of the real sentences corresponding to each training picture, the output vector is passed through one linear layer and one softmax layer to obtain the next keyword.
All the keywords are spliced to generate a plurality of output sentences; a beam search method with the beam size set to 2 is adopted, the evaluation index score of each output sentence is obtained, and the sentence with the highest score is selected as the description text.
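A toy beam-search sketch with beam size 2, with a dummy step function standing in for the 6-layer decoder plus the linear and softmax layers (the token ids and vocabulary size are made up):

```python
import torch
import torch.nn.functional as F

def beam_search(step_fn, bos_id, eos_id, beam_size=2, max_len=20):
    # step_fn maps a partial token sequence to next-token log-probabilities;
    # in the described method it would be the decoder followed by the
    # linear and softmax layers.
    beams = [([bos_id], 0.0)]                  # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:           # finished sentences keep scores
                candidates.append((tokens, score))
                continue
            top = torch.topk(step_fn(tokens), beam_size)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [idx.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]   # highest-scoring sentence

vocab_size = 50   # made-up vocabulary size for the demo
step = lambda tokens: F.log_softmax(torch.randn(vocab_size), dim=-1)
print(beam_search(step, bos_id=1, eos_id=2))
```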
In one embodiment, the inputting of each output value, the input word vector and the category embedding of the target into the multi-head cross-attention sublayer for cross-attention calculation includes:
blending the output value with the category embedding of the target according to a preset blending calculation formula to obtain a blended value; and
performing weight calculation on the blended value and the input word vector according to a preset weight calculation formula to obtain the cross-attention matrix.
In one embodiment, the preset cross-attention calculation formula includes:

MA(X̄, Y) = Σ_(i=1..N) α_i ⊙ C(X̄^i, Y)

C(X̄^i, Y) = Attention(W_q Y, W_k X̄^i, W_v X̄^i)

where MA is the fusion connection attention module of the multi-head cross-attention sublayer, α_i is a weight matrix of the same size as the cross-attention result whose weights adjust the contribution degree of each layer of the encoder output, X̄^i is the blended value derived from the output of the i-th encoding layer, and Y is the input word vector.
In one embodiment, the preset blending calculation formula for the blended value includes:

X̄^i = X̃^i + L_Att

where X̄^i is the blended value, X̃^i is the output value of the i-th encoding layer, and L_Att is the category embedding of the target.
In one embodiment, the preset weight calculation formula includes:

α_i = σ(W_i [Y, C(X̄^i, Y)] + b_i)

where [·, ·] is the merge (concatenation) operation, σ is the sigmoid activation function, W_i ∈ R^(2d×d) is a weight matrix, and b_i is a learnable bias parameter.
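A sketch of this gating, assuming the concatenation-then-sigmoid form above with illustrative dimensions:

```python
import torch
import torch.nn as nn

d = 64
w_i = nn.Linear(2 * d, d)   # W_i in R^(2d x d), with learnable bias b_i

def gated_contribution(y, c_i):
    # alpha_i = sigma(W_i [Y, C(X_i, Y)] + b_i); the gate has the same size
    # as the cross-attention result and scales that layer's contribution.
    alpha_i = torch.sigmoid(w_i(torch.cat([y, c_i], dim=-1)))
    return alpha_i * c_i    # one term of the sum over encoding layers

y = torch.randn(10, d)      # input word vectors Y
c_1 = torch.randn(10, d)    # cross-attention result for encoding layer 1
print(gated_contribution(y, c_1).shape)  # torch.Size([10, 64])
```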
Since the sequence in the encoder is input all at once, all input information can be accessed when the multi-head self-attention sublayer is calculated. In the decoder, however, to ensure that at each time step only the sequence information output before the current time can be seen, the masked multi-head self-attention sublayer is introduced; the input word vector is the result of passing the input information through this masked sublayer.
Each multi-head cross-attention sublayer is followed by a regularization Add & Norm layer, a position forward sublayer (FFN layer) and another Add & Norm layer, which convert the input of each sublayer to have the same mean and variance, thereby accelerating convergence. The calculation formula is as follows:

AddNorm(X) = LayerNorm(X + Sublayer(X))
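A minimal sketch of the Add & Norm step, assuming the residual-plus-layer-normalization form above:

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    # Residual connection followed by layer normalization, rescaling each
    # sublayer input to a common mean and variance.
    def __init__(self, d_model=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)

x = torch.randn(10, 64)
print(AddNorm()(x, torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```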
and finally, determining the next output keyword according to the output characteristics of the last decoding layer, wherein the dimension of the characteristics of the output keyword is the same as the dimension of the vocabulary.
Referring to fig. 2, a functional block diagram of the image description generating apparatus 100 according to the present invention is shown.
The image description generation apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the image description generation apparatus 100 may include an identification module 110, a calculation module 120, an output module 130, and a generation module 140. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions of the modules/units are as follows:
the identification module 110: the system comprises a target detection module, a target detection module and a target detection module, wherein the target detection module is used for inputting an image to be detected into the preset target detection module for identification and outputting the regional characteristics of the image to be detected;
the identification module 20: the system is used for inputting the region characteristics into a preset label attention model for weight calculation and outputting the category embedding of the image to be detected;
the output module 130: the encoder is used for inputting the region characteristics into a preset transformation model for processing and outputting an output value of the encoder;
the generation module 140: and the decoder is used for embedding the output value and the category into the preset transformation model for processing, and generating a description text of the image to be detected.
In one embodiment, the inputting the image to be detected into a preset target detection model for recognition, and outputting the region characteristics of the image to be detected includes:
according to a preset geometric relation calculation formula, carrying out frame recognition on a target contained in the image to be detected to obtain frames of the target and a target category of each frame;
and adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
In one embodiment, the preset geometric relationship calculation formula includes:

ξ(a, b) = (log(|x_a − x_b| / w_a), log(|y_a − y_b| / h_a), log(w_b / w_a), log(h_b / h_a))^T

where ξ(a, b) is the regional relation characteristic of the image to be detected, (x_a, y_a) are the center-point coordinates of the a-th frame of the image to be detected, (x_b, y_b) are the center-point coordinates of the b-th frame of the image to be detected, (w_a, h_a) are the width and height of the a-th frame, and (w_b, h_b) are the width and height of the b-th frame.
In one embodiment, the inputting the region feature into a preset tag attention model for weight calculation and outputting the category embedding of the image to be detected includes:
matching the target category of the image to be detected with preset words of a preset multi-dimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
and coding and embedding the predictive words according to a preset first attention formula to obtain the category embedding of the image to be detected.
In an embodiment, the preset tag attention model includes a plurality of attention modules, each of which includes an independent scaled dot-product attention function, and the encoding and embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
and A3, repeating A1-A2 to obtain the weight values of all the attention modules, splicing all the weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
In one embodiment, the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-headed self-attention sublayer and a position feedforward sublayer, the multi-headed self-attention sublayer includes a plurality of parallel head modules, the inputting the region feature into the encoder of the preset transformation model for processing, and the outputting the output value of the encoder includes:
b1, inputting the geometric features of the region features into a matrix of a first parallel head module in a first coding layer for weight calculation according to a preset second attention calculation formula, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
b3, repeating B1-B2 to obtain the result values of all the parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result values into the position feedforward sub-layer block for nonlinear transformation, and inputting the transformed result values into a second coding layer of the coder;
b4, repeating B1-B3 to obtain the output values of all the encoding layers.
In one embodiment, the decoder includes a plurality of identical decoding layers, each decoding layer includes a masked multi-head self-attention sublayer, a multi-head cross-attention sublayer and a position forward sublayer, and the inputting of the output values and the category embedding into the decoder of the preset transformation model to generate the description text of the image to be detected includes:
position embedding is carried out on the output value of the last encoding layer to serve as the input of the masked multi-head self-attention sublayer, and an input word vector is obtained;
embedding and inputting each output value, the input word vector and the category of the target into the multi-head cross attention sublayer for cross attention calculation to obtain a weight matrix;
and inputting the weight matrix into the position forward sublayer to calculate to generate a plurality of keywords, and splicing all the keywords to generate a description text of the image to be detected.
Fig. 3 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the invention.
The electronic device 1 includes, but is not limited to: a memory 11, a processor 12, a display 13, and a network interface 14. The electronic device 1 is connected to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network, such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephony network.
The memory 11 includes at least one type of readable medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing the operating system installed in the electronic device 1 and various types of application software, such as the program code of the image description generation program 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used for controlling the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, to run the program code of the image description generation program 10.
The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch panel, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, e.g. displaying the results of data statistics.
The network interface 14 may optionally comprise a standard wired interface and a wireless interface (e.g., a Wi-Fi interface); the network interface 14 is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 3 only shows the electronic device 1 with the components 11-14 and the image description generation program 10, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
The electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.
In the above embodiment, the processor 12, when executing the image description generation program 10 stored in the memory 11, may implement the following steps:
inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected;
inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
and inputting the output value and the category embedding into a decoder of the preset transformation model for processing, to generate a description text of the image to be detected.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For detailed description of the above steps, please refer to the above description of fig. 2 regarding a functional block diagram of an embodiment of the image description generating apparatus 100 and fig. 1 regarding a flowchart of an embodiment of the image description generating method.
In addition, the embodiment of the present invention further provides a computer-readable medium, which may be non-volatile or volatile. The computer-readable medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable medium comprises a data storage area and a program storage area; the data storage area stores data created according to the use of blockchain nodes, and the program storage area stores the image description generation program 10, which, when executed by a processor, realizes the following operations:
inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected;
inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
and inputting the output value and the category embedding into a decoder of the preset transformation model for processing, to generate a description text of the image to be detected.
The specific implementation of the computer readable medium of the present invention is substantially the same as the specific implementation of the image description generation method, and is not repeated herein.
In another embodiment, in order to further ensure the privacy and security of all the data involved, all the data may be stored in a node of a blockchain; for example, the region characteristics and the category embedding may all be stored in blockchain nodes.
It should be noted that the blockchain in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods, each data block containing information about a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, herein are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a medium (such as ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An image description generation method, characterized in that the method comprises:
inputting an image to be detected into a preset target detection model for identification, and outputting the regional characteristics of the image to be detected;
inputting the region characteristics into a preset label attention model for weight calculation, and outputting the category embedding of the image to be detected;
inputting the region characteristics into an encoder of a preset transformation model for processing, and outputting an output value of the encoder;
and inputting the output value and the category embedding into a decoder of the preset transformation model for processing to generate a description text of the image to be detected.
2. The image description generation method of claim 1, wherein the inputting an image to be detected into a preset target detection model for recognition and outputting the region characteristics of the image to be detected comprises:
according to a preset geometric relation calculation formula, carrying out frame recognition on a target contained in the image to be detected to obtain frames of the target and a target category of each frame;
and adjusting the size of the frame to a preset range, and outputting the regional characteristics of the image to be detected.
3. The image description generation method according to claim 2, wherein the preset geometric relationship calculation formula includes:

ξ(a, b) = (log(|x_a − x_b| / w_a), log(|y_a − y_b| / h_a), log(w_b / w_a), log(h_b / h_a))^T

where ξ(a, b) is the regional relation characteristic of the image to be detected, (x_a, y_a) are the center-point coordinates of the a-th frame of the image to be detected, (x_b, y_b) are the center-point coordinates of the b-th frame of the image to be detected, (w_a, h_a) are the width and height of the a-th frame, and (w_b, h_b) are the width and height of the b-th frame.
4. The image description generation method of claim 1, wherein the inputting the region features into a preset label attention model for weight calculation and outputting the category embedding of the image to be detected comprises:
matching the target category of the image to be detected with preset words of a preset multi-dimensional dictionary according to a preset matching formula to obtain a predicted word and a target label of the target category;
and coding and embedding the predictive words according to a preset first attention formula to obtain the category embedding of the image to be detected.
5. The image description generation method of claim 4, wherein the preset tag attention model includes a plurality of attention modules, each of the attention modules includes an independent scaling dot product attention function, and the encoding embedding of the predicted word according to the preset first attention formula to obtain the category embedding of the image to be detected includes:
a1, inputting the predicted word into a matrix of a first attention module for weight calculation according to the preset first attention calculation formula and the scaling dot product attention function, and outputting a first weight value of the first attention module;
a2, inputting the first weight into a matrix of a second attention module for weight calculation, and outputting a second weight value of the second attention module;
and A3, repeating A1-A2 to obtain the weight values of all the attention modules, splicing all the weight values according to a series splicing function, and outputting the category embedding of the image to be detected.
6. The image description generation method according to claim 1, wherein the encoder includes a plurality of identical encoding layers, each encoding layer includes a multi-headed attention sublayer and a position feed-forward sublayer, the multi-headed attention sublayer includes a plurality of parallel head modules, the inputting the region feature into an encoder of a preset transformation model for processing, and outputting an output value of the encoder includes:
b1, inputting the geometric features of the region features into a matrix of a first parallel head module in a first coding layer for weight calculation according to a preset second attention calculation formula, and outputting a first result value of the first parallel head module;
b2, inputting the first result value into a matrix of a second parallel head module for weight calculation, and outputting a second result value of the second parallel head module;
b3, repeating B1-B2 to obtain the result values of all the parallel head modules, splicing all the result values according to a preset splicing formula, inputting the spliced result values into the position feedforward sub-layer block for nonlinear transformation, and inputting the transformed result values into a second coding layer of the coder;
b4, repeating B1-B3 to obtain the output values of all the encoding layers.
7. The image description generation method of claim 1, wherein the decoder includes a plurality of identical decoding layers, each decoding layer includes a masked multi-head self-attention sublayer, a multi-head cross-attention sublayer and a position forward sublayer, and the inputting of the output values and the category embedding into the decoder of the preset transformation model for processing to generate the description text of the image to be detected includes:
position embedding is carried out on the output value of the last encoding layer to serve as the input of the masked multi-head self-attention sublayer, and an input word vector is obtained;
embedding and inputting each output value, the input word vector and the category of the target into the multi-head cross attention sublayer for cross attention calculation to obtain a weight matrix;
and inputting the weight matrix into the position forward sublayer to calculate to generate a plurality of keywords, and splicing all the keywords to generate a description text of the image to be detected.
8. An image description generation apparatus, characterized in that the apparatus comprises:
an identification module, configured to input an image to be detected into a preset target detection model for identification and output the region features of the image to be detected;
a calculation module, configured to input the region features into a preset label attention model for weight calculation and output the category embedding of the image to be detected;
an output module, configured to input the region features into an encoder of a preset transformation model for processing and output an output value of the encoder;
and a generation module, configured to input the output value and the category embedding into a decoder of the preset transformation model for processing to generate a description text of the image to be detected.
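As one reading of how the four claimed modules compose, a thin wrapper is sketched below; each injected callable is a hypothetical stand-in for the corresponding preset model, not the patented apparatus itself.

class ImageDescriptionGenerator:
    # Composes the identification, calculation, output and generation modules.
    def __init__(self, detector, label_attention, encoder, decoder):
        self.detector = detector                 # identification module
        self.label_attention = label_attention   # calculation module
        self.encoder = encoder                   # output module
        self.decoder = decoder                   # generation module

    def describe(self, image):
        region_features = self.detector(image)                # region features
        category_emb = self.label_attention(region_features)  # category embedding
        output_values = self.encoder(region_features)         # encoder output values
        return self.decoder(output_values, category_emb)      # description text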
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the image description generation method of any one of claims 1 to 7.
10. A computer-readable storage medium storing an image description generation program that, when executed by a processor, implements the image description generation method of any one of claims 1 to 7.
CN202210423256.8A 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium Active CN114743018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423256.8A CN114743018B (en) 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114743018A true CN114743018A (en) 2022-07-12
CN114743018B CN114743018B (en) 2024-05-31

Family

ID=82284146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210423256.8A Active CN114743018B (en) 2022-04-21 2022-04-21 Image description generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114743018B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019042244A1 (en) * 2017-08-30 2019-03-07 腾讯科技(深圳)有限公司 Image description generation method, model training method and device, and storage medium
CN109753981A (en) * 2017-11-06 2019-05-14 彼乐智慧科技(北京)有限公司 A kind of method and device of image recognition
US20200151448A1 (en) * 2018-11-13 2020-05-14 Adobe Inc. Object Detection In Images
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN110472688A (en) * 2019-08-16 2019-11-19 北京金山数字娱乐科技有限公司 Method and device of image description, and training method and device of image description model
KR102225024B1 (en) * 2019-10-24 2021-03-08 연세대학교 산학협력단 Apparatus and method for image inpainting
CA3068891A1 (en) * 2020-01-17 2021-07-17 Element Ai Inc. Method and system for generating a vector representation of an image
CN111639594A (en) * 2020-05-29 2020-09-08 苏州遐迩信息技术有限公司 Training method and device of image description model
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113222916A (en) * 2021-04-28 2021-08-06 北京百度网讯科技有限公司 Method, apparatus, device and medium for detecting image using target detection model
CN113946706A (en) * 2021-05-20 2022-01-18 广西师范大学 Image description generation method based on reference preposition description
CN113591967A (en) * 2021-07-27 2021-11-02 南京旭锐软件科技有限公司 Image processing method, device and equipment and computer storage medium
CN113609326A (en) * 2021-08-25 2021-11-05 广西师范大学 Image description generation method based on external knowledge and target relation
CN114266905A (en) * 2022-01-11 2022-04-01 重庆师范大学 Image description generation model method and device based on Transformer structure and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NIU Bin et al.: "An Image Description Method Based on Attention Mechanism and Multimodality", Journal of Liaoning University (Natural Science Edition), No. 01, 15 February 2019 (2019-02-15), pages 38-45 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821271A (en) * 2022-05-19 2022-07-29 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium
CN114821271B (en) * 2022-05-19 2022-09-16 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium

Also Published As

Publication number Publication date
CN114743018B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN111241304B (en) Answer generation method based on deep learning, electronic device and readable storage medium
CN111695439B (en) Image structured data extraction method, electronic device and storage medium
CN112308237B (en) Question-answer data enhancement method and device, computer equipment and storage medium
CN113792741B (en) Character recognition method, device, equipment and storage medium
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN110427480B (en) Intelligent personalized text recommendation method and device and computer readable storage medium
CN110929524A (en) Data screening method, device, equipment and computer readable storage medium
CN110276382B (en) Crowd classification method, device and medium based on spectral clustering
US20230334893A1 (en) Method for optimizing human body posture recognition model, device and computer-readable storage medium
CN111680480A (en) Template-based job approval method and device, computer equipment and storage medium
CN114898219B (en) SVM-based manipulator touch data representation and identification method
CN113886550A (en) Question-answer matching method, device, equipment and storage medium based on attention mechanism
CN114880449B (en) Method and device for generating answers of intelligent questions and answers, electronic equipment and storage medium
CN113807728A (en) Performance assessment method, device, equipment and storage medium based on neural network
CN112036189A (en) Method and system for recognizing gold semantic
CN116821373A (en) Map-based prompt recommendation method, device, equipment and medium
CN114743018B (en) Image description generation method, device, equipment and medium
CN114399775A (en) Document title generation method, device, equipment and storage medium
CN114417785A (en) Knowledge point annotation method, model training method, computer device, and storage medium
CN113836929A (en) Named entity recognition method, device, equipment and storage medium
CN113761375A (en) Message recommendation method, device, equipment and storage medium based on neural network
CN113868419A (en) Text classification method, device, equipment and medium based on artificial intelligence
CN110362681B (en) Method, device and storage medium for identifying repeated questions of question-answering system
CN116702761A (en) Text error correction method, device, equipment and storage medium
CN114358579A (en) Evaluation method, evaluation device, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant