WO2023134073A1 - Artificial intelligence-based image description generation method and apparatus, device, and medium - Google Patents

Artificial intelligence-based image description generation method and apparatus, device, and medium

Info

Publication number
WO2023134073A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
feature
target
analyzed
Prior art date
Application number
PCT/CN2022/090158
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023134073A1 publication Critical patent/WO2023134073A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the technical field of knowledge representation and reasoning in artificial intelligence, and in particular to an artificial intelligence-based image description generation method, apparatus, device, and medium.
  • Image description is the task of automatically generating a piece of descriptive text for an image.
  • An image description model must not only detect the objects in the image, recognize the text in the image, and understand the relationships between objects, but also describe the image information accurately in language.
  • The inventor realized that the text in an image is crucial for humans to understand the image's information.
  • When a model tries to understand an image that contains text, it must go beyond target detection: it must also recognize the image's text and interpret that text in the context of its surroundings.
  • For example, suppose an image shows a red ring on a brick wall, with a text area across the ring's diameter containing the words "Mornington Crescent". A current image description model can recognize "there is a sign on the brick wall", but cannot reach the understanding "Mornington Crescent is written in a red circle on the wall". Because current image description models cannot understand the text in an image and relate it to its surroundings, they cannot describe the rich information in the image in full.
  • The main purpose of this application is to provide an artificial intelligence-based image description generation method, apparatus, device, and medium, aiming to solve the technical problem in the prior art that, during image description, the image description model cannot understand the text in an image and relate it to its surroundings, and therefore cannot describe the rich information in the picture in detail.
  • The present application proposes an artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
  • an image acquisition module, configured to acquire an image to be described;
  • a text region detection module, configured to perform text region detection according to the image to be described;
  • a text recognition module, configured to perform text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • a target feature extraction module, configured to extract target features according to the image to be described;
  • an image description generation module, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • The present application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
  • based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In the artificial intelligence-based image description generation method, apparatus, device, and medium of the present application, the method first performs text region detection according to the image to be described and performs text recognition on each of the text regions according to the image to be described to obtain the texts to be analyzed, then performs target feature extraction according to the image to be described, and finally generates an image description based on a multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • FIG. 1 is a schematic flowchart of an artificial intelligence-based image description generation method according to an embodiment of the present application;
  • FIG. 2 is a schematic block diagram of an artificial intelligence-based image description generation apparatus according to an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • Referring to FIG. 1, an embodiment of the present application provides an artificial intelligence-based image description generation method, the method comprising:
  • S3: performing text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • S5: based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • For S1, the image to be described may be input by the user, retrieved from a database, or obtained from a third-party application system.
  • The image to be described is an image that needs to be described in language.
  • A text region is the region corresponding to the text box of a piece of text in the image to be described.
  • The image to be described corresponds to one or more text regions.
  • The text to be analyzed includes characters and/or symbols.
  • For S4, a model obtained based on the MASK-RCNN (a semantic segmentation algorithm) network is used to extract the target features of each target from the image to be described.
  • Targets include objects and the background.
  • The target features include a target appearance feature, target position information, and a target mask map.
  • The target appearance feature is the appearance feature of the target, that is, the image feature extracted from the image block corresponding to the target.
  • The target position information is the position information of the target in the image to be described.
  • The target mask map is a mask map (that is, a MASK map) generated according to the target appearance feature.
  • For S5, based on the multi-modal feature fusion method, the various features corresponding to each of the texts to be analyzed and the various features corresponding to each of the target features are first mapped into a learned common vector embedding space. The mapped features are then input into a model obtained based on a multi-layer Transformer network to predict words within the range of a predefined code table. Finally, either a recognized token (a word in the text to be analyzed) or a word predicted within the range of the predefined code table is selected as the current output; the output words are combined into a sentence in output order, and the sentence is taken as the image description result. During cyclic decoding, the model obtained based on the multi-layer Transformer network works auto-regressively, taking the previously output word as part of the current input, as sketched below. That is to say, the image description result is a sentence that describes the image to be described in language.
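  • As an illustration of this auto-regressive loop, a minimal sketch follows; `decode_step` is a hypothetical stand-in for one pass of the multi-layer Transformer over the fused multimodal features, not the patent's actual model.

```python
def generate_description(decode_step, max_len=30, bos="<bos>", eos="<eos>"):
    # Auto-regressive decoding: the previously output word becomes part of
    # the current input; decoding stops at the end token or at max length.
    words, prev = [], bos
    while len(words) < max_len:
        word = decode_step(prev)  # one decoding pass (stand-in for the model)
        if word == eos:
            break
        words.append(word)
        prev = word
    return " ".join(words)

# Toy stand-in for the model: emits a fixed sentence, then the end token.
toy = iter("there is a red ring reading Mornington Crescent".split() + ["<eos>"])
print(generate_description(lambda prev: next(toy)))
```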
  • In one embodiment, the above step of performing text region detection according to the image to be described includes:
  • S24: performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
  • S25: performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
  • S26: performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map;
  • The existing DB algorithm directly predicts the probability map of the text region and then judges whether each pixel is text or background according to a manually set threshold. This processing is too coarse: boundaries are often hard to detect, and a single threshold makes the boundaries insufficiently accurate. Moreover, natural scenes contain a large amount of dense text, and insufficiently accurate boundaries make dense text detection perform poorly.
  • To solve the above problems, this embodiment performs down-sampling, up-sampling, and cascade processing on the image to be described, then performs text probability map prediction and dynamic threshold map prediction on the feature layer obtained by the cascade processing, and finally performs a differentiable binarization calculation according to the text probability map and the dynamic threshold map and generates the text regions according to the calculation result, so that more attention is paid to the boundary regions.
  • Differentiable binarization moves the threshold setting into the decision layer of the model, letting the model itself judge the threshold for each pixel. This improves the accuracy of boundary recognition, improves dense text detection, and improves the accuracy of image description.
  • For S21, a preset image segmentation model is used to perform down-sampling on the image to be described, and the features obtained by down-sampling are taken as the down-sampled features.
  • Optionally, the preset image segmentation model is a model trained based on the FCN (Fully Convolutional Network) network. It can be understood that the preset image segmentation model may also be a model trained based on another neural network, which is not limited here.
  • For S22, the preset image segmentation model is used to perform up-sampling on the down-sampled features, and the features obtained by up-sampling are taken as the up-sampled features.
  • By first down-sampling and then up-sampling with the preset image segmentation model, high-level features and low-level features can both be exploited, which facilitates detecting text regions of various scales: the high-level receptive field is large and is used to detect large text regions, while the low levels are used to detect small text regions. The resulting up-sampled features thus contain the rich semantic information of the high levels and the rich representation information of the low levels, making the feature expression more complete.
  • For S23, the individual up-sampled features are cascaded into one feature layer, and the feature layer obtained by the cascade processing is taken as the feature layer to be analyzed.
  • For S24, a preset text probability map prediction model is used to predict the text probability map according to the feature layer to be analyzed, and the predicted text probability map is taken as the target text probability map.
  • The text probability map is a map formed by calculating the probability that each pixel belongs to text.
  • The preset text probability map prediction model is a model obtained by training a neural network with a plurality of first training samples.
  • Each first training sample includes a first image sample and a first calibration value for each pixel.
  • The first image sample is an image containing text.
  • The first calibration value of each pixel is a calibration value of whether that pixel in the first image sample belongs to a text region.
  • The first calibration value of each pixel takes one of two values: 0 (not a text region) and 1 (a text region).
  • For S25, a preset dynamic threshold map prediction model is used to predict the dynamic threshold map according to the feature layer to be analyzed, and the predicted dynamic threshold map is taken as the target dynamic threshold map.
  • The dynamic threshold map is a map formed from the dynamic threshold of each pixel.
  • The preset dynamic threshold map prediction model is a model obtained by training a neural network with a plurality of second training samples.
  • Each second training sample includes a second image sample and a second calibration value for each pixel.
  • The second image sample is an image containing text.
  • The second calibration value of each pixel is a calibration value of whether that pixel in the second image sample lies on the boundary of a text region.
  • The calibration value of each pixel takes one of two values: 0 (not a text region boundary) and 1 (a text region boundary).
  • For S26, the differentiable binarization calculation formula for the pixel in row i, column j of the differentiable binarization map is:
  • B_ij = 1 / (1 + e^(-k * (P_ij - T_ij)))
  • where P_ij is the probability of the pixel in row i, column j of the target text probability map; T_ij is the threshold of the pixel in row i, column j of the target dynamic threshold map; k is an amplification factor and a constant; and e is the base of the natural logarithm.
  • For S27, pixels whose value is greater than a preset threshold are taken as text region pixels, and pixels whose value is less than or equal to the preset threshold are taken as non-text pixels; adjacent text region pixels are then connected into graphic blocks, and each graphic block is taken as one text region, as sketched below.
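  • The following is a minimal NumPy sketch of steps S26 and S27 under stated assumptions: k = 50 and the 0.5 cut-off are illustrative values not taken from the patent, and `scipy.ndimage.label` stands in for connecting adjacent text pixels into graphic blocks.

```python
import numpy as np
from scipy.ndimage import label

def differentiable_binarization(P, T, k=50.0):
    """B_ij = 1 / (1 + exp(-k * (P_ij - T_ij))); P is the target text
    probability map and T is the target dynamic threshold map."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

def text_regions(P, T, preset_threshold=0.5):
    B = differentiable_binarization(P, T)
    mask = B > preset_threshold          # text vs. non-text pixels
    blocks, n = label(mask)              # connect adjacent text pixels
    return [np.argwhere(blocks == i + 1) for i in range(n)]  # one region per block
```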
  • In one embodiment, the above step of performing text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed, includes:
  • S34: inputting each of the time-series feature atlases into a model obtained based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each of the text regions, where each preset label in a preset label dictionary is used as a predicted label of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, and the preset labels include text and a placeholder.
  • In this embodiment, feature maps of a preset height are extracted from the text image block corresponding to each text region and sorted by position to generate a time-series feature atlas, and each time-series feature atlas is then input into the model obtained based on the recurrent neural network for text recognition.
  • Because feature extraction reduces the width of the feature maps, model training is made easier. The width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right: the more segments, the less likely a character is missed, but too many segments easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated, which affects the accuracy of text recognition. In this embodiment the preset labels include text and a placeholder, so the placeholder is used as a preset label and identical adjacent characters are separated by the placeholder; during decoding, if there is no placeholder code between the codes of identical characters, only one of the identical characters is kept, thereby further improving the accuracy of text recognition.
  • A model obtained based on a convolutional neural network is used to extract a feature map of the preset height from each of the text image blocks; that is to say, the height of the feature map equals the preset height, and the extracted feature maps are taken as a feature map set.
  • CNN: convolutional neural network.
  • The feature maps in each feature map set are sorted by position, so as to obtain a time-series feature atlas whose time order is the position order.
  • Each of the time-series feature atlases is input into the model obtained based on the recurrent neural network for text recognition; because each preset label in the preset label dictionary is used as a predicted label of the output dimension of the embedding layer of that model, the range of the text recognition result is the preset labels in the preset label dictionary.
  • Since the width of the image after feature extraction by the convolutional-neural-network-based model equals the number of times the image is predicted from left to right, more segments make missed characters less likely, but too many segments easily cause one character to be recognized multiple times, making it impossible to judge whether characters are genuinely repeated, which affects the accuracy of text recognition.
  • Therefore, the placeholder is used as a preset label and identical adjacent characters are separated by the placeholder; during decoding, when there is no placeholder code between the codes of identical characters, only one of them is kept, thereby further improving the accuracy of text recognition, as sketched below.
  • The preset label dictionary includes preset labels and codes, with a one-to-one correspondence between preset labels and codes.
  • Optionally, the model obtained based on the recurrent neural network is a model obtained based on the bidirectional LSTM (long short-term memory) network.
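  • The placeholder-based decoding can be pictured with a short sketch: identical neighbouring predictions are merged, then placeholders are dropped, so a genuinely doubled character survives only when the model emits a placeholder between its two occurrences. The frame labels below are a made-up example.

```python
PLACEHOLDER = "-"

def decode(frame_labels):
    # Merge identical neighbouring predictions, then drop placeholders.
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev:
            out.append(lab)
        prev = lab
    return "".join(c for c in out if c != PLACEHOLDER)

# The "oo" survives because a placeholder separates the two o's;
# plain repeats like "bb" and "kk" collapse to one character.
assert decode(list("bb-oo-o-kk")) == "book"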
  • In one embodiment, the above step of extracting target features according to the image to be described includes:
  • S44: performing classification prediction according to the region image features to obtain a classification prediction result, where the classification labels of the classification prediction include a plurality of object labels and one background label;
  • S45: performing position regression processing according to each target appearance feature to obtain target position information;
  • In this embodiment, each target feature is extracted using a model obtained based on ResNet-FPN, the ROI Align method, and a model obtained based on a fully convolutional network. The target features include the target appearance feature, the target position information, and the target mask map, which provide the basis for the subsequent understanding of the relationships between targets and for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • FPN: Feature Pyramid Network.
  • VGG: a deep convolutional neural network.
  • ResNet: residual network.
  • A model obtained based on a region proposal network is used to extract target candidate regions from the image to be described; the model adopts frames of preset sizes (also called a priori frames, or anchors) to predict the candidate regions belonging to targets, and each predicted candidate region is taken as a target candidate region.
  • Optionally, the model obtained based on the region proposal network is a model obtained based on the RPN (Region Proposal Network).
  • An image feature map corresponding to each of the target candidate regions is extracted from the image feature map to be analyzed, and the image features corresponding to each of the target candidate regions are taken as target appearance features.
  • ROI Align is a regional feature aggregation method.
  • The ROI Align method first maps the RoI (the solid-line part) onto the feature map (the dotted-line part); it then divides the mapped region into 2*2 units, with each unit assumed to contain four sampling points. Because the sampling points do not have integer coordinates, bilinear interpolation is needed to estimate the value at each sampling point; finally, a pooling calculation is performed in each unit, each unit yields one value, and a 2*2 feature map is obtained, as sketched below.
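  • A small NumPy sketch of this 2*2 ROI Align follows; the average pooling over the four sampling points and the single-channel feature map are simplifying assumptions for illustration.

```python
import numpy as np

def bilinear(fmap, y, x):
    # Bilinearly interpolate a single-channel feature map at float (y, x).
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align_2x2(fmap, roi):
    # roi = (y_min, x_min, y_max, x_max) mapped into feature-map coordinates.
    y_min, x_min, y_max, x_max = roi
    uh, uw = (y_max - y_min) / 2.0, (x_max - x_min) / 2.0  # 2*2 units
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            # four regularly spaced, non-integer sampling points per unit
            pts = [(y_min + (i + fy) * uh, x_min + (j + fx) * uw)
                   for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
            out[i, j] = np.mean([bilinear(fmap, y, x) for y, x in pts])
    return out

fm = np.arange(36, dtype=float).reshape(6, 6)
print(roi_align_2x2(fm, (1.0, 1.0, 4.2, 4.2)))  # a 2*2 pooled feature map
```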
  • For S44, the classification labels of the classification prediction include a plurality of object labels and one background label; classification prediction is performed according to each of the region image features so as to predict whether the image corresponding to the region image features is background or an object. It can be understood that predicting an object here means predicting what kind of object it is.
  • FC layer: fully connected layer.
  • softmax: a regression/classification activation function.
  • For S45, target position regression processing is performed according to each target appearance feature, and the predicted position information is taken as the target position information.
  • The target position information is the position information of the detection frame corresponding to the target.
  • A model obtained based on the fully convolutional network is used to perform pixel-by-pixel mask prediction according to each of the target appearance features, and the target mask map is obtained from the mask prediction result.
  • The target appearance feature, target position information, and target mask map corresponding to the same target candidate region are taken together as one target feature; that is, the target appearance feature, target position information, and target mask map in one target feature all correspond to the same target. It can be understood that there may be one or more target features.
  • In one embodiment, the above step of generating an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain the image description result, includes:
  • S51: performing feature fusion according to each of the target features to obtain first fusion features;
  • S54: using a model obtained based on a dynamic pointer network to generate an image description according to the word prediction result and each text to be analyzed, to obtain the image description result.
  • In this embodiment, feature fusion is first performed on the target features, then feature fusion is performed on the texts to be analyzed; word prediction is then performed on the two fusion results using a model obtained based on an iterative Transformer, and finally image description generation is performed according to the word prediction result and the texts to be analyzed. This realizes the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description, thereby expressing the rich information of the image in detail and in full and improving the accuracy of the image description.
  • The input of the model obtained based on the iterative Transformer includes the first fusion features, the second fusion features, and the word predicted at the previous moment. For example, when the third word is predicted, the predicted second word is input into the model as the word predicted at the previous moment.
  • The Transformer structure is composed of an encoder and a decoder, and the Transformer uses the attention mechanism to solve natural-language translation problems.
  • The word prediction result includes one or more words.
  • Optionally, the model obtained based on the dynamic pointer network is a model obtained based on the DPN (Dynamic Pointer Network).
  • The words selected by the model obtained based on the dynamic pointer network are used as the words of the image description; it can be understood that the selected words are spliced into a sentence, and the spliced sentence is taken as the image description result, as sketched below.
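  • A sketch of the word selection is given below; the linear scoring of vocabulary words and the dot-product scoring of OCR-token embeddings are illustrative assumptions rather than the patent's exact dynamic pointer network formulation.

```python
import numpy as np

def pick_word(dec_state, W_vocab, ocr_embs, vocab, ocr_tokens):
    # Choose the current output: either a word from the predefined code
    # table or a copied OCR token (the dynamic-pointer idea, sketched).
    vocab_scores = W_vocab @ dec_state  # score every code-table word
    copy_scores = ocr_embs @ dec_state  # score every recognized OCR token
    best = int(np.argmax(np.concatenate([vocab_scores, copy_scores])))
    return vocab[best] if best < len(vocab) else ocr_tokens[best - len(vocab)]

rng = np.random.default_rng(0)
d = 8
vocab = ["a", "sign", "on", "the", "wall"]
ocr_tokens = ["Mornington", "Crescent"]
print(pick_word(rng.normal(size=d), rng.normal(size=(len(vocab), d)),
                rng.normal(size=(len(ocr_tokens), d)), vocab, ocr_tokens))
```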
  • In one embodiment, the above step of performing feature fusion according to each of the target features to obtain the first fusion features includes:
  • S512: encoding the target position information in the target feature to be processed, to obtain a target position code;
  • S513: linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, to obtain the first fusion feature corresponding to the target feature to be processed;
  • S514: repeating the step of acquiring one target feature as the target feature to be processed until the acquisition of the target features is completed.
  • In this embodiment, the various features of each target feature are linearly transformed to map into the vector embedding space of the first preset dimension, which provides the basis for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • For S511, any one of the target features is acquired, and the acquired target feature is taken as the target feature to be processed.
  • For S512, the target position code is expressed as [x_min / W_im, y_min / H_im, x_max / W_im, y_max / H_im], where x_min is the minimum horizontal pixel position, y_min is the minimum vertical pixel position, x_max is the maximum horizontal pixel position, y_max is the maximum vertical pixel position, W_im is the image width, and H_im is the image height.
  • For S513, the first fusion feature corresponding to the target feature to be processed is computed as LN(W1 * x_app) + LN(W2 * x_pos), where LN denotes linear normalization processing, W1 and W2 are preset constants, x_app is the target appearance feature, and x_pos is the target position code.
  • Steps S511 to S514 are executed repeatedly until the acquisition of the target features is completed.
  • When the acquisition of the target features is completed, the feature fusion of each target feature is complete, as sketched below.
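  • A minimal sketch of this first fusion follows; the 2048-dimensional appearance feature, the 768-dimensional embedding space, the random projection matrices, and the zero-mean/unit-variance form of LN are all illustrative assumptions, not values from the patent.

```python
import numpy as np

D = 768  # first preset dimension (illustrative choice)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(D, 2048)) * 0.01  # projects the target appearance feature
W2 = rng.normal(size=(D, 4)) * 0.01     # projects the target position code

def ln(v):
    # Stand-in for the patent's LN (linear normalization) step.
    return (v - v.mean()) / (v.std() + 1e-6)

def fuse_target(appearance, box, im_w, im_h):
    x_min, y_min, x_max, y_max = box
    pos = np.array([x_min / im_w, y_min / im_h, x_max / im_w, y_max / im_h])
    return ln(W1 @ appearance) + ln(W2 @ pos)  # first fusion feature

print(fuse_target(rng.normal(size=2048), (40, 60, 200, 180), 640, 480).shape)
```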
  • In one embodiment, the above step of performing feature fusion on each of the texts to be analyzed according to the image to be described, to obtain the second fusion features, includes:
  • S522: performing word vector encoding of a second preset dimension according to the text to be processed, to obtain a text block word vector;
  • S523: performing image feature extraction from the image to be described according to the text to be processed, to obtain a text block image feature;
  • S524: performing text encoding of a third preset dimension on the text to be processed, to obtain a text block encoding feature;
  • S525: determining the position information of the text to be processed according to the image to be described, to obtain text position information;
  • S526: encoding the text position information to obtain a text position code;
  • S527: linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, to obtain the second fusion feature corresponding to the text to be processed;
  • In this embodiment, the various features of each text to be analyzed are linearly transformed to map into the vector embedding space of the fourth preset dimension, which provides the basis for the multi-modal-feature-fusion-based method of understanding the recognized text of the image in the context of its surroundings to generate an image description.
  • The second preset dimension is set to 300.
  • A FastText word vector is a word vector generated by FastText.
  • FastText is a word vector calculation and text classification tool.
  • The PHOC (Pyramidal Histogram of Characters) encoding method is used to perform the text encoding of the third preset dimension on the text to be processed, and the encoded data is taken as the text block encoding feature.
  • The third preset dimension is set to 604 dimensions.
  • The text position information is the position information, in the image to be described, of the text region corresponding to the text to be processed.
  • For S527, the second fusion feature corresponding to the text to be processed is computed as LN(W3 * x_w + W4 * x_img + W5 * x_phoc) + LN(W6 * x_pos), where LN denotes linear normalization processing; W3, W4, W5, and W6 are preset constants; x_w is the text block word vector; x_img is the text block image feature; x_phoc is the text block encoding feature; and x_pos is the text position code.
  • Steps S521 to S528 are executed repeatedly until the acquisition of the texts to be analyzed is completed.
  • When the acquisition of the texts to be analyzed is completed, the feature fusion of each text to be analyzed is complete, as sketched below.
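  • The text-side fusion can be sketched the same way; the 300-dimensional FastText vector and the 604-dimensional PHOC encoding follow the embodiment above, while the image-feature dimension, the target dimension, and the normalization form are illustrative assumptions.

```python
import numpy as np

D = 768  # fourth preset dimension (illustrative choice)
rng = np.random.default_rng(1)
W3 = rng.normal(size=(D, 300)) * 0.01   # text block word vector (FastText, 300-d)
W4 = rng.normal(size=(D, 2048)) * 0.01  # text block image feature (dim assumed)
W5 = rng.normal(size=(D, 604)) * 0.01   # text block encoding feature (PHOC, 604-d)
W6 = rng.normal(size=(D, 4)) * 0.01     # text position code

def ln(v):
    # Stand-in for the patent's LN (linear normalization) step.
    return (v - v.mean()) / (v.std() + 1e-6)

def fuse_text(word_vec, img_feat, phoc, pos):
    # Second fusion feature for one text block.
    return ln(W3 @ word_vec + W4 @ img_feat + W5 @ phoc) + ln(W6 @ pos)

print(fuse_text(rng.normal(size=300), rng.normal(size=2048),
                rng.normal(size=604), rng.normal(size=4)).shape)
```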
  • Referring to FIG. 2, the present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
  • an image acquisition module 100, configured to acquire an image to be described;
  • a text region detection module 200, configured to perform text region detection according to the image to be described;
  • a text recognition module 300, configured to perform text recognition on each of the text regions according to the image to be described, to obtain the texts to be analyzed;
  • a target feature extraction module 400, configured to extract target features according to the image to be described;
  • an image description generation module 500, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each text to be analyzed, and each target feature, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • An embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer programs, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium.
  • The database of the computer device is used to store data for the artificial intelligence-based image description generation method.
  • The network interface of the computer device is used to communicate with an external terminal via a network connection. When the computer program is executed by the processor, an artificial intelligence-based image description generation method is implemented, the method comprising:
  • based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, an artificial intelligence-based image description generation method is implemented, the method comprising:
  • based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
  • In the artificial intelligence-based image description generation method executed above, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, an image description is generated based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, and an image description result is obtained.
  • By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of knowledge representation and reasoning in artificial intelligence and discloses an artificial intelligence-based image description generation method and apparatus, a device, and a medium. The method comprises: obtaining an image to be described; performing text region detection according to the image to be described; performing text recognition on each text region according to the image to be described to obtain texts to be analyzed; performing target feature extraction according to the image to be described; and performing image description generation on the basis of a multi-modal feature fusion method according to the image to be described, each text to be analyzed, and each target feature, to obtain an image description result. By understanding the recognized text of an image in the context of its surroundings on the basis of multi-modal feature fusion, an image description is generated, so that the abundant information in the image is completely expressed in language and the accuracy of image description is improved.

Description

Artificial intelligence-based image description generation method, apparatus, device, and medium
This application claims priority to the Chinese patent application No. 202210028089.7, filed with the China Patent Office on January 11, 2022 and entitled "Artificial intelligence-based image description generation method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of knowledge representation and reasoning in artificial intelligence, and in particular to an artificial intelligence-based image description generation method, apparatus, device, and medium.
Background
Image description is the task of automatically generating a piece of descriptive text for an image. An image description model must not only detect the objects in the image, recognize the text in the image, and understand the relationships between objects, but also describe the image information accurately in language. The inventor realized that the text in an image is crucial for humans to understand the image's information: when a model tries to understand an image that contains text, it must go beyond target detection and also understand the image's text in the context of its surroundings. For example, suppose an image shows a red ring on a brick wall, with a text area across the ring's diameter containing the words "Mornington Crescent". A current image description model can recognize "there is a sign on the brick wall", but cannot reach the understanding "Mornington Crescent is written in a red circle on the wall". Because current image description models cannot understand the text in an image and relate it to its surroundings, they cannot fully describe the rich information in the image.
Technical Problem
The main purpose of this application is to provide an artificial intelligence-based image description generation method, apparatus, device, and medium, aiming to solve the technical problem in the prior art that, during image description, the image description model cannot understand the text in an image and relate it to its surroundings, and therefore cannot describe the rich information in the picture in detail.
Technical Solution
The present application proposes an artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes an artificial intelligence-based image description generation apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image to be described;
a text region detection module, configured to perform text region detection according to the image to be described;
a text recognition module, configured to perform text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
a target feature extraction module, configured to extract target features according to the image to be described;
an image description generation module, configured to generate an image description based on the multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
The present application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above artificial intelligence-based image description generation method, the method comprising:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
performing target feature extraction according to the image to be described;
based on the multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
Beneficial Effects
In the artificial intelligence-based image description generation method, apparatus, device, and medium of the present application, the method first performs text region detection according to the image to be described and performs text recognition on each of the text regions according to the image to be described to obtain the texts to be analyzed, then performs target feature extraction according to the image to be described, and finally generates an image description based on a multi-modal feature fusion method according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result. By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an artificial intelligence-based image description generation method according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of an artificial intelligence-based image description generation apparatus according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The realization of the objectives, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to FIG. 1, an embodiment of the present application provides an artificial intelligence-based image description generation method, the method comprising:
S1: acquiring an image to be described;
S2: performing text region detection according to the image to be described;
S3: performing text recognition on each of the text regions according to the image to be described, to obtain texts to be analyzed;
S4: performing target feature extraction according to the image to be described;
S5: based on a multi-modal feature fusion method, generating an image description according to the image to be described, each of the texts to be analyzed, and each of the target features, to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described, and text recognition is performed on each text region according to the image to be described to obtain the texts to be analyzed; target feature extraction is then performed according to the image to be described; finally, based on the multi-modal feature fusion method, an image description is generated according to the image to be described, each text to be analyzed, and each target feature, and an image description result is obtained. By understanding the recognized text of the image in the context of its surroundings through multi-modal feature fusion, an image description is generated, so that the rich information of the image is expressed in language in detail and in full, improving the accuracy of the image description.
For S1, the image to be described may be input by the user, retrieved from a database, or obtained from a third-party application system.
The image to be described is an image that needs to be described in language.
For S2, a model obtained based on the DB (Differentiable Binarization) algorithm is used to perform text region detection on the image to be described.
A text region is the region corresponding to the text box of a piece of text in the image to be described.
It can be understood that the image to be described corresponds to one or more text regions.
For S3, a model obtained based on the CRNN (An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition) network is used to perform text recognition on each of the text regions according to the image to be described, and the text corresponding to each text region is taken as one text to be analyzed.
That is to say, the texts to be analyzed correspond one-to-one with the text regions.
The text to be analyzed includes characters and/or symbols.
For S4, a model obtained based on the MASK-RCNN (a semantic segmentation algorithm) network is used to extract the target features of each target from the image to be described.
Targets include objects and the background.
The target features include a target appearance feature, target position information, and a target mask map. The target appearance feature is the appearance feature of the target, that is, the image feature extracted from the image block corresponding to the target. The target position information is the position information of the target in the image to be described. The target mask map is a mask map (that is, a MASK map) generated according to the target appearance feature.
For S5, based on the multi-modal feature fusion method, the various features corresponding to each of the texts to be analyzed and the various features corresponding to each of the target features are first mapped into a learned common vector embedding space. The mapped features are then input into a model obtained based on a multi-layer Transformer network to predict words within the range of a predefined code table. Finally, either a recognized token (a word in the text to be analyzed) or a word predicted within the range of the predefined code table is selected as the current output; the output words are combined into a sentence in output order, and the sentence is taken as the image description result. During cyclic decoding, the model obtained based on the multi-layer Transformer network works auto-regressively, taking the previously output word as part of the current input. That is to say, the image description result is a sentence that describes the image to be described in language.
In one embodiment, the above step of performing text region detection according to the image to be described includes:
S21: performing down-sampling on the image to be described, to obtain down-sampled features;
S22: performing up-sampling on the down-sampled features, to obtain up-sampled features;
S23: performing cascade processing on the up-sampled features, to obtain a feature layer to be analyzed;
S24: performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
S25: performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
S26: performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map;
S27: generating the text regions according to the differentiable binarization map.
现有的DB算法是直接预测文本区域的概率图,然后根据人为设定的阈值判断每个像素是文字还是背景,这种处理方式过于粗暴,对于边界的检测往往比较难,单一的阈值会导致边界不够准确;并且自然场景中存在大量密集文本,不够准确的边界会让密集文本检测效果很差。为了解决上述问题,本实施例实现了对待描述图像分别进行下采样、上采样和级联处理,然后对级联处理的得到的特征层分别进行文本概率图预测和动态阈值图预测,最后根据文本概率图和动态阈值图进行可微分二值化计算,根据计算结果生成文本区域,实现了更关注边界区域,可微分二值化可以将阈值的设定放在模型的判定层中,让模型自行判断不同像素点设定的阈值大小,提高了边界识别的准确性,提高了密集文本检测效果,提高了图像描述的准确性。The existing DB algorithm directly predicts the probability map of the text area, and then judges whether each pixel is text or background according to the artificially set threshold. This processing method is too rough, and it is often difficult to detect the boundary. A single threshold will lead to The boundary is not accurate enough; and there are a lot of dense text in natural scenes, the boundary that is not accurate enough will make the dense text detection effect poor. In order to solve the above problems, this embodiment implements down-sampling, up-sampling and cascade processing on the image to be described, and then performs text probability map prediction and dynamic threshold map prediction on the feature layers obtained by the cascade processing, and finally according to the text Differentiable binarization calculations are performed on the probability map and dynamic threshold map, and the text area is generated according to the calculation results, so that more attention is paid to the boundary area. Differentiable binarization can set the threshold value in the decision layer of the model, allowing the model to automatically Judging the threshold value set by different pixel points improves the accuracy of boundary recognition, improves the effect of dense text detection, and improves the accuracy of image description.
For S21, a preset image segmentation model is used to perform downsampling processing on the image to be described, and the features obtained by downsampling are used as the downsampled features.
Optionally, the preset image segmentation model is a model trained on an FCN (Fully Convolutional Network). It can be understood that the preset image segmentation model may also be a model trained on another neural network, which is not limited here.
For S22, the preset image segmentation model is used to perform upsampling processing on the downsampled features, and the features obtained by upsampling are used as the upsampled features. By first downsampling and then upsampling with the preset image segmentation model, both high-level and low-level features can be exploited, which makes it easy to detect text regions at various scales: the high levels, with their large receptive fields, detect large text regions, while the low levels detect small ones. The resulting upsampled features therefore contain the rich semantic information of the high levels and the rich representational information of the low levels, giving a more complete feature expression.
For S23, the features in the upsampled features are cascaded into one feature layer, and the feature layer obtained by the cascade processing is used as the feature layer to be analyzed.
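As an illustration, a minimal PyTorch sketch of S21–S23 follows. It assumes a backbone has already produced four feature levels c2–c5 (the names, channel counts, and the use of bilinear upsampling are assumptions, not details from this disclosure): every level is brought to a common resolution and then cascaded into the feature layer to be analyzed.

```python
import torch
import torch.nn.functional as F

def build_feature_layer(c2, c3, c4, c5):
    # Upsample every level to the spatial size of the finest map (c2),
    # then cascade (concatenate) them along the channel dimension.
    size = c2.shape[-2:]
    levels = [c2] + [
        F.interpolate(c, size=size, mode="bilinear", align_corners=False)
        for c in (c3, c4, c5)
    ]
    return torch.cat(levels, dim=1)  # the feature layer to be analyzed

# Toy multi-scale features (batch 1, 64 channels, strides 4/8/16/32 of a 256px image)
c2, c3, c4, c5 = (torch.randn(1, 64, s, s) for s in (64, 32, 16, 8))
print(build_feature_layer(c2, c3, c4, c5).shape)  # torch.Size([1, 256, 64, 64])
```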
For S24, a preset text probability map prediction model is used to perform text probability map prediction according to the feature layer to be analyzed, and the predicted text probability map is used as the target text probability map.
The text probability map is a map formed by computing, for each pixel, the probability that it belongs to text.
The preset text probability map prediction model is a model obtained by training a neural network with a plurality of first training samples. A first training sample includes a first image sample and a first calibration value for each pixel. The first image sample is an image containing text. The first calibration value of a pixel indicates whether that pixel of the first image sample belongs to a text region, and takes one of two values: 0 (not a text region) or 1 (a text region).
For S25, a preset dynamic threshold map prediction model is used to perform dynamic threshold map prediction according to the feature layer to be analyzed, and the predicted dynamic threshold map is used as the target dynamic threshold map.
The dynamic threshold map is a map formed from the dynamic threshold of each pixel.
The preset dynamic threshold map prediction model is a model obtained by training a neural network with a plurality of second training samples. A second training sample includes a second image sample and a second calibration value for each pixel. The second image sample is an image containing text. The second calibration value of a pixel indicates whether that pixel of the second image sample belongs to a text region boundary, and takes one of two values: 0 (not a text region boundary) or 1 (a text region boundary).
For S26, the differentiable binarization calculation for the pixel in row i, column j of the differentiable binarization map is:

$$B_{ij} = \frac{1}{1 + e^{-k\,(P_{ij} - T_{ij})}}$$

where $P_{ij}$ is the probability of the pixel in row i, column j of the target text probability map, $T_{ij}$ is the threshold of the pixel in row i, column j of the target dynamic threshold map, k is an amplification factor (a constant), and e is the natural constant.
For S27, in the differentiable binarization map, pixels whose value is greater than a preset threshold are taken as text-region pixels, and pixels whose value is less than or equal to the preset threshold are taken as non-text-region pixels; adjacent text-region pixels are connected into graphic blocks, and each graphic block is taken as one text region.
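A small NumPy sketch of S26–S27 under the formula above (the value k = 50 is an assumption taken from common DB implementations, not from this disclosure):

```python
import numpy as np

def differentiable_binarization(P, T, k=50.0):
    # B_ij = 1 / (1 + exp(-k * (P_ij - T_ij)))
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

# Toy 2x2 probability and dynamic-threshold maps
P = np.array([[0.9, 0.2],
              [0.8, 0.1]])
T = np.full((2, 2), 0.5)
B = differentiable_binarization(P, T)
text_pixels = B > 0.5  # pixels above the preset threshold belong to text regions
print(text_pixels)     # [[ True False] [ True False]]
# S27 would then connect adjacent True pixels into blocks, one region per block.
```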
In one embodiment, the above step of performing text recognition on each text region according to the image to be described to obtain the text to be analyzed includes:
S31: extracting, according to each text region, an image block from the image to be described to obtain a text image block;
S32: performing, with a model obtained based on a convolutional neural network, feature map extraction at a preset height on each text image block to obtain a feature atlas;
S33: sorting the feature maps in each feature atlas by position to obtain a time-series feature atlas;
S34: inputting each time-series feature atlas into a model obtained based on a recurrent neural network for text recognition to obtain the text to be analyzed corresponding to each text region, where the preset labels in a preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, the preset labels including text and a placeholder.
In this embodiment, feature maps of a preset height are extracted from the text image block corresponding to each text region, the feature maps are sorted by position to generate a time-series feature atlas, and each time-series feature atlas is then input into a model obtained based on a recurrent neural network for text recognition. Sorting the feature maps by position to generate the time-series feature atlas reduces the width of the feature maps, which benefits model training. Because the width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right, more divisions make it less likely that characters are missed; but too many divisions easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated, which harms recognition accuracy. By making the preset labels include text and a placeholder, this embodiment uses the placeholder as a preset label so that genuinely repeated characters are separated by it: if, during decoding, there is no placeholder code between the codes of identical characters, only one of those characters is kept, further improving the accuracy of text recognition.
For S31, the image block corresponding to each text region is extracted from the image to be described, and each extracted image block is used as a text image block.
For S32, a model obtained based on a convolutional neural network (CNN) is used to extract feature maps of a preset height from each text image block; that is, the height of each feature map equals the preset height, and the extracted feature maps form a feature atlas.
For S33, the feature maps in each feature atlas are sorted by position, yielding a time-series feature atlas in which the order of positions serves as the time order.
For S34, each time-series feature atlas is input into the model obtained based on a recurrent neural network for text recognition. Because the preset labels in the preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of that model, the range of the text recognition result is the preset labels in the preset label dictionary.
Because the width of the image after feature extraction by the model obtained based on the convolutional neural network equals the number of times the image is predicted from left to right, more divisions make it less likely that characters are missed, but too many divisions easily cause one character to be recognized several times, making it impossible to judge whether characters are genuinely repeated and harming recognition accuracy. By also using the placeholder as a preset label, genuinely repeated characters are separated by it: if, during decoding, there is no placeholder code between the codes of identical characters, only one of those characters needs to be kept, further improving the accuracy of text recognition. The preset label dictionary includes preset labels and codes, set in one-to-one correspondence.
Optionally, the code corresponding to the placeholder is set to 0.
For example, using the 26 letters as the text preset labels and setting the placeholder code to 0: when the codes of the predicted labels are [2, 2, 2, 0, 15, 15, 0, 15, 15, 11], the 0 (the placeholder code) inside "15, 15, 0, 15, 15" establishes two repeated characters, while "2, 2, 2" contains no 0 and therefore expresses a single character, so the text to be analyzed is determined to be "book".
As another example, again using the 26 letters as the text preset labels and setting the placeholder code to 0: when the codes of the predicted labels are [0, 0, 2, 15, 15, 15, 15, 0, 11, 11], there is no 0 inside "15, 15, 15, 15", so it expresses a single character, and the text to be analyzed is determined to be "bok". These examples are illustrative and not limiting.
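The placeholder-based decoding described above can be sketched as a CTC-style greedy collapse; the letter-to-code mapping a = 1 … z = 26 with placeholder 0 follows the two examples:

```python
def decode_with_placeholder(codes, blank=0):
    """Collapse repeats not separated by the placeholder code,
    then drop the placeholders."""
    out, prev = [], None
    for c in codes:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

alphabet = {i + 1: ch for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
print("".join(alphabet[c] for c in decode_with_placeholder(
    [2, 2, 2, 0, 15, 15, 0, 15, 15, 11])))  # -> "book"
print("".join(alphabet[c] for c in decode_with_placeholder(
    [0, 0, 2, 15, 15, 15, 15, 0, 11, 11])))  # -> "bok"
```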
The model obtained based on a recurrent neural network is a model obtained based on a bidirectional LSTM (long short-term memory) network.
In one embodiment, the above step of performing target feature extraction according to the image to be described includes:
S41: performing image feature map extraction on the image to be described to obtain an image feature map to be analyzed;
S42: performing target candidate region extraction according to the image to be described with a model obtained based on a region proposal network;
S43: extracting, according to each target candidate region, image features from the image feature map to be analyzed to obtain target appearance features;
S44: performing classification prediction according to each regional image feature to obtain a classification prediction result, where the classification labels of the classification prediction include a plurality of object labels and one background label;
S45: performing position regression processing according to each target appearance feature to obtain target position information;
S46: generating a mask map according to each target appearance feature with a model obtained based on a fully convolutional network to obtain a target mask map;
S47: taking the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
In this embodiment, each target feature is extracted with a model obtained based on ResNet-FPN, the ROI Align method, and a model obtained based on a fully convolutional network. Each target feature includes the target appearance feature, the target position information, and the target mask map, which provides the basis for subsequently understanding the relationships between targets and for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S41, a model obtained based on ResNet-FPN is used to extract image features from the image to be described, and the extracted image features are used as the image feature map to be analyzed.
FPN (Feature Pyramid Network) is a general-purpose architecture that can be combined with various backbone networks, such as VGG (a deep convolutional neural network) or ResNet (a residual network).
For S42, a model obtained based on a region proposal network is used to extract target candidate regions from the image to be described. The model uses boxes of preset sizes (also called prior boxes, or anchors) to predict candidate regions belonging to targets, and each predicted candidate region is used as one target candidate region.
The model obtained based on the region proposal network is a model obtained based on an RPN (Region Proposal Network).
For S43, based on the ROI Align method, the image feature map corresponding to each target candidate region is extracted from the image feature map to be analyzed, and the image features corresponding to each target candidate region are used as the target appearance features.
ROI Align is a regional feature aggregation method.
The ROI Align method first maps the RoI onto the feature map; it then divides the mapped region evenly into 2×2 units, each containing, say, four sampling points. Since the sampling points do not fall on integer coordinates, bilinear interpolation is used to estimate the value at each sampling point; finally, pooling is performed within each unit so that each unit yields one value, producing a 2×2 feature map.
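For reference, torchvision ships an implementation of this operation; a usage sketch under assumed sizes (a stride-16 feature map, and a 7×7 output rather than the 2×2 of the illustration) might look like:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)  # backbone feature map (stride 16 assumed)
# One RoI: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0.0, 48.0, 48.0, 320.0, 320.0]])
pooled = roi_align(features, rois, output_size=(7, 7),
                   spatial_scale=1 / 16, sampling_ratio=4)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```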
For S44, because the classification labels of the classification prediction include a plurality of object labels and one background label, classification prediction is performed according to each regional image feature to predict whether the image corresponding to that feature is background or an object. It can be understood that predicting an object here means predicting which specific object it is.
Optionally, a model obtained based on an FC layer (fully connected layer) and a softmax activation function is used to perform classification prediction according to each regional image feature to obtain the classification prediction result.
For S45, position regression processing is performed on the target according to each target appearance feature, and the predicted position information is used as the target position information.
Optionally, a model obtained based on an FC layer and a bbox regressor is used to perform the position regression processing according to each target appearance feature.
A bbox regressor is a bounding-box regressor.
The target position information is the position information of the detection box corresponding to the target.
For S46, a model obtained based on a fully convolutional network is used to perform pixel-by-pixel mask prediction according to each target appearance feature, and the target mask map is obtained from the mask prediction results.
For S47, the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region are taken as one target feature; that is, the target appearance feature, target position information, and target mask map within one target feature all correspond to the same target. It can be understood that there may be one target feature or several.
In one embodiment, the above step of generating an image description according to the image to be described, each text to be analyzed, and each target feature based on the method of multimodal feature fusion to obtain the image description result includes:
S51: performing feature fusion according to each target feature to obtain a first fusion feature;
S52: performing feature fusion on each text to be analyzed according to the image to be described to obtain a second fusion feature;
S53: performing word prediction according to each first fusion feature and each second fusion feature with a model obtained based on an iterative Transformer to obtain a word prediction result;
S54: performing image description generation according to the word prediction result and each text to be analyzed with a model obtained based on a dynamic pointer network to obtain the image description result.
In this embodiment, feature fusion is first performed on the target features and then on the texts to be analyzed; a model obtained based on an iterative Transformer then performs word prediction from the two fusion results, and finally the image description is generated from the word prediction result and the texts to be analyzed. This realizes the multimodal-feature-fusion-based approach of understanding the recognized text of the image in connection with its environment to generate the image description, so that the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
For S51, the various features corresponding to each target feature are mapped into the general vector embedding space obtained through learning, and the mapped features are used as the first fusion feature.
For S52, according to the image to be described, the various features corresponding to each text to be analyzed are mapped into the general vector embedding space obtained through learning, and the mapped features are used as the second fusion feature.
For S53, the input of the model obtained based on the iterative Transformer includes the first fusion features, the second fusion features, and the word predicted at the previous moment. For example, when predicting the third word, the predicted second word is fed into the model obtained based on the iterative Transformer as the word predicted at the previous moment.
The Transformer is likewise composed of an encoder and a decoder, and it uses the attention mechanism to solve natural-language translation problems.
The first fusion features, the second fusion features, and the historically predicted words are input into the model obtained based on the iterative Transformer, giving a matrix of shape (M + N + P, d), where d is a preset dimension, M corresponds to the historically predicted words, N to the words predicted at the current moment, and P is the time step at the current moment, identifying the number of words already predicted; the initial value of P is 1.
The word prediction result includes one or more words.
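The autoregressive loop can be sketched as follows (`step_model`, `BOS`, and `EOS` are hypothetical names; the disclosure does not specify the interface):

```python
def generate(step_model, first_fusion, second_fusion,
             BOS="<s>", EOS="</s>", max_len=20):
    # At each time step the words decoded so far are fed back in,
    # and the model predicts the next word.
    words = [BOS]
    for _ in range(max_len):
        next_word = step_model(first_fusion, second_fusion, words)
        if next_word == EOS:
            break
        words.append(next_word)
    return words[1:]  # the predicted words, in output order
```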
For S54, the model obtained based on the dynamic pointer network is a model obtained based on a DPN (Dynamic Pointer Network).
The model obtained based on the dynamic pointer network models a probability distribution over the word prediction result and each text to be analyzed, which is used to decide from which of the two the current word is selected as a word of the image description. It can be understood that the words selected by the model obtained based on the dynamic pointer network are spliced into a sentence, and the spliced sentence is used as the image description result.
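A toy sketch of the selection idea, choosing each output word from whichever source scores higher (the scoring itself is what the dynamic pointer network models; the names and numbers below are made up for illustration, not taken from this disclosure):

```python
import numpy as np

def pointer_select(vocab_scores, ocr_scores, vocab, ocr_tokens):
    # Pick the current output word from either the fixed vocabulary
    # (the predefined code table) or the recognized OCR tokens.
    all_scores = np.concatenate([vocab_scores, ocr_scores])
    idx = int(all_scores.argmax())
    return vocab[idx] if idx < len(vocab) else ocr_tokens[idx - len(vocab)]

vocab = ["a", "sign", "on", "the", "wall", "red"]
ocr_tokens = ["Mornington", "Crescent"]
print(pointer_select(np.array([0.1, 0.3, 0.2, 0.1, 0.1, 0.1]),
                     np.array([0.6, 0.4]), vocab, ocr_tokens))  # Mornington
```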
In one embodiment, the above step of performing feature fusion according to each target feature to obtain the first fusion feature includes:
S511: acquiring one target feature as the target feature to be processed;
S512: encoding the target position information in the target feature to be processed to obtain a target position code;
S513: linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, obtaining the first fusion feature corresponding to the target feature to be processed;
S514: repeating the step of acquiring one target feature as the target feature to be processed until the acquisition of the target features is completed.
In this embodiment, the various features of each target feature are linearly transformed and mapped into the vector embedding space of the first preset dimension, providing the basis for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S511, any one target feature is acquired from the target features, and the acquired target feature is used as the target feature to be processed.
For S512, the target position information in the target feature to be processed is encoded to obtain the target position code.
The target position code can be expressed as:

$$x^{loc} = \left[\frac{x_{min}}{W_{im}},\ \frac{y_{min}}{H_{im}},\ \frac{x_{max}}{W_{im}},\ \frac{y_{max}}{H_{im}}\right]$$

where $x_{min}$ is the minimum horizontal pixel position, $y_{min}$ the minimum vertical pixel position, $x_{max}$ the maximum horizontal pixel position, $y_{max}$ the maximum vertical pixel position, $W_{im}$ the width, and $H_{im}$ the height.
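A direct transcription of the position code reconstructed above (the pairing of each coordinate with the image width or height follows that reconstruction):

```python
def encode_position(x_min, y_min, x_max, y_max, w_im, h_im):
    # 4-dimensional relative bounding-box encoding
    return [x_min / w_im, y_min / h_im, x_max / w_im, y_max / h_im]

print(encode_position(48, 120, 320, 360, w_im=640, h_im=480))
# [0.075, 0.25, 0.5, 0.75]
```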
For S513, the target appearance feature and the target position code in the target feature to be processed are linearly transformed so as to map them into the vector embedding space of the first preset dimension, expressed as:

$$x^{obj} = \mathrm{LN}(W_1\, x^{fr}) + \mathrm{LN}(W_2\, x^{loc})$$

where LN is Linear Normalization, i.e. a linear normalization operation; $W_1$ and $W_2$ are preset constants; $x^{fr}$ is the target appearance feature in the target feature to be processed; and $x^{loc}$ is the target position code. The result of this formula is the first fusion feature corresponding to the target feature to be processed.
For S514, steps S511 to S514 are repeated until the acquisition of the target features is completed. When the acquisition of the target features is completed, the feature fusion of every target feature has been completed.
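A minimal PyTorch sketch of S513 under the formula above. The feature dimensions (a 2048-d appearance feature, a 768-d embedding space) are assumptions, and LN is realized here as a normalization layer with W1 and W2 as linear maps:

```python
import torch
import torch.nn as nn

d = 768                                   # first preset dimension (assumed)
W1 = nn.Linear(2048, d, bias=False)       # maps the appearance feature
W2 = nn.Linear(4, d, bias=False)          # maps the 4-d position code
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

x_fr = torch.randn(1, 2048)               # target appearance feature
x_loc = torch.rand(1, 4)                  # target position code
x_obj = ln1(W1(x_fr)) + ln2(W2(x_loc))    # first fusion feature
print(x_obj.shape)                        # torch.Size([1, 768])
```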
In one embodiment, the above step of performing feature fusion on each text to be analyzed according to the image to be described to obtain the second fusion feature includes:
S521: acquiring any one text to be analyzed as the text to be processed;
S522: performing word vector encoding of a second preset dimension according to the text to be processed to obtain a text block word vector;
S523: performing image feature extraction from the image to be described according to the text to be processed to obtain a text block image feature;
S524: performing text encoding of a third preset dimension on the text to be processed to obtain a text block encoding feature;
S525: determining position information of the text to be processed according to the image to be described to obtain text position information;
S526: encoding the text position information to obtain a text position code;
S527: linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, obtaining the second fusion feature corresponding to the text to be processed;
S528: repeating the step of acquiring any one text to be analyzed as the text to be processed until the acquisition of the texts to be analyzed is completed.
In this embodiment, the various features of each text to be analyzed are linearly transformed and mapped into the vector embedding space of the fourth preset dimension, providing the basis for the method based on multimodal feature fusion to understand the recognized text of the image in connection with its environment so as to generate the image description.
For S521, any one text to be analyzed is acquired from the texts to be analyzed, and the acquired text is used as the text to be processed.
For S522, FastText word vector encoding of the second preset dimension is performed according to the text to be processed, and the encoded FastText word vector is used as the text block word vector.
Optionally, the second preset dimension is set to 300.
A FastText word vector is a word vector generated with FastText (a shallow network), which is a tool for word vector computation and text classification.
For S523, a model obtained based on Faster RCNN (a target detection algorithm) is used to extract, from the image to be described, the image features corresponding to the text to be processed as the text block image feature.
For S524, the PHOC (Pyramidal Histogram of Characters) encoding method is used to perform text encoding of the third preset dimension on the text to be processed, and the encoded data is used as the text block encoding feature.
Optionally, the third preset dimension is set to 604.
For S525, according to the image to be described, position information is determined for the text region corresponding to the text to be processed, and the determined position information is used as the text position information.
The text position information is the position information, in the image to be described, of the text region corresponding to the text to be processed.
For S526, the text position information is encoded, and the encoded data is used as the text position code.
For S527, the text block word vector, the text block image feature, the text block encoding feature, and the text position code are linearly transformed so as to map them into the vector embedding space of the fourth preset dimension, expressed as:

$$x^{txt} = \mathrm{LN}(W_3\, x^{ft} + W_4\, x^{fr} + W_5\, x^{ph}) + \mathrm{LN}(W_6\, x^{pos})$$

where LN is Linear Normalization, i.e. a linear normalization operation; $W_3$ through $W_6$ are preset constants; $x^{ft}$ is the text block word vector; $x^{fr}$ is the text block image feature; $x^{ph}$ is the text block encoding feature; and $x^{pos}$ is the text position code. The result of this formula is the second fusion feature corresponding to the text to be processed.
For S528, steps S521 to S528 are repeated until the acquisition of the texts to be analyzed is completed. When the acquisition of the texts to be analyzed is completed, the feature fusion of every text to be analyzed has been completed.
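By analogy with the sketch for S513, S527 can be sketched as follows; the 300-d and 604-d inputs come from the optional dimensions above, while the 2048-d image feature and the 768-d embedding space are assumptions:

```python
import torch
import torch.nn as nn

d = 768                             # fourth preset dimension (assumed)
W3 = nn.Linear(300, d, bias=False)  # text block word vector (FastText, 300-d)
W4 = nn.Linear(2048, d, bias=False) # text block image feature (assumed 2048-d)
W5 = nn.Linear(604, d, bias=False)  # text block encoding feature (PHOC, 604-d)
W6 = nn.Linear(4, d, bias=False)    # text position code
ln_a, ln_b = nn.LayerNorm(d), nn.LayerNorm(d)

x_ft, x_fr = torch.randn(1, 300), torch.randn(1, 2048)
x_ph, x_pos = torch.randn(1, 604), torch.rand(1, 4)
x_txt = ln_a(W3(x_ft) + W4(x_fr) + W5(x_ph)) + ln_b(W6(x_pos))
print(x_txt.shape)                  # torch.Size([1, 768]), the second fusion feature
```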
Referring to Fig. 2, the present application further provides an artificial intelligence-based image description generation apparatus, the apparatus including:
an image acquisition module 100, configured to acquire an image to be described;
a text region detection module 200, configured to perform text region detection according to the image to be described;
a text recognition module 300, configured to perform text recognition on each text region according to the image to be described to obtain text to be analyzed;
a target feature extraction module 400, configured to perform target feature extraction according to the image to be described;
an image description generation module 500, configured to perform image description generation according to the image to be described, each text to be analyzed, and each target feature based on the method of multimodal feature fusion to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
Referring to Fig. 3, an embodiment of the present application further provides a computer device, which may be a server whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as that used by the artificial intelligence-based image description generation method. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements an artificial intelligence-based image description generation method, where the method includes the steps of:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
performing target feature extraction according to the image to be described;
performing, based on the method of multimodal feature fusion, image description generation according to the image to be described, each text to be analyzed, and each target feature to obtain an image description result.
In this embodiment, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements an artificial intelligence-based image description generation method, where the method includes the steps of:
acquiring an image to be described;
performing text region detection according to the image to be described;
performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
performing target feature extraction according to the image to be described;
performing, based on the method of multimodal feature fusion, image description generation according to the image to be described, each text to be analyzed, and each target feature to obtain an image description result.
In the artificial intelligence-based image description generation method executed above, text region detection is first performed according to the image to be described; text recognition is performed on each text region according to the image to be described to obtain the text to be analyzed; target feature extraction is then performed according to the image to be described; and finally, based on the method of multimodal feature fusion, image description generation is performed according to the image to be described, each text to be analyzed, and each target feature to obtain the image description result. By understanding the recognized text of the image in connection with its environment through the method based on multimodal feature fusion to generate the image description, the rich information in the image is expressed thoroughly and completely in language, improving the accuracy of the image description.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium or in a volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprise", "include", or any of their variants are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
The above are only preferred embodiments of the present application and do not therefore limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

1. An artificial intelligence-based image description generation method, wherein the method comprises:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each text region according to the image to be described to obtain text to be analyzed;
    performing target feature extraction according to the image to be described;
    performing, based on a method of multimodal feature fusion, image description generation according to the image to be described, each said text to be analyzed, and each said target feature to obtain an image description result.
2. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling processing on the image to be described to obtain downsampled features;
    performing upsampling processing on the downsampled features to obtain upsampled features;
    performing cascade processing on the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map to obtain a differentiable binarization map;
    performing text region generation according to the differentiable binarization map.
3. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing text recognition on each said text region according to the image to be described to obtain the text to be analyzed comprises:
    extracting, according to each said text region, an image block from the image to be described to obtain a text image block;
    performing, with a model obtained based on a convolutional neural network, feature map extraction at a preset height on each said text image block to obtain a feature atlas;
    sorting the feature maps in each said feature atlas by position to obtain a time-series feature atlas;
    inputting each said time-series feature atlas into a model obtained based on a recurrent neural network for text recognition to obtain the text to be analyzed corresponding to each said text region, wherein the preset labels in a preset label dictionary are used as the prediction labels of the output dimension of the embedding layer of the model obtained based on the recurrent neural network, and the preset labels include text and a placeholder.
4. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described to obtain an image feature map to be analyzed;
    performing target candidate region extraction according to the image to be described with a model obtained based on a region proposal network;
    extracting, according to each said target candidate region, image features from the image feature map to be analyzed to obtain target appearance features;
    performing classification prediction according to each said regional image feature to obtain a classification prediction result, wherein the classification labels of the classification prediction include a plurality of object labels and one background label;
    performing position regression processing according to each said target appearance feature to obtain target position information;
    generating a mask map according to each said target appearance feature with a model obtained based on a fully convolutional network to obtain a target mask map;
    taking the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one said target feature.
5. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing image description generation according to the image to be described, each said text to be analyzed, and each said target feature based on the method of multimodal feature fusion to obtain the image description result comprises:
    performing feature fusion according to each said target feature to obtain a first fusion feature;
    performing feature fusion on each said text to be analyzed according to the image to be described to obtain a second fusion feature;
    performing word prediction according to each said first fusion feature and each said second fusion feature with a model obtained based on an iterative Transformer to obtain a word prediction result;
    performing image description generation according to the word prediction result and each said text to be analyzed with a model obtained based on a dynamic pointer network to obtain the image description result.
6. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing feature fusion according to each said target feature to obtain the first fusion feature comprises:
    acquiring one said target feature as a target feature to be processed;
    encoding the target position information in the target feature to be processed to obtain a target position code;
    linearly transforming the target appearance feature and the target position code in the target feature to be processed so as to map them into a vector embedding space of a first preset dimension, obtaining the first fusion feature corresponding to the target feature to be processed;
    repeating the step of acquiring one said target feature as the target feature to be processed until the acquisition of the target features is completed.
7. The artificial intelligence-based image description generation method according to claim 1, wherein the step of performing feature fusion on each said text to be analyzed according to the image to be described to obtain the second fusion feature comprises:
    acquiring any one said text to be analyzed as a text to be processed;
    performing word vector encoding of a second preset dimension according to the text to be processed to obtain a text block word vector;
    performing image feature extraction from the image to be described according to the text to be processed to obtain a text block image feature;
    performing text encoding of a third preset dimension on the text to be processed to obtain a text block encoding feature;
    determining position information of the text to be processed according to the image to be described to obtain text position information;
    encoding the text position information to obtain a text position code;
    linearly transforming the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, obtaining the second fusion feature corresponding to the text to be processed;
    repeating the step of acquiring any one said text to be analyzed as the text to be processed until the acquisition of the texts to be analyzed is completed.
8. An artificial intelligence-based image description generation apparatus, wherein the apparatus comprises:
    an image acquisition module, configured to acquire an image to be described;
    a text region detection module, configured to perform text region detection according to the image to be described;
    a text recognition module, configured to perform text recognition on each said text region according to the image to be described to obtain text to be analyzed;
    a target feature extraction module, configured to perform target feature extraction according to the image to be described;
    an image description generation module, configured to perform image description generation according to the image to be described, each said text to be analyzed, and each said target feature based on a method of multimodal feature fusion to obtain an image description result.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an artificial intelligence-based image description generation method;
    wherein the artificial intelligence-based image description generation method comprises the steps of:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each said text region according to the image to be described to obtain text to be analyzed;
    performing target feature extraction according to the image to be described;
    performing, based on a method of multimodal feature fusion, image description generation according to the image to be described, each said text to be analyzed, and each said target feature to obtain an image description result.
  10. The computer device according to claim 9, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling on the image to be described to obtain downsampled features;
    performing upsampling on the downsampled features to obtain upsampled features;
    concatenating the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map; and
    generating the text regions according to the differentiable binarization map.
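[Editorial note] The final computation of claim 10 matches the differentiable binarization used in DBNet (CN110781967B, cited under Family Cites below): B = 1 / (1 + e^(-k(P - T))). A minimal sketch; the amplification factor k = 50 follows the DBNet paper, since the application itself fixes no value:

```python
import numpy as np

def differentiable_binarization(prob_map: np.ndarray,
                                thresh_map: np.ndarray,
                                k: float = 50.0) -> np.ndarray:
    """Approximate step function B = 1 / (1 + exp(-k * (P - T))).
    Pixels whose text probability exceeds the learned per-pixel threshold
    approach 1, others approach 0, while the map stays differentiable."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```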
  11. The computer device according to claim 9, wherein the step of performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed, comprises:
    extracting an image patch from the image to be described according to each text region, to obtain text image patches;
    extracting feature maps of a preset height from each text image patch by using a model based on a convolutional neural network, to obtain feature map sets;
    sorting the feature maps in each feature map set by position, to obtain time-series feature map sets; and
    inputting each time-series feature map set into a model based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each text region, wherein each preset label in a preset label dictionary serves as a prediction label for the output dimension of the embedding layer of the recurrent-neural-network-based model, the preset labels comprising text characters and a placeholder.
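[Editorial note] Claim 11's arrangement, CNN column features read in positional order by a recurrent network and classified over a label dictionary that ends in a placeholder, is the familiar CRNN/CTC pattern. A sketch under that reading; the feature width, hidden size, and character count are assumptions, and a plain linear classifier stands in for the claim's "embedding layer":

```python
import torch
import torch.nn as nn

class SequenceRecognizer(nn.Module):
    """Sketch: read CNN column features in order with a bidirectional LSTM
    and score every position over the label dictionary plus a placeholder."""

    def __init__(self, feat_dim=512, hidden=256, num_chars=6623):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # +1 output for the placeholder label, as the claim requires.
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, column_features):  # (batch, seq_len, feat_dim)
        seq, _ = self.rnn(column_features)
        return self.classifier(seq)      # per-position label scores
```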
  12. The computer device according to claim 9, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described, to obtain an image feature map to be analyzed;
    extracting target candidate regions according to the image to be described, by using a model based on a region proposal network;
    extracting image features from the image feature map to be analyzed according to each target candidate region, to obtain target appearance features;
    performing classification prediction according to each of the extracted region image features, to obtain a classification prediction result, wherein the classification labels of the classification prediction comprise a plurality of object labels and one background label;
    performing position regression according to each target appearance feature, to obtain target position information;
    generating a mask map according to each target appearance feature, by using a model based on a fully convolutional network, to obtain target mask maps; and
    using the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
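[Editorial note] Claim 12's combination of a region proposal network, per-region appearance features, box regression, object-plus-background classification, and a fully convolutional mask head mirrors a Mask R-CNN-style detector. A sketch using the torchvision reference model as a stand-in; this substitution is ours, not the application's network, and `weights=None` yields random weights (pass pretrained weights in practice):

```python
import torch
import torchvision

# Stand-in detector producing, per candidate region, the three components
# the claim bundles into one "target feature": position (regressed box),
# class label, and a mask map.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
model.eval()

image = torch.rand(3, 480, 640)  # placeholder for the image to be described
with torch.no_grad():
    pred = model([image])[0]

target_features = [
    {"position": box, "label": label, "mask": mask}
    for box, label, mask in zip(pred["boxes"], pred["labels"], pred["masks"])
]
```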
  13. The computer device according to claim 9, wherein the step of performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features based on the multimodal feature fusion method, to obtain the image description result, comprises:
    performing feature fusion according to each target feature, to obtain first fusion features;
    performing feature fusion on each text to be analyzed according to the image to be described, to obtain second fusion features;
    performing word prediction according to each of the first fusion features and each of the second fusion features, by using a model based on an iterative Transformer, to obtain a word prediction result; and
    performing image description generation according to the word prediction result and each text to be analyzed, by using a model based on a dynamic pointer network, to obtain the image description result.
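[Editorial note] Claim 13 pairs an iterative Transformer with a dynamic pointer network so the decoder can either emit a fixed-vocabulary word or copy one of the recognized scene texts verbatim. A sketch of the pointer step under that reading; the bilinear scoring form and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DynamicPointer(nn.Module):
    """Sketch: score the decoder hidden state both against a fixed
    vocabulary and against the embeddings of the OCR'd texts, so a
    scene-text token can be copied into the caption."""

    def __init__(self, d_model=768, vocab_size=30000):
        super().__init__()
        self.vocab_head = nn.Linear(d_model, vocab_size)
        self.ptr_query = nn.Linear(d_model, d_model)
        self.ptr_key = nn.Linear(d_model, d_model)

    def forward(self, hidden, ocr_embeddings):
        # hidden: (batch, d_model); ocr_embeddings: (batch, n_ocr, d_model)
        vocab_scores = self.vocab_head(hidden)           # fixed-vocabulary words
        q = self.ptr_query(hidden).unsqueeze(1)          # (batch, 1, d_model)
        k = self.ptr_key(ocr_embeddings)                 # (batch, n_ocr, d_model)
        copy_scores = (q * k).sum(-1)                    # (batch, n_ocr)
        # Predicted token = argmax over vocabulary and OCR tokens jointly.
        return torch.cat([vocab_scores, copy_scores], dim=-1)
```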
  14. The computer device according to claim 9, wherein the step of performing feature fusion according to each target feature, to obtain the first fusion features, comprises:
    acquiring one of the target features as a target feature to be processed;
    encoding the target position information in the target feature to be processed, to obtain a target position code;
    applying a linear transformation to the target appearance feature in the target feature to be processed and the target position code so as to map them into a vector embedding space of a first preset dimension, to obtain the first fusion feature corresponding to the target feature to be processed; and
    repeating the step of acquiring one of the target features as the target feature to be processed, until acquisition of all the target features is completed.
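[Editorial note] Claim 14 reduces to a single learned projection: concatenate each target's appearance feature with its encoded position and map the result into the first preset-dimension embedding space. A minimal sketch, assuming a normalized 4-value box code and illustrative sizes:

```python
import torch
import torch.nn as nn

class TargetFeatureFusion(nn.Module):
    """Sketch: project appearance feature + position code into the (first)
    preset-dimension embedding space. All sizes are assumptions."""

    def __init__(self, d_appearance=2048, d_pos=4, d_embed=768):
        super().__init__()
        self.proj = nn.Linear(d_appearance + d_pos, d_embed)

    def forward(self, appearance, box):
        # box: normalized (x1, y1, x2, y2) serving as the target position code
        return self.proj(torch.cat([appearance, box], dim=-1))
```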
  15. The computer device according to claim 9, wherein the step of performing feature fusion on each text to be analyzed according to the image to be described, to obtain the second fusion features, comprises:
    acquiring any one of the texts to be analyzed as a text to be processed;
    performing word vector encoding of a second preset dimension according to the text to be processed, to obtain a text block word vector;
    extracting image features from the image to be described according to the text to be processed, to obtain a text block image feature;
    performing text encoding of a third preset dimension on the text to be processed, to obtain a text block encoding feature;
    determining position information of the text to be processed according to the image to be described, to obtain text position information;
    encoding the text position information to obtain a text position code;
    applying a linear transformation to the text block word vector, the text block image feature, the text block encoding feature, and the text position code so as to map them into a vector embedding space of a fourth preset dimension, to obtain the second fusion feature corresponding to the text to be processed; and
    repeating the step of acquiring any one of the texts to be analyzed as the text to be processed, until acquisition of all the texts to be analyzed is completed.
  16. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements an artificial intelligence-based image description generation method;
    wherein the artificial intelligence-based image description generation method comprises the steps of:
    acquiring an image to be described;
    performing text region detection according to the image to be described;
    performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed;
    performing target feature extraction according to the image to be described; and
    performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features, based on a multimodal feature fusion method, to obtain an image description result.
  17. The computer-readable storage medium according to claim 16, wherein the step of performing text region detection according to the image to be described comprises:
    performing downsampling on the image to be described to obtain downsampled features;
    performing upsampling on the downsampled features to obtain upsampled features;
    concatenating the upsampled features to obtain a feature layer to be analyzed;
    performing text probability map prediction according to the feature layer to be analyzed, to obtain a target text probability map;
    performing dynamic threshold map prediction according to the feature layer to be analyzed, to obtain a target dynamic threshold map;
    performing a differentiable binarization calculation according to the target text probability map and the target dynamic threshold map, to obtain a differentiable binarization map; and
    generating the text regions according to the differentiable binarization map.
  18. The computer-readable storage medium according to claim 16, wherein the step of performing text recognition on each text region according to the image to be described, to obtain texts to be analyzed, comprises:
    extracting an image patch from the image to be described according to each text region, to obtain text image patches;
    extracting feature maps of a preset height from each text image patch by using a model based on a convolutional neural network, to obtain feature map sets;
    sorting the feature maps in each feature map set by position, to obtain time-series feature map sets; and
    inputting each time-series feature map set into a model based on a recurrent neural network for text recognition, to obtain the text to be analyzed corresponding to each text region, wherein each preset label in a preset label dictionary serves as a prediction label for the output dimension of the embedding layer of the recurrent-neural-network-based model, the preset labels comprising text characters and a placeholder.
  19. The computer-readable storage medium according to claim 16, wherein the step of performing target feature extraction according to the image to be described comprises:
    performing image feature map extraction on the image to be described, to obtain an image feature map to be analyzed;
    extracting target candidate regions according to the image to be described, by using a model based on a region proposal network;
    extracting image features from the image feature map to be analyzed according to each target candidate region, to obtain target appearance features;
    performing classification prediction according to each of the extracted region image features, to obtain a classification prediction result, wherein the classification labels of the classification prediction comprise a plurality of object labels and one background label;
    performing position regression according to each target appearance feature, to obtain target position information;
    generating a mask map according to each target appearance feature, by using a model based on a fully convolutional network, to obtain target mask maps; and
    using the target appearance feature, the target position information, and the target mask map corresponding to the same target candidate region as one target feature.
  20. The computer-readable storage medium according to claim 16, wherein the step of performing image description generation according to the image to be described, each of the texts to be analyzed, and each of the target features based on the multimodal feature fusion method, to obtain the image description result, comprises:
    performing feature fusion according to each target feature, to obtain first fusion features;
    performing feature fusion on each text to be analyzed according to the image to be described, to obtain second fusion features;
    performing word prediction according to each of the first fusion features and each of the second fusion features, by using a model based on an iterative Transformer, to obtain a word prediction result; and
    performing image description generation according to the word prediction result and each text to be analyzed, by using a model based on a dynamic pointer network, to obtain the image description result.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210028089.7 2022-01-11
CN202210028089.7A CN114387430B (en) 2022-01-11 2022-01-11 Image description generation method, device, equipment and medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
WO2023134073A1 (en)

Family

ID=81201154

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090158 WO2023134073A1 (en) 2022-01-11 2022-04-29 Artificial intelligence-based image description generation method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN114387430B (en)
WO (1) WO2023134073A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387430B (en) * 2022-01-11 2024-05-28 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence
CN114627353B (en) * 2022-03-21 2023-12-12 北京有竹居网络技术有限公司 Image description generation method, device, equipment, medium and product
CN114821271B (en) * 2022-05-19 2022-09-16 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium
CN116051811B (en) * 2023-03-31 2023-07-04 深圳思谋信息科技有限公司 Region identification method, device, computer equipment and computer readable storage medium
CN116912851A (en) * 2023-07-25 2023-10-20 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
CN110033008A * 2019-04-29 2019-07-19 同济大学 A kind of image description generation method concluded based on modal transformation and text
CN111368118A (en) * 2020-02-13 2020-07-03 中山大学 Image description generation method, system, device and storage medium
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN114387430A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102905056B (en) * 2012-10-18 2015-09-02 利亚德光电股份有限公司 Method of video image processing and device
KR20230129195A (en) * 2017-04-25 2023-09-06 더 보드 어브 트러스티스 어브 더 리랜드 스탠포드 주니어 유니버시티 Dose reduction for medical imaging using deep convolutional neural networks
CN110781967B (en) * 2019-10-29 2022-08-19 华中科技大学 Real-time text detection method based on differentiable binarization
CN111581510B (en) * 2020-05-07 2024-02-09 腾讯科技(深圳)有限公司 Shared content processing method, device, computer equipment and storage medium
CN113806587A (en) * 2021-08-24 2021-12-17 西安理工大学 Multi-mode feature fusion video description text generation method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630465A (en) * 2023-07-24 2023-08-22 海信集团控股股份有限公司 Model training and image generating method and device
CN116630465B (en) * 2023-07-24 2023-10-24 海信集团控股股份有限公司 Model training and image generating method and device
CN116843030A (en) * 2023-09-01 2023-10-03 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116912629B (en) * 2023-09-04 2023-12-29 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116912629A (en) * 2023-09-04 2023-10-20 小舟科技有限公司 General image text description generation method and related device based on multi-task learning
CN116935418B (en) * 2023-09-15 2023-12-05 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN117593392A (en) * 2023-09-27 2024-02-23 书行科技(北京)有限公司 Image generation method, device, computer equipment and computer readable storage medium
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117557883A (en) * 2024-01-12 2024-02-13 中国科学技术大学 Medical multi-mode content analysis and generation method based on pathology alignment diffusion network
CN117829098A (en) * 2024-03-06 2024-04-05 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment
CN117829098B (en) * 2024-03-06 2024-05-28 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment

Also Published As

Publication number Publication date
CN114387430A (en) 2022-04-22
CN114387430B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
WO2023134073A1 (en) Artificial intelligence-based image description generation method and apparatus, device, and medium
RU2691214C1 (en) Text recognition using artificial intelligence
Kang et al. Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition
CN109524006B (en) Chinese mandarin lip language identification method based on deep learning
US10354168B2 (en) Systems and methods for recognizing characters in digitized documents
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
US20240013005A1 (en) Method and system for identifying citations within regulatory content
CN112541355B (en) Entity boundary type decoupling few-sample named entity recognition method and system
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN111160348A (en) Text recognition method for natural scene, storage device and computer equipment
Feng et al. Focal CTC loss for chinese optical character recognition on unbalanced datasets.
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN113486669A (en) Semantic recognition method for emergency rescue input voice
CN111914654A (en) Text layout analysis method, device, equipment and medium
CN115546506A (en) Image identification method and system based on double-pooling channel attention and cavity convolution
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN116229482A (en) Visual multi-mode character detection recognition and error correction method in network public opinion analysis
CN114445808A (en) Swin transform-based handwritten character recognition method and system
CN113159053A (en) Image recognition method and device and computing equipment
Huang et al. Attention after attention: Reading text in the wild with cross attention
CN115186670B (en) Method and system for identifying domain named entities based on active learning
Hoxha et al. Retrieving images with generated textual descriptions
Li et al. Deep neural network with attention model for scene text recognition
CN115019319A (en) Structured picture content identification method based on dynamic feature extraction

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22919714

Country of ref document: EP

Kind code of ref document: A1