WO2023178801A1 - Image description method and apparatus, computer device, and storage medium - Google Patents

Image description method and apparatus, computer device, and storage medium

Info

Publication number
WO2023178801A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
word
topic
generation model
vector
Prior art date
Application number
PCT/CN2022/090723
Other languages
English (en)
French (fr)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023178801A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular, to an image description method and device, computer equipment, and storage media.
  • image description technology has become a solution for understanding image content.
  • image description technology is used to enable the computer to understand the content of the image and generate corresponding description text.
  • target detection is generally performed on the original image and the corresponding description text is generated.
  • an image description method, which includes:
  • the regional feature vector is extracted and processed through a topic generation model to obtain topic data; wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
  • Each description word is spliced according to the time status information to obtain a target description text; wherein the target description text is used to describe the original image.
  • an image description device, which includes:
  • an image acquisition module, used to acquire the original image;
  • a first feature extraction module, used to extract features from the original image to obtain image features;
  • an area detection module, used to perform area detection on the original image according to the image features to obtain a target area image;
  • a second feature extraction module, used to extract features from the target area image to obtain a regional feature vector;
  • a data extraction module, used to extract and process the regional feature vector through a topic generation model to obtain topic data; wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
  • a word prediction module, used to perform word prediction on the topic data through the word generation model to obtain description words;
  • a word splicing module, used to splice each description word according to the time status information to obtain a target description text; wherein the target description text is used to describe the original image.
  • embodiments of the present application provide a computer device.
  • the computer device includes a memory and a processor, wherein a computer program is stored in the memory.
  • when the computer program is executed by the processor, the processor is configured to execute an image description method, wherein the image description method includes:
  • the regional feature vector is extracted and processed through a topic generation model to obtain topic data; wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
  • Each description word is spliced according to the time status information to obtain a target description text; wherein the target description text is used to describe the original image.
  • embodiments of the present application provide a storage medium.
  • the storage medium is a computer-readable storage medium.
  • the storage medium stores computer-executable instructions.
  • the computer-executable instructions are used to cause a computer to execute an image description method, wherein the image description method includes:
  • the regional feature vector is extracted and processed through a topic generation model to obtain topic data; wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
  • Each description word is spliced according to the time status information to obtain a target description text; wherein the target description text is used to describe the original image.
  • the image description method and device, computer equipment, and storage media proposed in the embodiments of this application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of region detection on the original image, the topic generation model and the word generation model are used in turn to generate the target description text hierarchically, which yields description text with coherent semantics.
  • Figure 1 is a flow chart of an image description method provided by an embodiment of the present application.
  • FIG. 2 is a flow chart of step S400 in Figure 1;
  • FIG. 3 is a flow chart of step S410 in Figure 2;
  • FIG. 4 is a flow chart of step S412 in Figure 3;
  • FIG. 5 is a flow chart of step S500 in Figure 1;
  • Figure 6 is a flow chart of step S600 in Figure 1;
  • FIG. 7 is a flow chart of step S640 in Figure 6;
  • Figure 8 is a module structure block diagram of an image description device provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present application.
  • Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Image Caption: a comprehensive problem that integrates computer vision, natural language processing, and machine learning. It is similar to translating a picture into a description text. This task is very easy for humans but very challenging for machines: it requires a model not only to understand the content of the picture, but also to express the relationships between the objects in it in natural language.
  • Feature extraction: in machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measurement data and builds derived values (features) designed to be informative and non-redundant, thereby facilitating the subsequent learning and generalization steps and, in some cases, leading to better interpretability. Feature extraction is related to dimensionality reduction. The quality of the features has a crucial impact on generalization ability.
  • Feature map: the result of convolving the input image with a neural network. It represents a feature in the neural space, and its resolution depends on the stride of the preceding convolution kernel.
  • CNN: Convolutional Neural Network.
  • FNN: Feedforward Neural Network.
  • LSTM: Long Short-Term Memory network.
  • RNN: Recurrent Neural Network.
  • Pooling: an important concept in convolutional neural networks; it is a form of downsampling. There are many different forms of nonlinear pooling functions, among which "max pooling" is the most common. It divides the input image into several rectangular areas and outputs the maximum value of each sub-area. Intuitively, this mechanism works because once a feature is discovered, its precise location is less important than its position relative to other features. The pooling layer continuously reduces the spatial size of the data, so the number of parameters and the amount of calculation also decrease, which controls overfitting to a certain extent. Generally speaking, pooling layers are periodically inserted between the convolutional layers of a CNN. Pooling layers typically act on each input feature separately and reduce its size.
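  • As an illustration of the max pooling described above, the following is a minimal sketch in plain Python. The function name `max_pool_2d`, the 2×2 window, and the stride of 2 are assumptions chosen for the example and are not taken from the application.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over one channel given as a list of rows.

    The window size and stride are illustrative defaults, not values
    specified in the application.
    """
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values of one rectangular sub-area.
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            # Keep only the maximum value of that sub-area.
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


example = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled_example = max_pool_2d(example)  # 4x4 input becomes 2x2 output
```

  • With these defaults, the width and height are each halved while the strongest response in every sub-area is kept.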
  • VGG: Visual Geometry Group.
  • VGG is one of the CNN models, that is, a deep convolutional neural network built from a series of small 3×3 convolution kernels and pooling layers.
  • VGG uses 3×3 convolutional layers and pooling layers to extract features, and uses three fully connected layers at the end of the network, taking the output of the last fully connected layer as the classification prediction.
  • each convolutional layer uses ReLU as the activation function, and dropout is added after the fully connected layers to suppress overfitting.
  • Region Proposal Network (RPN): its main function is to generate region proposals; region proposals can be regarded as many potential bounding boxes (also called anchors, which are rectangular boxes defined by four coordinates).
  • Faster R-CNN: a region-based convolutional neural network, which can be simply regarded as a region proposal network combined with Fast R-CNN, in which the Region Proposal Network (RPN) replaces the selective search method used in Fast R-CNN.
  • RPN: Region Proposal Network.
  • Feature mapping: also called dimensionality reduction, it is the process of mapping the feature vectors of high-dimensional multimedia data to a one-dimensional or low-dimensional space.
  • GRU: Gated Recurrent Unit.
  • RNN: Recurrent Neural Network.
  • GRU is a gating mechanism in recurrent neural networks (RNN). Similar to other gating mechanisms, it aims to solve the vanishing/exploding gradient problem of the standard RNN while preserving the long-term information of the sequence. GRU performs as well as LSTM on many sequence tasks such as speech recognition, but it has fewer parameters than LSTM and contains only a reset gate and an update gate.
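  • To make the two gates concrete, the following is a minimal scalar sketch of one GRU step in plain Python. It uses the commonly published GRU equations; the parameter names (`wz`, `uz`, `bz`, and so on) are hypothetical, and the convention of blending the previous state with weight `1 - z` is one of two equivalent conventions found in the literature. It is not the word generation model of the application.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gru_step(x, h_prev, params):
    """One GRU step on scalar input x and scalar previous state h_prev.

    params is a dict of hypothetical scalar weights:
    wz, uz, bz (update gate), wr, ur, br (reset gate), wh, uh, bh (candidate).
    """
    # Update gate: how much of the new candidate state to let in.
    z = sigmoid(params["wz"] * x + params["uz"] * h_prev + params["bz"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(params["wr"] * x + params["ur"] * h_prev + params["br"])
    # Candidate state computed from the input and the reset-scaled history.
    h_candidate = math.tanh(
        params["wh"] * x + params["uh"] * (r * h_prev) + params["bh"]
    )
    # Blend the previous state and the candidate state.
    return (1.0 - z) * h_prev + z * h_candidate


zero_params = {name: 0.0 for name in
               ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
```

  • With all weights set to zero, both gates evaluate to 0.5 and the candidate state is 0, so the new state is simply half of the previous state.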
  • Bilinear interpolation: its core idea is to perform linear interpolation in two directions in turn.
  • Bilinear interpolation, as an interpolation algorithm in numerical analysis, is widely used in signal processing, digital image processing, video processing, and similar fields.
  • Activation function: activation functions play a very important role in enabling artificial neural network models to learn and represent very complex, nonlinear functions; they introduce nonlinear characteristics into the network. In a neuron, the inputs are weighted and summed, and a function is then applied to the result; this function is the activation function. The activation function is introduced to increase the nonlinearity of the neural network model.
  • Fully connected layer: each node is connected to all nodes in the previous layer, and the layer is used to combine the previously extracted features. Because of this full connectivity, the fully connected layer generally has the most parameters.
  • Hidden layer: in a neural network, all layers except the input layer and the output layer are called hidden layers.
  • the hidden layer does not directly receive signals from the outside world, nor does it directly send signals to the outside world.
  • the purpose of a single hidden layer is to map the features of the input data into another dimensional space in order to expose more abstract features that can be linearly separated more easily.
  • the purpose of multiple hidden layers is multi-level abstraction of the input features, and the ultimate goal is to linearly separate different types of data more easily.
  • Cross entropy: mainly used to measure the difference between two probability distributions.
  • the performance of language models is usually measured by cross-entropy and perplexity.
  • the cross-entropy reflects how difficult the text is for the model to recognize or, from a compression perspective, how many bits are needed on average to encode each word.
  • the perplexity represents the average number of branches the model assigns to the text, and its reciprocal can be regarded as the average probability of each word.
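  • The relationship between the two measures can be sketched in a few lines of Python. The sketch assumes the standard definitions (cross-entropy in bits per word, perplexity as 2 raised to the cross-entropy); the function names and the probability values are illustrative only.

```python
import math


def cross_entropy_bits(word_probabilities):
    """Average number of bits per word: -(1/N) * sum(log2 p_i).

    word_probabilities holds the probability the model assigned to each
    word that actually occurred in the text.
    """
    total = sum(math.log2(p) for p in word_probabilities)
    return -total / len(word_probabilities)


def perplexity(word_probabilities):
    """Perplexity is 2 raised to the per-word cross-entropy."""
    return 2.0 ** cross_entropy_bits(word_probabilities)


# Illustrative probabilities for a two-word text.
probs = [0.5, 0.125]
bits = cross_entropy_bits(probs)       # (1 + 3) / 2 bits per word
branches = perplexity(probs)           # average branching factor
mean_probability = 1.0 / branches      # geometric mean of the probabilities
```

  • The reciprocal of the perplexity equals the geometric mean of the per-word probabilities, which is the sense in which it can be read as an average word probability.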
  • Smoothing: assigning a probability value to unobserved N-gram combinations to ensure that any word sequence can always obtain a probability value from the language model.
  • Cross-entropy loss function: a smooth function whose essence is the application of the cross-entropy of information theory to classification problems.
  • examples of classifiers that use the cross-entropy loss function include logistic regression, artificial neural networks, and support vector machines with probabilistic output.
  • Gradient descent: a type of iterative method that can be used to solve least squares problems (both linear and nonlinear).
  • when solving for the model parameters of machine learning algorithms, gradient descent is one of the most commonly used methods.
  • another commonly used method is the least squares method.
  • when solving for the minimum of a loss function, the gradient descent method can be used to iterate step by step to obtain the minimized loss function and the corresponding model parameter values.
  • conversely, when the maximum of a function is required, the gradient ascent method is used to iterate.
  • two variants have been developed on the basis of the basic gradient descent method, namely the stochastic gradient descent method and the batch gradient descent method.
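  • As a small worked example of the batch variant, the sketch below fits the slope of a line through the origin by minimizing a mean squared error. The function name `fit_slope`, the data, the learning rate, and the step count are assumptions made for illustration.

```python
def fit_slope(xs, ys, learning_rate=0.05, steps=200):
    """Batch gradient descent for the model y = w * x.

    Minimizes the mean squared error (1/N) * sum((w * x - y) ** 2).
    """
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the loss with respect to w, using every sample.
        gradient = 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move against the gradient to reduce the loss.
        w -= learning_rate * gradient
    return w


slope = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

  • A stochastic variant would compute the gradient from one randomly chosen sample per step instead of the whole data set, and gradient ascent would add the gradient instead of subtracting it.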
  • image description technology has become a solution for understanding image content.
  • image description technology is used to enable the computer to understand the content of the image and generate corresponding description text.
  • target detection is generally performed on the original image and the corresponding description text is generated directly based on the target detection results.
  • this will cause the generated description text to be scattered, which will lead to the semantics of the description text being inconsistent.
  • based on this, embodiments of the present application propose an image description method and device, computer equipment, and storage media, which proceed as follows: acquire the original image; perform feature extraction on the original image to obtain image features; perform area detection on the original image according to the image features to obtain a target area image; perform feature extraction on the target area image to obtain a regional feature vector; extract and process the regional feature vector through a topic generation model to obtain topic data, where the topic data includes a topic word vector and the time status information corresponding to the topic word vector; perform word prediction on the topic data through a word generation model to obtain description words; and splice the description words according to the time status information to obtain the target description text, which is used to describe the original image.
  • the embodiment of the present application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of region detection on the original image, the topic generation model and the word generation model are used in turn to generate the target description text hierarchically, which yields description text with coherent semantics.
  • Embodiments of the present application provide image description methods and devices, computer equipment, and storage media, which are specifically described through the following embodiments. First, the image description method in the embodiment of the present application is described.
  • Artificial Intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the image description method provided by the embodiments of this application relates to the field of artificial intelligence.
  • the image description method provided by the embodiment of the present application can be applied in a terminal or a server, or can be software running in a terminal or a server.
  • the terminal can be a smartphone, a tablet, a laptop, a desktop computer, a smart watch, etc.
  • the server can be configured as an independent physical server, or as a server cluster or distributed server composed of multiple physical servers.
  • the server can also be configured as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the software can be an application that implements the image description method, but is not limited to the above forms.
  • Embodiments of the present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
  • the image description method according to the first aspect of the embodiment of the present application includes, but is not limited to, steps S100 to S700.
  • Step S100: obtain the original image;
  • Step S200: perform feature extraction on the original image to obtain image features;
  • Step S300: perform area detection on the original image according to the image features to obtain the target area image;
  • Step S400: perform feature extraction on the target area image to obtain a regional feature vector;
  • Step S500: extract and process the regional feature vector through the topic generation model to obtain topic data;
  • Step S600: perform word prediction on the topic data through the word generation model to obtain description words;
  • Step S700: perform splicing processing on each description word according to the time status information to obtain the target description text.
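  • The data flow of steps S100 to S700 can be sketched as a chain of function calls. In the sketch below, every callable (`extract_features`, `detect_regions`, `encode_region`, `generate_topics`, `predict_words`) is a hypothetical placeholder standing in for the corresponding model; the stubs in the demo return fixed values and do not perform any real image processing.

```python
from typing import Callable


def describe_image(
    original_image,                 # S100: the acquired original image
    extract_features: Callable,
    detect_regions: Callable,
    encode_region: Callable,
    generate_topics: Callable,
    predict_words: Callable,
) -> str:
    """Chain the seven steps; all callables are hypothetical placeholders."""
    image_features = extract_features(original_image)                 # S200
    region_images = detect_regions(original_image, image_features)    # S300
    region_vectors = [encode_region(r) for r in region_images]        # S400
    # S500: each item pairs a topic word vector with its time status.
    topic_data = generate_topics(region_vectors)
    sentences = []
    for topic_vector, time_status in topic_data:
        words = predict_words(topic_vector, time_status)              # S600
        sentences.append(" ".join(words))                             # S700
    return " ".join(sentence + "." for sentence in sentences)


# Demo with stub callables that return fixed values.
stub_words = {
    "topic-1": ["There", "is", "a", "competition"],
    "topic-2": ["Three", "women", "are", "playing", "field", "hockey"],
}
demo_text = describe_image(
    "raw-image",
    extract_features=lambda image: "image-features",
    detect_regions=lambda image, features: ["region-a", "region-b"],
    encode_region=lambda region: [float(len(region))],
    generate_topics=lambda vectors: [("topic-1", "h1"), ("topic-2", "h2")],
    predict_words=lambda topic, status: stub_words[topic],
)
```

  • Passing the models in as callables keeps the sketch independent of any particular network implementation; one sentence is produced per topic, and the sentences are then joined into the description text.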
  • In step S100 of some embodiments, an original image is obtained.
  • the original image refers to an image that requires text description.
  • the purpose of the embodiments of this application is to translate the content of the original image into text, that is, target description text.
  • In step S200 of some embodiments, feature extraction is performed on the original image to obtain image features.
  • an encoder, such as the VGG-16 network of the VGG model, can be used to extract features from the original image.
  • for an original image of size W×H×C, inputting it into the VGG-16 network for feature extraction yields the feature image output by the model, whose width and height are W′ and H′ respectively. Here W, H, and C are the image dimensions: W represents the width, H represents the height, and C represents the number of channels; W′ is the width after passing through the VGG-16 network, and H′ is the height after passing through the VGG-16 network.
  • the specific network layers and the number of network layers in the VGG-16 network can be set according to actual needs.
  • for example, the VGG-16 model in the embodiment of the present application can be set to 13 convolutional layers, 3 fully connected layers, and 5 pooling layers.
  • the embodiment of the present application may also consider removing the last pooling layer.
  • the role of the pooling layer is to reduce the size of the parameter matrix. For example, the width and height of the parameter matrix can be reduced to half of the original size, and the number of network channels does not change.
  • adding a pooling layer to a neural network can filter out some unimportant information during the image compression process.
  • therefore, removing one pooling layer may be considered in order to retain more image features and improve the accuracy of region detection.
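  • The effect of removing a pooling layer on the output size W′×H′ can be checked with simple arithmetic. The sketch below assumes that the convolutional layers preserve the spatial size and that each pooling layer halves the width and height (rounding down); the 224×224 input size and the function name `feature_map_size` are illustrative assumptions.

```python
def feature_map_size(width, height, num_pooling_layers=5):
    """Spatial size after a stack of size-halving pooling layers.

    Assumes the convolutions keep the size unchanged, so only the
    pooling layers shrink the feature map.
    """
    for _ in range(num_pooling_layers):
        width //= 2
        height //= 2
    return width, height


with_all_pooling = feature_map_size(224, 224, num_pooling_layers=5)
without_last_pooling = feature_map_size(224, 224, num_pooling_layers=4)
```

  • Under these assumptions, dropping the last pooling layer doubles the width and height of the output feature map, which leaves more spatial detail available for region detection.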
  • In step S300 of some embodiments, region detection is performed on the original image according to the image features to obtain the target region image.
  • the image features need to be input into the RPN for region detection.
  • the model will automatically divide at least one region based on the image features.
  • In step S400 of some embodiments, feature extraction is performed on the target region image to obtain a regional feature vector, which is used to generate the target description text.
  • In step S500 of some embodiments, the regional feature vector is extracted and processed through the topic generation model to obtain topic data.
  • the topic data generated by the topic generation model determines how many sentences, that is, how many topics, are included in the generated image description; the topic data includes the topic word vector and the time status information corresponding to the topic word vector.
  • the topic word vector and the time status information are used for subsequent word prediction and generation of words corresponding to the sentence.
  • In step S600 of some embodiments, word prediction is performed on the topic data through a word generation model to obtain multiple description words corresponding to each sentence.
  • In step S700 of some embodiments, the description words in each sentence are spliced according to the time status information to obtain a generated topic sentence, and multiple topic sentences are spliced to form the target description text, where the target description text is used to describe the original image.
  • for example, using topic vector 1 for word prediction can generate the description words "There", "is", "a", and "competition"; splicing these description words generates topic sentence 1, that is, "There is a competition". Using topic vector 2 for word prediction can generate the description words "Three", "women", "are", "playing", "field", and "hockey"; splicing these description words generates topic sentence 2, that is, "Three women are playing field hockey".
  • using topic vector 3 for word prediction can generate the description words "The", "one", "in", "red", "is", "hitting", "the", and "ball"; splicing these description words generates topic sentence 3, that is, "The one in red is hitting the ball". Using topic vector 4 for word prediction can generate the description words "The", "other", "two", "in", "white", "are", and "defending"; splicing these description words generates topic sentence 4, that is, "The other two in white are defending". Topic sentence 1, topic sentence 2, topic sentence 3, and topic sentence 4 are then spliced to form the final target description text.
  • the embodiment of this application obtains the original image; performs feature extraction on the original image to obtain image features; performs area detection on the original image according to the image features to obtain the target area image; performs feature extraction on the target area image to obtain the regional feature vector; extracts and processes the regional feature vector through the topic generation model to obtain topic data, where the topic data includes the topic word vector and the time status information corresponding to the topic word vector; performs word prediction on the topic data through the word generation model to obtain description words; and splices the description words according to the time status information to obtain the target description text, which is used to describe the original image.
  • the embodiment of the present application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of region detection on the original image, the topic generation model and the word generation model are used in turn to generate the target description text hierarchically, which yields description text with coherent semantics.
  • step S400 specifically includes but is not limited to step S410, step S420 and step S430.
  • Step S410: preprocess the target area image to obtain a preliminary area image;
  • Step S420: perform convolution processing on the preliminary area image to obtain a convolution feature vector;
  • Step S430: perform pooling processing on the convolution feature vector to obtain a regional feature vector.
  • In step S410 of some embodiments, the target area image is preprocessed to obtain a preliminary area image that satisfies preset conditions, where the preset conditions may refer to a specific image size and the like.
  • In step S420 of some embodiments, the preliminary area image is subjected to convolution processing to obtain a convolution feature vector.
  • the image features of the preliminary area image can be input to a certain number of fully connected layers, such as two fully connected layers.
  • the image features of each preliminary area image are processed by the fully connected layers and compressed into a D-dimensional vector. It should be noted that the dimension of the feature vector is consistent with the number of target regions or the number of preliminary area images.
  • In step S430 of some embodiments, the convolution feature vector is pooled to obtain a regional feature vector.
  • the D-dimensional convolution feature vector is input to the pooling layer for average pooling or maximum pooling, which retains the relevant information of the feature region, to obtain the regional feature vector.
  • $v_p$ is the regional feature vector obtained by pooling.
  • P represents the number of convolution feature vectors and also represents the dimension of the regional feature vector.
  • step S410 specifically includes but is not limited to step S411 and step S412.
  • Step S411: perform feature mapping on the target area image to obtain a preliminary mapping image;
  • Step S412: perform size transformation on the preliminary mapping image according to the preset size to obtain a preliminary area image.
  • In step S411 of some embodiments, after performing area detection on the original image, it is also necessary to perform feature mapping on the target area image to obtain a preliminary mapping image, that is, a feature map corresponding to the target area image.
  • In step S412 of some embodiments, the size of the preliminary mapping image is adjusted to a preset size to obtain the preliminary area image; for example, the size of the preliminary mapping image is adjusted to X×Y, where X is the preset length and Y is the preset width.
  • step S412 specifically includes but is not limited to step S4121, step S4122 and step S4123.
  • Step S4121: obtain the first coordinates of the target area image;
  • Step S4122: calculate the second coordinates based on the first coordinates and the preset size;
  • Step S4123: adjust the size of the preliminary mapping image according to the second coordinates to obtain a preliminary area image.
  • In step S4121 of some embodiments, the first coordinates of the target area image are obtained. Specifically, assuming that the target area image is $I'$, the first coordinates refer to the coordinates $(x'_{i,j}, y'_{i,j})$.
  • In step S4122 of some embodiments, according to the first coordinates and the preset size, the back-projection coordinates of the preliminary mapping image onto the target area image, that is, the second coordinates, are calculated.
  • the coordinate value of any point $(x''_{i,j}, y''_{i,j})$ in the preliminary mapping image $I''$, projected into the target area image $I'$, is calculated first.
  • the bilinear interpolation method is then used to calculate the pixel at the coordinate point $(x'_{i,j}, y'_{i,j})$ in $I'$, which gives the pixel $I''_{c,i,j}$ of the corresponding point in $I''$, that is, the second coordinates.
  • in calculation formula (2), the interpolation kernel is $k(d) = \max(0, 1 - |d|)$, where $|d|$ represents the distance between two points.
  • In step S4123 of some embodiments, the size of the preliminary mapping image is adjusted according to the second coordinates to obtain a preliminary area image.
  • the preliminary mapping image is scaled according to the second coordinates, for example, by moving the coordinates of a certain point of the image to the position of the second coordinates, or by adjusting the coordinates of the image with reference to the position of the second coordinates.
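  • The resizing in steps S4121 to S4123 can be illustrated with a short bilinear sampling sketch in plain Python. It assumes the triangular kernel k(d) = max(0, 1 - |d|) and the widely used weighted-sum form of bilinear sampling; the linear mapping from output grid positions back to input coordinates, and the names `bilinear_sample` and `resize_channel`, are assumptions made for this example because the projection formula is not reproduced in the text above.

```python
def kernel(d):
    """Triangular bilinear kernel k(d) = max(0, 1 - |d|)."""
    return max(0.0, 1.0 - abs(d))


def bilinear_sample(channel, x, y):
    """Sample one channel (a list of rows) at real-valued (x, y).

    x runs along columns and y along rows. Every pixel contributes with
    weight k(col - x) * k(row - y), so only the four nearest pixels
    have a nonzero weight.
    """
    total = 0.0
    for row, values in enumerate(channel):
        for col, value in enumerate(values):
            total += value * kernel(col - x) * kernel(row - y)
    return total


def resize_channel(channel, out_width, out_height):
    """Resize to a preset size by back-projecting each output point.

    The back-projection is an assumed linear map from the output grid
    onto the input grid (corner pixels map to corner pixels).
    """
    in_height, in_width = len(channel), len(channel[0])
    resized = []
    for j in range(out_height):
        y = j * (in_height - 1) / (out_height - 1) if out_height > 1 else 0.0
        row_values = []
        for i in range(out_width):
            x = i * (in_width - 1) / (out_width - 1) if out_width > 1 else 0.0
            row_values.append(bilinear_sample(channel, x, y))
        resized.append(row_values)
    return resized


small = [[0.0, 10.0], [20.0, 30.0]]
resized_small = resize_channel(small, 3, 3)
```

  • In this example the corner values of the 2×2 input are preserved and the new points in between are weighted averages of their neighbours.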
  • step S500 specifically includes but is not limited to step S510, step S520 and step S530.
  • Step S510: input the regional feature vector to the topic generation model; wherein the topic generation model includes a recurrent layer and a hidden layer;
  • Step S520: perform loop iterative processing on the regional feature vector through the recurrent layer to obtain the topic word vector;
  • Step S530: obtain the time status information from the hidden layer according to the topic word vector.
  • In step S510 of some embodiments, the regional feature vector is input to the topic generation model; the topic generation model includes a recurrent layer and a hidden layer, where the recurrent layer is used to generate topic word vectors and the hidden layer is used to output the time status information. In practical applications, the topic generation model in the embodiment of the present application may adopt a single-layer LSTM model.
  • In step S520 of some embodiments, the regional feature vector is processed iteratively through the recurrent layer of the topic generation model to generate topic word vectors; the topic word vectors generated by the LSTM model determine the number of sentences contained in the final image description, that is, the number of topics.
  • In step S530 of some embodiments, the time status information is obtained from the hidden layer of the LSTM according to the topic word vector, where the time status information refers to the hidden state of the hidden layer in the LSTM model at the moment the corresponding topic word vector is generated.
  • H denotes the hidden layer size.
  • step S600 specifically includes but is not limited to step S610, step S620, step S630 and step S640.
  • Step S610 obtain the activation function of the word generation model
  • Step S620 input the topic word vector and time status information into the word generation model
  • Step S630 Calculate the topic word vector and time status information according to the activation function to obtain at least one candidate word
  • Step S640 Obtain description words from candidate words.
  • step S610 of some embodiments the activation function of the word generation model is obtained, where the word generation model is used to generate descriptive words of the corresponding topic sentence according to the topic data.
  • the word generation module can use a two-layer GRU model.
  • in step S620 of some embodiments, the topic word vector and the time status information are used as inputs to the word generation model.
  • in step S630 of some embodiments, the topic word vector and the time status information are calculated according to the activation function to obtain at least one candidate word.
  • the initial input of the word generation model is the topic word vector output by the topic generation model at that moment, and the hidden state of the hidden layer in the word generation model at each moment is also input into the word generation model. The process of calculating candidate words is given by formula (3), formula (4) and formula (5), in which:
  • $x_{-1}$ represents the P-dimensional topic word vector generated by the sentence-level LSTM, which is used as the initial input of the GRU model;
  • $S_t$ represents the candidate word generated by the GRU model;
  • $S_0$ is the start mark, and the formulas also take the hidden state of the hidden layer in the topic generation model at the previous moment;
  • $p_{t+1}$ represents the distribution probability of the (t+1)-th candidate word over the entire preset word set.
  • a descriptive word is selected from the candidate words for generating a topic sentence.
  • step S640 specifically includes but is not limited to step S641 and step S642.
  • Step S641 calculate the distribution probability of each candidate word in the preset word set
  • Step S642 Obtain the candidate word with the largest distribution probability as the description word.
  • in step S641 of some embodiments, the distribution probability of each candidate word in the preset word set is calculated. Specifically, it can be calculated through formula (6): $p_{t+1} = p(S_{t+1} \mid I, S_0, \ldots, S_t)$,
  • where $p_{t+1}$ represents the distribution probability of the (t+1)-th candidate word over the entire preset word set, and
  • $I$ represents the preset word set.
  • in step S642 of some embodiments, the candidate word with the largest distribution probability is obtained as the description word. Specifically, the candidate word with the highest distribution probability is selected as the t-th word of the corresponding sentence; when the hidden state of the most probable candidate word corresponds to an end flag, generation of the sentence ends and the iteration terminates. In addition, after each GRU model has generated the description words for its topic sentence, the topic sentences are connected according to the time status information to form the overall image description paragraph.
  • the topic generation model and the word generation model can form a general image description model.
  • the loss function of the entire image description model is the weighted sum of the losses of the two levels of the hierarchical recurrent network, that is, of the topic generation model and the word generation model.
  • the gradient descent algorithm can be used to update the model parameters of the topic generation model based on the calculated loss value, so as to obtain a trained topic generation model and further improve the accuracy of image description.
  • the image description method proposed in the embodiment of this application obtains the original image; performs feature extraction on the original image to obtain image features; performs area detection on the original image according to the image features to obtain the target area image; performs feature extraction on the target area image to obtain Regional feature vectors; extract and process the regional feature vectors through the topic generation model to obtain topic data; among which, the topic data includes topic word vectors and the time status information of the corresponding topic word vectors; perform word prediction on the topic data through the word generation model, and obtain Description words; each description word is spliced according to the time status information to obtain the target description text; among which, the target description text is used to describe the original image.
  • the embodiment of the present application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.
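  • in some embodiments, an image description device 800 is also provided, which can implement the above image description method.
  • the image description device 800 includes: an image acquisition module 810, a first feature extraction module 820, a region detection module 830, a second feature extraction module 840, a data extraction module 850, a word prediction module 860 and a word splicing module 870.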
  • the image description device 800 in the embodiment of the present application is used to execute the image description method in the above embodiment.
  • the specific processing process is the same as the image description method in the above embodiment, and will not be described again here.
  • the image description device 800 proposed in the embodiment of this application obtains the original image; performs feature extraction on the original image to obtain image features; performs area detection on the original image according to the image features to obtain a target area image; performs feature extraction on the target area image to obtain regional feature vectors; extracts and processes the regional feature vectors through the topic generation model to obtain topic data, where the topic data includes topic word vectors and the time status information of the corresponding topic word vectors; performs word prediction on the topic data through the word generation model to obtain description words; and splices each description word according to the time status information to obtain the target description text, where the target description text is used to describe the original image.
  • the embodiment of the present application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.
  • An embodiment of the present application also provides a computer device, including:
  • at least one processor; and
  • a memory communicatively connected to at least one processor; wherein,
  • the memory stores instructions, and the instructions are executed by at least one processor, so that when at least one processor executes the instructions, an image description method is implemented, wherein the image description method includes:
  • the regional feature vectors are extracted and processed through the topic generation model to obtain topic data; among them, the topic data includes topic word vectors and the time status information corresponding to the topic word vectors;
  • Each description word is spliced according to the time status information to obtain the target description text; among which, the target description text is used to describe the original image.
  • the computer device includes: a processor 910, a memory 920, an input/output interface 930, a communication interface 940, and a bus 950.
  • the processor 910 can be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by the embodiments of this application;
  • the memory 920 can be implemented in the form of read-only memory (Read Only Memory, ROM), static storage device, dynamic storage device, or random access memory (Random Access Memory, RAM).
  • the memory 920 can store operating systems and other application programs.
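  • when the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 920 and called by the processor 910 to execute the image description method of the embodiments of this application;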
  • Input/output interface 930 used to implement information input and output
  • Communication interface 940 is used to realize communication interaction between this device and other devices. Communication can be achieved through wired methods (such as USB, network cables, etc.) or wireless methods (such as mobile network, WIFI, Bluetooth, etc.); and
  • Bus 950 which transmits information between various components of the device (such as processor 910, memory 920, input/output interface 930, and communication interface 940);
  • the processor 910, the memory 920, the input/output interface 930 and the communication interface 940 implement communication connections between each other within the device through the bus 950.
  • Embodiments of the present application also provide a storage medium.
  • the storage medium is a computer-readable storage medium.
  • the computer-readable storage medium stores computer-executable instructions.
  • the computer-executable instructions are used to cause the computer to execute an image description method.
  • the image description method includes:
  • the regional feature vectors are extracted and processed through the topic generation model to obtain topic data; among them, the topic data includes topic word vectors and the time status information corresponding to the topic word vectors;
  • Each description word is spliced according to the time status information to obtain the target description text; among which, the target description text is used to describe the original image.
  • the computer-readable storage medium may be non-volatile or volatile.
  • memory can be used to store non-transitory software programs and non-transitory computer executable programs.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor via a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the image description method and device, computer equipment, and storage medium proposed in the embodiments of this application obtain the original image; perform feature extraction on the original image to obtain image features; perform area detection on the original image according to the image features to obtain the target area image; Features are extracted from the target area image to obtain the regional feature vector; the regional feature vector is extracted and processed through the topic generation model to obtain topic data; among them, the topic data includes the topic word vector and the time status information of the corresponding topic word vector; through the word generation model Perform word prediction on the topic data to obtain the description words; perform splicing processing on each description word according to the time status information to obtain the target description text; among which, the target description text is used to describe the original image.
  • the embodiment of the present application can make the generated target description text contain more image details through multiple feature extractions; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • At least one (item) refers to one or more, and “plurality” refers to two or more.
  • “And/or” is used to describe the relationship between associated objects, indicating that there can be three relationships. For example, “A and/or B” can mean: only A exists, only B exists, or A and B exist simultaneously, where A and B can be singular or plural. The character “/” generally indicates that the related objects are in an “or” relationship. “At least one of the following” or similar expressions refers to any combination of these items, including any combination of a single item or a plurality of items.
  • For example, at least one of a, b or c can mean: a, b, c, “a and b”, “a and c”, “b and c”, or “a and b and c”, where a, b and c can each be single or multiple.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the application.
  • the aforementioned storage media include: U disk, mobile hard disk, Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk or optical disk and other media that can store programs.

Abstract

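Embodiments of the present application provide an image description method and device, a computer device, and a storage medium, belonging to the field of artificial intelligence technology. The method includes: acquiring an original image and performing feature extraction on it to obtain image features; performing area detection on the original image according to the image features to obtain a target area image; performing feature extraction on the target area image to obtain a regional feature vector; performing extraction processing on the regional feature vector through a topic generation model to obtain topic data, the topic data including a topic word vector and time status information; performing word prediction on the topic data through a word generation model to obtain description words; and splicing the description words according to the time status information to obtain target description text that describes the original image. Multiple rounds of feature extraction allow the target description text to contain more image details, and using the topic generation model and the word generation model in sequence generates description text with coherent semantics hierarchically.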

Description

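Image description method and device, computer device, and storage medium
This application claims priority to Chinese patent application No. 202210283244.X, entitled "Image description method and device, computer device, and storage medium", filed with the China Patent Office on March 22, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to an image description method and device, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, image description technology has become a solution for understanding image content. Image description technology is used to make a computer understand the content of an image and generate corresponding description text. At present, the description text is generally generated by performing target detection on the original image.
Technical Problem
The following is a technical problem of the prior art recognized by the inventors: generating description text by performing target detection on the original image leads to description text whose semantics are not sufficiently coherent.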
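Technical Solution
In a first aspect, an embodiment of the present application provides an image description method, the method including:
acquiring an original image;
performing feature extraction on the original image to obtain image features;
performing area detection on the original image according to the image features to obtain a target area image;
performing feature extraction on the target area image to obtain a regional feature vector;
performing extraction processing on the regional feature vector through a topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
performing word prediction on the topic data through a word generation model to obtain description words;
splicing each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.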
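In a second aspect, an embodiment of the present application provides an image description device, the device including:
an image acquisition module, used to acquire an original image;
a first feature extraction module, used to perform feature extraction on the original image to obtain image features;
a region detection module, used to perform area detection on the original image according to the image features to obtain a target area image;
a second feature extraction module, used to perform feature extraction on the target area image to obtain a regional feature vector;
a data extraction module, used to perform extraction processing on the regional feature vector through a topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
a word prediction module, used to perform word prediction on the topic data through a word generation model to obtain description words;
a word splicing module, used to splice each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.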
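In a third aspect, an embodiment of the present application provides a computer device, the computer device including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is used to execute an image description method, wherein the image description method includes:
acquiring an original image;
performing feature extraction on the original image to obtain image features;
performing area detection on the original image according to the image features to obtain a target area image;
performing feature extraction on the target area image to obtain a regional feature vector;
performing extraction processing on the regional feature vector through a topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
performing word prediction on the topic data through a word generation model to obtain description words;
splicing each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.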
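In a fourth aspect, an embodiment of the present application provides a storage medium, the storage medium being a computer-readable storage medium that stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute an image description method, wherein the image description method includes:
acquiring an original image;
performing feature extraction on the original image to obtain image features;
performing area detection on the original image according to the image features to obtain a target area image;
performing feature extraction on the target area image to obtain a regional feature vector;
performing extraction processing on the regional feature vector through a topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
performing word prediction on the topic data through a word generation model to obtain description words;
splicing each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.
Other features and advantages of the embodiments of the present application will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the embodiments of the present application. The objectives and other advantages of the embodiments of the present application can be realized and obtained by the structures particularly pointed out in the description, the claims and the drawings.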
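Beneficial Effects
The image description method and device, computer device, and storage medium proposed in the embodiments of the present application allow the generated target description text to contain more image details through multiple rounds of feature extraction; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.
Description of the Drawings
The drawings are provided for a further understanding of the technical solutions of the embodiments of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the embodiments and do not limit them.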
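Figure 1 is a flowchart of the image description method provided by an embodiment of the present application;
Figure 2 is a flowchart of step S400 in Figure 1;
Figure 3 is a flowchart of step S410 in Figure 2;
Figure 4 is a flowchart of step S412 in Figure 3;
Figure 5 is a flowchart of step S500 in Figure 1;
Figure 6 is a flowchart of step S600 in Figure 1;
Figure 7 is a flowchart of step S640 in Figure 6;
Figure 8 is a block diagram of the module structure of the image description device provided by an embodiment of the present application;
Figure 9 is a schematic diagram of the hardware structure of the computer device provided by an embodiment of the present application.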
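Embodiments of the Invention
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that of the device, or in an order different from that of the flowcharts. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
In addition, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a thorough understanding of the embodiments disclosed in the present application. However, those skilled in the art will recognize that the disclosed technical solutions can be practiced without one or more of the specific details, or with other methods, components, devices, steps and so on. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the present disclosure.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps can be decomposed, and some can be merged or partially merged, so the actual order of execution may change according to the actual situation.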
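First, several terms used in the present application are explained:
Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Image description (Image Caption): a comprehensive problem combining computer vision, natural language processing and machine learning, similar to translating a picture into a piece of descriptive text. The task is very easy for humans but very challenging for machines: it requires a model not only to understand the content of the picture but also to express the relationships between its contents in natural language.
Feature extraction: in machine learning, pattern recognition and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, which facilitates subsequent learning and generalization steps and in some cases leads to better interpretability. Feature extraction is related to dimensionality reduction. The quality of the features has a crucial impact on generalization ability.
Feature map (Feature Map): the result produced by convolving an input image through a neural network. It represents a feature in the neural space, and its resolution depends on the stride of the preceding convolution kernels.
Convolutional Neural Networks (CNN): a class of feedforward neural networks (Feedforward Neural Networks, FNN) that involve convolution computation and have a deep structure; one of the representative algorithms of deep learning.
Long Short-Term Memory (LSTM): a recurrent neural network over time, designed specifically to solve the long-term dependency problem of ordinary recurrent neural networks (Recurrent Neural Network, RNN). All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has only a very simple structure.
Pooling: an important concept in convolutional neural networks; it is in effect a form of downsampling. There are several forms of non-linear pooling function, of which max pooling is the most common. It divides the input image into several rectangular regions and outputs the maximum value of each sub-region. Intuitively, the mechanism works because once a feature has been found, its exact position matters far less than its position relative to other features. The pooling layer steadily reduces the spatial size of the data, so the number of parameters and the amount of computation also fall, which to some extent controls overfitting. Generally, pooling layers are inserted periodically between the convolutional layers of a CNN. A pooling layer usually acts on each input feature separately and reduces its size.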
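VGG (Visual Geometry Group): one of the CNN models, namely a deep convolutional neural network built from a series of small 3×3 convolution kernels and pooling layers. VGG uses 3×3 convolutional layers and pooling layers to extract features and uses three fully connected layers at the end of the network, taking the output of the last fully connected layer as the classification prediction. In VGG, each convolutional layer uses ReLU as the activation function, and dropout is added after the fully connected layers to suppress overfitting.
Region Proposal Network (RPN): its main function is to generate region proposals. Region proposals can be regarded as many potential bounding boxes (also called anchors, which are rectangular boxes containing 4 coordinates).
Faster R-CNN: a region-based convolutional neural network that can be viewed simply as a region proposal network plus Fast R-CNN, in which the Region Proposal Network (RPN) replaces the selective search method of Fast R-CNN. Its steps are roughly: input the whole picture into a CNN to obtain convolutional features; input the convolutional features into the RPN to obtain the feature information of the candidate boxes; use a classifier on the features extracted from each candidate box to determine whether it belongs to a specific class; and, for candidate boxes belonging to a class, use a regressor to further adjust their positions.
Feature mapping: also called dimensionality reduction; the process of mapping the feature vectors of high-dimensional multimedia data to a one-dimensional or low-dimensional space.
Gated Recurrent Unit (GRU): a gating mechanism in recurrent neural networks (RNN). Like other gating mechanisms, it aims to solve the vanishing/exploding gradient problem of standard RNNs while retaining the long-term information of a sequence. GRU performs as well as LSTM on many sequence tasks such as speech recognition, but has fewer parameters than LSTM, containing only a reset gate and an update gate.
Bilinear interpolation: its core idea is to perform one linear interpolation in each of two directions. As an interpolation algorithm in numerical analysis, bilinear interpolation is widely used in signal processing and in digital image and video processing.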
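Activation functions (Activation Functions): activation functions play a very important role in enabling an artificial neural network model to learn and represent very complex, non-linear functions. They introduce non-linear properties into the network: in a neuron, the inputs are weighted and summed and then passed through a function, and that function is the activation function. Activation functions are introduced to increase the non-linearity of the neural network model.
Fully connected layer: a layer in which every node is connected to all nodes of the previous layer, used to combine the features extracted earlier. Because of this full connectivity, fully connected layers generally also have the most parameters.
Hidden layer: in a neural network, all layers other than the input layer and the output layer are called hidden layers. A hidden layer neither receives signals directly from the outside nor sends signals directly to the outside. The purpose of a single hidden layer is to abstract the features of the input data into another dimensional space to reveal more abstract features that can be separated linearly more easily. The purpose of multiple hidden layers is to abstract the input features at multiple levels, with the ultimate aim of separating different types of data linearly more easily.
Cross entropy (Cross Entropy): mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy indicates how difficult the text is for the model to recognize or, from a compression point of view, how many bits are needed on average to encode each word. Perplexity indicates the average number of branches with which the model represents the text; its reciprocal can be regarded as the average probability of each word. Smoothing means assigning a probability value to N-gram combinations that have not been observed, so that a word sequence can always obtain a probability value from the language model.
Cross-entropy loss function: a smooth function; in essence, it is the application of cross entropy from information theory to classification problems. Examples of classifiers that use the cross-entropy loss function include logistic regression, artificial neural networks, and support vector machines with probabilistic outputs.
Gradient descent (Gradient Descent): a kind of iterative method that can be used to solve least squares problems (both linear and non-linear). When solving for the model parameters of a machine learning algorithm, that is, an unconstrained optimization problem, gradient descent is one of the most commonly used methods; another commonly used method is the least squares method. When minimizing a loss function, gradient descent can be used to iterate step by step to obtain the minimized loss function and the model parameter values. Conversely, if the maximum of the loss function is required, gradient ascent is used to iterate. In machine learning, two gradient descent methods have been developed from the basic method: stochastic gradient descent and batch gradient descent.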
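With the development of artificial intelligence technology, image description technology has become a solution for understanding image content. Image description technology is used to make a computer understand the content of an image and generate corresponding description text. At present, target detection is generally performed on the original image and the description text is generated directly from the detection results, but this makes the generated description text rather scattered, so that its semantics are not sufficiently coherent.
On this basis, the embodiments of the present application propose an image description method and device, a computer device, and a storage medium, which acquire an original image; perform feature extraction on the original image to obtain image features; perform area detection on the original image according to the image features to obtain a target area image; perform feature extraction on the target area image to obtain a regional feature vector; perform extraction processing on the regional feature vector through a topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector; perform word prediction on the topic data through a word generation model to obtain description words; and splice each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image. Through multiple rounds of feature extraction, the embodiments of the present application allow the generated target description text to contain more image details; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.
The image description method and device, computer device, and storage medium provided by the embodiments of the present application are described through the following embodiments, beginning with the image description method.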
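The embodiments of the present application can acquire and process relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The image description method provided by the embodiments of the present application relates to the field of artificial intelligence. It can be applied in a terminal or on a server side, or can be software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, tablet computer, notebook computer, desktop computer or smart watch; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application that implements the image description method, but is not limited to the above forms.
The embodiments of the present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multi-processor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.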
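Referring to Figure 1, the image description method according to the first-aspect embodiment of the present application includes, but is not limited to, steps S100 to S700.
Step S100: acquire an original image;
Step S200: perform feature extraction on the original image to obtain image features;
Step S300: perform area detection on the original image according to the image features to obtain a target area image;
Step S400: perform feature extraction on the target area image to obtain a regional feature vector;
Step S500: perform extraction processing on the regional feature vector through the topic generation model to obtain topic data;
Step S600: perform word prediction on the topic data through the word generation model to obtain description words;
Step S700: splice each description word according to the time status information to obtain the target description text.
In step S100 of some embodiments, the original image is acquired. The original image is the image that needs to be described in text; the purpose of the embodiments of the present application is to translate the content of the original image into text, that is, the target description text.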
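In step S200 of some embodiments, feature extraction is performed on the original image to obtain image features. Specifically, an encoder, for example the VGG-16 network of the VGG model family, can be used to extract features from the original image. In practical applications, the original image
Figure PCTCN2022090723-appb-000001
is input into the VGG-16 network for feature extraction to obtain the feature image output by the model, namely
Figure PCTCN2022090723-appb-000002
where
Figure PCTCN2022090723-appb-000003
Here W, H and C are image dimensions: W is the width, H is the height, and C is the number of channels; W′ is the width after the VGG-16 network, and H′ is the height after the VGG-16 network.
Specifically, those skilled in the art can set the specific network layers of the VGG-16 network and their number according to actual needs. For example, the VGG-16 model of the embodiments of the present application can be configured with 13 convolutional layers, 3 fully connected layers and 5 pooling layers.
In addition, the embodiments of the present application may consider removing the last pooling layer from this model. It should be noted that the role of a pooling layer is to reduce the size of the parameter matrix; for example, it can halve both the width and the height of the parameter matrix while leaving the number of channels unchanged. Normally, adding pooling layers to a neural network filters out some unimportant information during image compression, but because the embodiments of the present application need to perform area detection on the original image, one pooling layer can be removed to retain more image features and improve the accuracy of area detection.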
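The effect of dropping the last pooling layer on W′ and H′ can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function name `vgg16_feature_map_size` is hypothetical, and it assumes size-preserving 3×3 convolutions and 2×2 pooling with stride 2 and floor rounding, a common VGG-16 configuration that the text itself does not specify.

```python
def vgg16_feature_map_size(width, height, num_pool_layers=4, pool_stride=2):
    """Spatial size of the feature map after the given number of pooling layers.

    Assumption (not stated in the source): every convolution preserves the
    spatial size, and every pooling layer divides width and height by
    `pool_stride`, rounding down.
    """
    for _ in range(num_pool_layers):
        width //= pool_stride
        height //= pool_stride
    return width, height


# All 5 pooling layers versus the variant with the last pooling layer removed.
full = vgg16_feature_map_size(720, 540, num_pool_layers=5)
trimmed = vgg16_feature_map_size(720, 540, num_pool_layers=4)
print(full, trimmed)
```

Under these assumptions, removing one pooling layer roughly doubles the width and height of the feature map, which is what leaves more spatial detail available for area detection.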
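In step S300 of some embodiments, area detection is performed on the original image according to the image features to obtain the target area image. Specifically, the image features are input into the RPN for area detection, and the model automatically divides out at least one target area according to the image features, together with the target area image corresponding to each target area.
In step S400 of some embodiments, feature extraction is performed on the target area image to obtain a regional feature vector, which is used to generate the target description text.
In step S500 of some embodiments, extraction processing is performed on the regional feature vector through the topic generation model to obtain topic data. The topic sentence data generated by the topic generation model determines how many sentences, that is, how many topics, the generated image description contains. The topic data includes a topic word vector and time status information corresponding to the topic word vector; both are used for the subsequent word prediction that generates the words of each sentence.
In step S600 of some embodiments, word prediction is performed on the topic data through the word generation model to obtain the multiple description words of each sentence.
In step S700 of some embodiments, the description words of each sentence are spliced according to the time status information to obtain the generated topic sentences, and the topic sentences are spliced to form the target description text, wherein the target description text is used to describe the original image.
In practical applications, suppose the topic data generated by the topic generation model includes four topic word vectors, for example topic vector 1, topic vector 2, topic vector 3 and topic vector 4. The word generation model then performs word prediction on each of them. Word prediction on topic vector 1 may generate the description words “There”, “is”, “a” and “competition”, which are spliced into topic sentence 1, “There is a competition”. Word prediction on topic vector 2 may generate “Three”, “women”, “are”, “playing”, “field” and “hockey”, which are spliced into topic sentence 2, “Three women are playing field hockey”. Word prediction on topic vector 3 may generate “The”, “one”, “in”, “red”, “is”, “hitting”, “the” and “ball”, which are spliced into topic sentence 3, “The one in red is hitting the ball”. Word prediction on topic vector 4 may generate “The”, “other”, “two”, “in”, “white”, “are” and “defending”, which are spliced into topic sentence 4, “The other two in white are defending”. Topic sentences 1 to 4 are then spliced to form the final target description text.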
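The splicing in the example above can be sketched in a few lines. The function name `splice_description` is hypothetical, the list order stands in for the order given by the time status information, and ending each sentence with a period is an assumption, since the text does not say how sentences are delimited.

```python
def splice_description(sentences_words, sentence_end="."):
    """Join the words of each topic sentence, then join the sentences.

    `sentences_words` is ordered by topic time step (assumed to reflect the
    time status information); empty sentences are skipped.
    """
    sentences = [" ".join(words) + sentence_end
                 for words in sentences_words if words]
    return " ".join(sentences)


# Description words predicted for topic vectors 1 to 4 in the example.
words_per_topic = [
    ["There", "is", "a", "competition"],
    ["Three", "women", "are", "playing", "field", "hockey"],
    ["The", "one", "in", "red", "is", "hitting", "the", "ball"],
    ["The", "other", "two", "in", "white", "are", "defending"],
]
text = splice_description(words_per_topic)
print(text)
```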
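The embodiments of the present application acquire an original image; perform feature extraction on the original image to obtain image features; perform area detection on the original image according to the image features to obtain a target area image; perform feature extraction on the target area image to obtain a regional feature vector; perform extraction processing on the regional feature vector through the topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector; perform word prediction on the topic data through the word generation model to obtain description words; and splice each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image. Through multiple rounds of feature extraction, the generated target description text can contain more image details; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.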
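In some embodiments, as shown in Figure 2, step S400 specifically includes, but is not limited to, step S410, step S420 and step S430.
Step S410: preprocess the target area image to obtain a preliminary area image;
Step S420: perform convolution processing on the preliminary area image to obtain convolution feature vectors;
Step S430: perform pooling processing on the convolution feature vectors to obtain the regional feature vector.
In step S410 of some embodiments, the target area image is preprocessed to obtain a preliminary area image that satisfies preset conditions, where the preset conditions may refer to a specific image dimension, image size and so on.
In step S420 of some embodiments, convolution processing is performed on the preliminary area image to obtain convolution feature vectors. Specifically, the image features of the preliminary area images can be input into a certain number of fully connected layers, for example two fully connected layers, which convolve the image features of each preliminary area image and compress them into D-dimensional vectors, denoted
Figure PCTCN2022090723-appb-000004
It should be noted that the number of feature vectors is consistent with the number of target areas, or with the number of preliminary area images.
In step S430 of some embodiments, pooling processing is performed on the convolution feature vectors to obtain the regional feature vector. Specifically, the D-dimensional convolution feature vectors are input into a pooling layer for average pooling or max pooling, retaining the relevant information of each feature region, to obtain the regional feature vector.
Specifically, the process of pooling the convolution feature vectors is shown in formula (1):
Figure PCTCN2022090723-appb-000005
where $v_p$ is the regional feature vector obtained by pooling,
Figure PCTCN2022090723-appb-000006
P denotes the number of convolution feature vectors and also the dimension of the regional feature vector.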
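Because formula (1) is only available as an image, its exact form is not reproduced here. The sketch below shows one plausible reading of step S430, element-wise max or average pooling across the per-region vectors; the function name `pool_region_vectors` and the choice to pool across regions are assumptions, not the method as claimed.

```python
def pool_region_vectors(region_vectors, mode="max"):
    """Pool a list of equal-length per-region vectors into a single vector.

    Assumption: pooling is applied element-wise across regions, so the
    output has the same length as each input vector.
    """
    if not region_vectors:
        raise ValueError("at least one region vector is required")
    columns = list(zip(*region_vectors))  # one tuple per vector component
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError("mode must be 'max' or 'mean'")


# Two hypothetical 3-dimensional convolution feature vectors.
regions = [[1.0, 4.0, 0.0], [3.0, 2.0, 6.0]]
pooled_max = pool_region_vectors(regions, mode="max")
pooled_mean = pool_region_vectors(regions, mode="mean")
print(pooled_max, pooled_mean)
```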
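In some embodiments, as shown in Figure 3, step S410 specifically includes, but is not limited to, step S411 and step S412.
Step S411: perform feature mapping on the target area image to obtain a preliminary mapping image;
Step S412: transform the size of the preliminary mapping image according to a preset size to obtain the preliminary area image.
In step S411 of some embodiments, after area detection has been performed on the original image, feature mapping is also performed on the target area image to obtain the preliminary mapping image, that is, the feature map corresponding to the target area image.
In step S412 of some embodiments, the size of the preliminary mapping image is adjusted to the preset size, for example to X×Y, where X is the preset length and Y is the preset width, to obtain the preliminary area image.
In some embodiments, as shown in Figure 4, step S412 specifically includes, but is not limited to, step S4121, step S4122 and step S4123.
Step S4121: obtain the first coordinates of the target area image;
Step S4122: calculate the second coordinates according to the first coordinates and the preset size;
Step S4123: adjust the size of the preliminary mapping image according to the second coordinates to obtain the preliminary area image.
In step S4121 of some embodiments, the first coordinates of the target area image are obtained. Specifically, let the target area image be $I'$; the first coordinates are the coordinates $(x'_{i,j}, y'_{i,j})$ of any point in the target area image $I'$.
In step S4122 of some embodiments, the back-projection coordinates from the preliminary mapping image to the target area image, that is, the second coordinates, are calculated according to the first coordinates and the preset size. Specifically, for any point $(x''_{i,j}, y''_{i,j})$ in the preliminary mapping image $I''$, the coordinate value of its projection into the target area image $I'$ is
Figure PCTCN2022090723-appb-000007
Bilinear interpolation is then used to calculate the pixel at the coordinate point $(x'_{i,j}, y'_{i,j})$ in $I'$; this pixel is the pixel $I''_{c,i,j}$ of the corresponding point in $I''$, that is, the second coordinates. Calculation formula (2) is as follows:
Figure PCTCN2022090723-appb-000008
where $k(d) = \max(0, 1 - |d|)$ and $d$ denotes the distance between two points.
In step S4123 of some embodiments, the size of the preliminary mapping image is adjusted according to the second coordinates to obtain the preliminary area image. Specifically, after the second coordinates are determined, the preliminary mapping image is scaled according to the second coordinates, for example by moving the coordinates of a given point of the image to the position of the second coordinates, or by adjusting the coordinates of the image with reference to the position of the second coordinates.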
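A minimal sketch of bilinear sampling built on the kernel $k(d) = \max(0, 1 - |d|)$ may make steps S4121 to S4123 more concrete. Formula (2) and the projection formula are only available as images, so the weighted sum below is the standard bilinear form rather than a transcription, and the evenly spaced coordinate mapping in the hypothetical `resize_region` helper is an assumption.

```python
import math


def k(d):
    """Bilinear kernel from the text: k(d) = max(0, 1 - |d|)."""
    return max(0.0, 1.0 - abs(d))


def bilinear_sample(channel, x, y):
    """Sample one channel (2D list indexed [row][col]) at real-valued (x, y).

    Each of the four neighbouring pixels is weighted by k(dx) * k(dy);
    pixels outside the image contribute nothing.
    """
    height, width = len(channel), len(channel[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                value += channel[yy][xx] * k(xx - x) * k(yy - y)
    return value


def resize_region(channel, out_w, out_h):
    """Resize a channel to out_w x out_h by back-projecting each output point.

    Assumption: output points map to evenly spaced source coordinates that
    include both borders; the source's own projection formula may differ.
    """
    height, width = len(channel), len(channel[0])
    out = []
    for j in range(out_h):
        y = j * (height - 1) / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for i in range(out_w):
            x = i * (width - 1) / (out_w - 1) if out_w > 1 else 0.0
            row.append(bilinear_sample(channel, x, y))
        out.append(row)
    return out


channel = [[0.0, 2.0], [4.0, 6.0]]
center = bilinear_sample(channel, 0.5, 0.5)
resized = resize_region(channel, 3, 3)
print(center, resized)
```

The kernel is zero for neighbours more than one pixel away, which is why only the four surrounding pixels need to be visited.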
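In some embodiments, as shown in Figure 5, step S500 specifically includes, but is not limited to, step S510, step S520 and step S530.
Step S510: input the regional feature vector into the topic generation model, wherein the topic generation model includes a recurrent layer and a hidden layer;
Step S520: perform recurrent iterative processing on the regional feature vector through the recurrent layer to obtain the topic word vector;
Step S530: obtain time status information from the hidden layer according to the topic word vector.
In step S510 of some embodiments, the regional feature vector is input into the topic generation model. The topic generation model includes a recurrent layer, used to generate topic word vectors, and a hidden layer, used to output time status information. In practical applications, the topic generation model of the embodiments of the present application may adopt a single-layer LSTM model.
In step S520 of some embodiments, the regional feature vector is processed iteratively through the recurrent layer of the topic generation model to generate topic word vectors; the topic word vectors generated by the LSTM model determine the number of sentences contained in the final image description, that is, the number of topics.
In step S530 of some embodiments, time status information is obtained from the hidden layer of the LSTM according to the topic word vector, where the time status information refers to the hidden state of the hidden layer in the LSTM model at the moment when the topic word vector is generated. Let the hidden state of the hidden layer at each moment be
Figure PCTCN2022090723-appb-000009
where H is the hidden layer size. In the embodiments of the present application, the hidden state serves two purposes. First, the hidden layer is classified to decide whether the sentence generated at this moment is the last sentence of the description generated for the image, denoted
Figure PCTCN2022090723-appb-000010
Second, the hidden state at this moment is used as input to the word generation model to generate, for the corresponding topic sentence
Figure PCTCN2022090723-appb-000011
the individual P-dimensional word vectors
Figure PCTCN2022090723-appb-000012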
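The control flow of the topic generation model, one topic vector per time step until the hidden state is classified as belonging to the last sentence, can be sketched as follows. Everything named here is hypothetical: `step_fn` stands in for a single-layer LSTM step plus the stop classifier, and the 0.5 threshold and `max_sentences` cap are assumptions.

```python
def generate_topics(step_fn, region_vector, max_sentences=6, stop_threshold=0.5):
    """Run the sentence-level loop and collect (topic vector, hidden state) pairs.

    `step_fn(region_vector, state)` must return
    (new_state, topic_vector, p_stop), where p_stop is the probability that
    the sentence generated at this time step is the last one.
    """
    topics = []
    state = None
    for t in range(max_sentences):
        state, topic_vector, p_stop = step_fn(region_vector, state)
        topics.append({"time": t, "topic": topic_vector, "hidden": state})
        if p_stop > stop_threshold:  # this sentence is classified as the last
            break
    return topics


def toy_step(region_vector, state):
    """Toy stand-in for an LSTM step: stops after the third sentence."""
    step = 0 if state is None else state["step"] + 1
    topic_vector = [v * (step + 1) for v in region_vector]
    p_stop = 0.9 if step >= 2 else 0.1
    return {"step": step}, topic_vector, p_stop


topics = generate_topics(toy_step, [1.0, 2.0])
print(len(topics), topics[-1]["topic"])
```

Keeping the hidden state next to each topic vector mirrors the text: the same state both drives the stop decision and is handed to the word generation model.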
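In some embodiments, as shown in Figure 6, step S600 specifically includes, but is not limited to, step S610, step S620, step S630 and step S640.
Step S610: obtain the activation function of the word generation model;
Step S620: input the topic word vector and the time status information into the word generation model;
Step S630: calculate on the topic word vector and the time status information according to the activation function to obtain at least one candidate word;
Step S640: obtain the description word from the candidate words.
In step S610 of some embodiments, the activation function of the word generation model is obtained, where the word generation model is used to generate the description words of the corresponding topic sentence according to the topic data. In practical applications, the word generation model may adopt a two-layer GRU model.
In step S620 of some embodiments, the topic word vector and the time status information are input into the word generation model.
In step S630 of some embodiments, the topic word vector and the time status information are calculated according to the activation function to obtain at least one candidate word. Specifically, the initial input of the word generation model is the topic word vector output by the topic generation model at that moment, and the hidden state of the hidden layer in the word generation model at each moment is also input into the word generation model. The process of calculating candidate words is shown in formula (3), formula (4) and formula (5):
Figure PCTCN2022090723-appb-000013
Figure PCTCN2022090723-appb-000014
Figure PCTCN2022090723-appb-000015
where $x_{-1}$ represents the P-dimensional topic word vector generated by the sentence-level LSTM, which is used as the initial input of the GRU model; $S_t$ represents the candidate word generated by the GRU model; $S_0$ is the start mark;
Figure PCTCN2022090723-appb-000016
is the hidden state of the hidden layer in the topic generation model at the previous moment; and $p_{t+1}$ represents the distribution probability of the (t+1)-th candidate word over the entire preset word set.
In step S640 of some embodiments, the description word is selected from the candidate words and used to generate the topic sentence.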
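In some embodiments, as shown in Figure 7, step S640 specifically includes, but is not limited to, step S641 and step S642.
Step S641: calculate the distribution probability of each candidate word in the preset word set;
Step S642: obtain the candidate word with the largest distribution probability as the description word.
In step S641 of some embodiments, the distribution probability of each candidate word in the preset word set is calculated. Specifically, it can be calculated through formula (6):
$p_{t+1} = p(S_{t+1} \mid I, S_0, \ldots, S_t) \quad (6)$
where $p_{t+1}$ represents the distribution probability of the (t+1)-th candidate word over the entire preset word set, and $I$ represents the preset word set.
In step S642 of some embodiments, the candidate word with the largest distribution probability is obtained as the description word. Specifically, the candidate word with the largest distribution probability is selected as the t-th word of the corresponding sentence; when the hidden state of the most probable candidate word corresponds to an end flag, generation of the sentence ends and the iteration terminates. In addition, after each GRU model has generated the description words of its topic sentence, the topic sentences are connected according to the time status information to form the overall image description paragraph.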
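Steps S641 and S642 amount to greedy decoding: turn scores into a distribution over the preset word set, take the most probable word, and stop at the end flag. The sketch below uses a softmax for the distribution, which is an assumption because formulas (3) to (5) are not reproduced; `score_fn`, the `<s>` and `</s>` markers and the toy vocabulary are all hypothetical stand-ins for the two-layer GRU.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(score_fn, vocabulary, start="<s>", end="</s>", max_words=20):
    """Pick the most probable word at each step until the end flag appears.

    `score_fn(previous_word, words_so_far)` returns one score per vocabulary
    entry; in a real model it would be a GRU step seeded with the topic
    vector and hidden state.
    """
    words = []
    previous = start
    for _ in range(max_words):
        probs = softmax(score_fn(previous, words))
        best = max(range(len(vocabulary)), key=lambda i: probs[i])
        word = vocabulary[best]
        if word == end:  # end flag: the sentence is complete
            break
        words.append(word)
        previous = word
    return words


vocab = ["There", "is", "a", "competition", "</s>"]
order = {"<s>": "There", "There": "is", "is": "a",
         "a": "competition", "competition": "</s>"}


def toy_scores(previous, words):
    """Toy scorer that strongly prefers the next word of a fixed sentence."""
    target = order[previous]
    return [5.0 if w == target else 0.0 for w in vocab]


sentence = greedy_decode(toy_scores, vocab)
print(sentence)
```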
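In practical applications, the topic generation model and the word generation model can form one overall image description model, and the loss function of the whole image description model is the weighted sum of the losses of the two levels of the hierarchical recurrent network, that is, of the topic generation model and the word generation model. In addition, a gradient descent algorithm can be used to update the model parameters of the topic generation model according to the calculated loss value, so as to obtain a trained topic generation model and further improve the accuracy of the image description.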
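The weighted-sum loss can be illustrated numerically. The text does not give the individual loss terms or the weights, so this sketch assumes cross-entropy for both levels (a stop/continue loss for the topic generation model and a word loss for the word generation model); the names `paragraph_loss`, `lambda_sent` and `lambda_word` are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy of a predicted distribution against one target class."""
    return -math.log(probs[target_index])


def paragraph_loss(stop_probs, stop_targets, word_probs, word_targets,
                   lambda_sent=1.0, lambda_word=1.0):
    """Weighted sum of a sentence-level loss and a word-level loss.

    stop_probs[i] is the predicted probability that sentence i is the last
    one, and stop_targets[i] is 1 or 0; word_probs[j] is the predicted
    distribution for word j, and word_targets[j] is the true word index.
    """
    sent_loss = sum(-math.log(p if y == 1 else 1.0 - p)
                    for p, y in zip(stop_probs, stop_targets))
    word_loss = sum(cross_entropy(p, t)
                    for p, t in zip(word_probs, word_targets))
    return lambda_sent * sent_loss + lambda_word * word_loss


loss = paragraph_loss(stop_probs=[0.5], stop_targets=[1],
                      word_probs=[[0.5, 0.5]], word_targets=[0],
                      lambda_sent=2.0, lambda_word=1.0)
print(loss)
```

In training, a gradient descent step would then move the model parameters against the gradient of this scalar.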
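The image description method proposed in the embodiments of the present application acquires an original image; performs feature extraction on the original image to obtain image features; performs area detection on the original image according to the image features to obtain a target area image; performs feature extraction on the target area image to obtain a regional feature vector; performs extraction processing on the regional feature vector through the topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector; performs word prediction on the topic data through the word generation model to obtain description words; and splices each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image. Through multiple rounds of feature extraction, the generated target description text can contain more image details; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.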
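In some embodiments, as shown in Figure 8, an image description device 800 is also provided, which can implement the above image description method. The image description device 800 includes an image acquisition module 810, a first feature extraction module 820, a region detection module 830, a second feature extraction module 840, a data extraction module 850, a word prediction module 860 and a word splicing module 870. The image acquisition module 810 is used to acquire an original image; the first feature extraction module 820 is used to perform feature extraction on the original image to obtain image features; the region detection module 830 is used to perform area detection on the original image according to the image features to obtain a target area image; the second feature extraction module 840 is used to perform feature extraction on the target area image to obtain a regional feature vector; the data extraction module 850 is used to perform extraction processing on the regional feature vector through the topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector; the word prediction module 860 is used to perform word prediction on the topic data through the word generation model to obtain description words; and the word splicing module 870 is used to splice each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.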
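The module structure of device 800 can be sketched as a simple pipeline of callables, one per module. This is only an illustration: the class and field names are hypothetical, and the interfaces between modules (what each one receives and returns) are assumptions, since the text describes only each module's purpose.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ImageDescriptionDevice:
    """One callable per module of device 800 (field names are hypothetical)."""
    acquire_image: Callable[[Any], Any]            # image acquisition module 810
    extract_image_features: Callable[[Any], Any]   # first feature extraction module 820
    detect_regions: Callable[[Any, Any], Any]      # region detection module 830
    extract_region_features: Callable[[Any], Any]  # second feature extraction module 840
    extract_topics: Callable[[Any], Any]           # data extraction module 850
    predict_words: Callable[[Any], Any]            # word prediction module 860
    splice_words: Callable[[Any, Any], str]        # word splicing module 870

    def describe(self, source):
        """Run the modules in the order of steps S100 to S700."""
        image = self.acquire_image(source)
        features = self.extract_image_features(image)
        regions = self.detect_regions(image, features)
        region_vector = self.extract_region_features(regions)
        topics = self.extract_topics(region_vector)
        words = self.predict_words(topics)
        return self.splice_words(topics, words)
```

Passing the modules in as callables keeps the sketch independent of any particular network library; real modules would wrap the VGG-16, RPN, LSTM and GRU components described above.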
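The image description device 800 of the embodiments of the present application is used to execute the image description method of the above embodiments; its specific processing is the same as that of the image description method in the above embodiments and is not repeated here.
The image description device 800 proposed in the embodiments of the present application acquires an original image; performs feature extraction on the original image to obtain image features; performs area detection on the original image according to the image features to obtain a target area image; performs feature extraction on the target area image to obtain a regional feature vector; performs extraction processing on the regional feature vector through the topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector; performs word prediction on the topic data through the word generation model to obtain description words; and splices each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image. Through multiple rounds of feature extraction, the generated target description text can contain more image details; in addition, on the basis of area detection on the original image, the topic generation model and the word generation model are used in sequence to generate the target description text hierarchically, which produces description text with coherent semantics.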
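An embodiment of the present application also provides a computer device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions, and the instructions are executed by the at least one processor so that, when executing the instructions, the at least one processor implements an image description method, wherein the image description method includes:
acquiring an original image;
performing feature extraction on the original image to obtain image features;
performing area detection on the original image according to the image features to obtain a target area image;
performing feature extraction on the target area image to obtain a regional feature vector;
performing extraction processing on the regional feature vector through the topic generation model to obtain topic data, wherein the topic data includes a topic word vector and time status information corresponding to the topic word vector;
performing word prediction on the topic data through the word generation model to obtain description words;
splicing each description word according to the time status information to obtain target description text, wherein the target description text is used to describe the original image.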
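The hardware structure of the computer device is described in detail below with reference to FIG. 9. The computer device includes a processor 910, a memory 920, an input/output interface 930, a communication interface 940, and a bus 950.
The processor 910 may be implemented as a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by the embodiments of the present application;
The memory 920 may be implemented as read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or random access memory (Random Access Memory, RAM). The memory 920 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 920 and is called by the processor 910 to execute the image description method of the embodiments of the present application;
The input/output interface 930 is used to implement information input and output;
The communication interface 940 is used to implement communication between this device and other devices, either in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WIFI, or Bluetooth); and
The bus 950 transfers information between the components of the device (for example the processor 910, the memory 920, the input/output interface 930, and the communication interface 940);
The processor 910, the memory 920, the input/output interface 930, and the communication interface 940 are communicatively connected to one another inside the device through the bus 950.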
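An embodiment of the present application further provides a storage medium. The storage medium is a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute an image description method, wherein the image description method includes:
acquiring an original image;
performing feature extraction on the original image to obtain image features;
performing region detection on the original image according to the image features to obtain a target region image;
performing feature extraction on the target region image to obtain a region feature vector;
performing extraction processing on the region feature vector through a topic generation model to obtain topic data, where the topic data includes a topic word vector and time-step state information corresponding to the topic word vector;
performing word prediction on the topic data through a word generation model to obtain description words;
splicing each description word according to the time-step state information to obtain a target description text, where the target description text is used to describe the original image.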
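The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory optionally includes memory located remotely from the processor, and such remote memory can be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The image description method and apparatus, computer device, and storage medium proposed in the embodiments of the present application acquire an original image; perform feature extraction on the original image to obtain image features; perform region detection on the original image according to the image features to obtain a target region image; perform feature extraction on the target region image to obtain a region feature vector; perform extraction processing on the region feature vector through a topic generation model to obtain topic data, where the topic data includes a topic word vector and time-step state information corresponding to the topic word vector; perform word prediction on the topic data through a word generation model to obtain description words; and splice the description words according to the time-step state information to obtain a target description text, where the target description text is used to describe the original image. Through multiple rounds of feature extraction, the embodiments of the present application allow the generated target description text to contain more image details. In addition, on the basis of region detection on the original image, the topic generation model and the word generation model are applied in turn to generate the target description text hierarchically, which yields description text with coherent semantics.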
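The embodiments described in the present application are intended to explain the technical solutions of the embodiments more clearly and do not limit the technical solutions provided by the embodiments. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art can understand that the technical solutions shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices, may be implemented as software, firmware, hardware, and appropriate combinations thereof.
The terms "first", "second", "third", "fourth", and so on (if present) in the specification of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that, in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.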
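In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store programs, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.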

Claims (20)

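  1. An image description method, wherein the method comprises:
    acquiring an original image;
    performing feature extraction on the original image to obtain image features;
    performing region detection on the original image according to the image features to obtain a target region image;
    performing feature extraction on the target region image to obtain a region feature vector;
    performing extraction processing on the region feature vector through a topic generation model to obtain topic data, wherein the topic data comprises a topic word vector and time-step state information corresponding to the topic word vector;
    performing word prediction on the topic data through a word generation model to obtain description words;
    splicing each of the description words according to the time-step state information to obtain a target description text, wherein the target description text is used to describe the original image.
  2. The method according to claim 1, wherein the performing feature extraction on the target region image to obtain a region feature vector comprises:
    preprocessing the target region image to obtain a preliminary region image;
    performing convolution processing on the preliminary region image to obtain a convolution feature vector;
    performing pooling processing on the convolution feature vector to obtain the region feature vector.
  3. The method according to claim 2, wherein the preprocessing the target region image to obtain a preliminary region image comprises:
    performing feature mapping on the target region image to obtain a preliminary mapped image;
    performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image.
  4. The method according to claim 3, wherein the performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image comprises:
    acquiring first coordinates of the target region image;
    calculating second coordinates according to the first coordinates and the preset size;
    adjusting the size of the preliminary mapped image according to the second coordinates to obtain the preliminary region image.
  5. The method according to any one of claims 1 to 4, wherein the performing extraction processing on the region feature vector through a topic generation model to obtain topic data comprises:
    inputting the region feature vector into the topic generation model, wherein the topic generation model comprises a recurrent layer and a hidden layer;
    performing loop iteration processing on the region feature vector through the recurrent layer to obtain the topic word vector;
    acquiring the time-step state information from the hidden layer according to the topic word vector.
  6. The method according to claim 5, wherein the performing word prediction on the topic data through a word generation model to obtain description words comprises:
    acquiring an activation function of the word generation model;
    inputting the topic word vector and the time-step state information into the word generation model;
    computing on the topic word vector and the time-step state information according to the activation function to obtain at least one candidate word;
    acquiring the description word from the candidate words.
  7. The method according to claim 6, wherein the acquiring the description word from the candidate words comprises:
    calculating the distribution probability of each of the candidate words over a preset word set;
    taking the candidate word with the largest distribution probability as the description word.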
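  8. An image description apparatus, wherein the apparatus comprises:
    an image acquisition module, configured to acquire an original image;
    a first feature extraction module, configured to perform feature extraction on the original image to obtain image features;
    a region detection module, configured to perform region detection on the original image according to the image features to obtain a target region image;
    a second feature extraction module, configured to perform feature extraction on the target region image to obtain a region feature vector;
    a data extraction module, configured to perform extraction processing on the region feature vector through a topic generation model to obtain topic data, wherein the topic data comprises a topic word vector and time-step state information corresponding to the topic word vector;
    a word prediction module, configured to perform word prediction on the topic data through a word generation model to obtain description words;
    a word splicing module, configured to splice each of the description words according to the time-step state information to obtain a target description text, wherein the target description text is used to describe the original image.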
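  9. A computer device, wherein the computer device comprises a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor is configured to execute an image description method, wherein the image description method comprises:
    acquiring an original image;
    performing feature extraction on the original image to obtain image features;
    performing region detection on the original image according to the image features to obtain a target region image;
    performing feature extraction on the target region image to obtain a region feature vector;
    performing extraction processing on the region feature vector through a topic generation model to obtain topic data, wherein the topic data comprises a topic word vector and time-step state information corresponding to the topic word vector;
    performing word prediction on the topic data through a word generation model to obtain description words;
    splicing each of the description words according to the time-step state information to obtain a target description text, wherein the target description text is used to describe the original image.
  10. The computer device according to claim 9, wherein the performing feature extraction on the target region image to obtain a region feature vector comprises:
    preprocessing the target region image to obtain a preliminary region image;
    performing convolution processing on the preliminary region image to obtain a convolution feature vector;
    performing pooling processing on the convolution feature vector to obtain the region feature vector.
  11. The computer device according to claim 10, wherein the preprocessing the target region image to obtain a preliminary region image comprises:
    performing feature mapping on the target region image to obtain a preliminary mapped image;
    performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image.
  12. The computer device according to claim 11, wherein the performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image comprises:
    acquiring first coordinates of the target region image;
    calculating second coordinates according to the first coordinates and the preset size;
    adjusting the size of the preliminary mapped image according to the second coordinates to obtain the preliminary region image.
  13. The computer device according to any one of claims 9 to 12, wherein the performing extraction processing on the region feature vector through a topic generation model to obtain topic data comprises:
    inputting the region feature vector into the topic generation model, wherein the topic generation model comprises a recurrent layer and a hidden layer;
    performing loop iteration processing on the region feature vector through the recurrent layer to obtain the topic word vector;
    acquiring the time-step state information from the hidden layer according to the topic word vector.
  14. The computer device according to claim 13, wherein the performing word prediction on the topic data through a word generation model to obtain description words comprises:
    acquiring an activation function of the word generation model;
    inputting the topic word vector and the time-step state information into the word generation model;
    computing on the topic word vector and the time-step state information according to the activation function to obtain at least one candidate word;
    acquiring the description word from the candidate words.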
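  15. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the computer is configured to execute an image description method, wherein the image description method comprises:
    acquiring an original image;
    performing feature extraction on the original image to obtain image features;
    performing region detection on the original image according to the image features to obtain a target region image;
    performing feature extraction on the target region image to obtain a region feature vector;
    performing extraction processing on the region feature vector through a topic generation model to obtain topic data, wherein the topic data comprises a topic word vector and time-step state information corresponding to the topic word vector;
    performing word prediction on the topic data through a word generation model to obtain description words;
    splicing each of the description words according to the time-step state information to obtain a target description text, wherein the target description text is used to describe the original image.
  16. The storage medium according to claim 15, wherein the performing feature extraction on the target region image to obtain a region feature vector comprises:
    preprocessing the target region image to obtain a preliminary region image;
    performing convolution processing on the preliminary region image to obtain a convolution feature vector;
    performing pooling processing on the convolution feature vector to obtain the region feature vector.
  17. The storage medium according to claim 16, wherein the preprocessing the target region image to obtain a preliminary region image comprises:
    performing feature mapping on the target region image to obtain a preliminary mapped image;
    performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image.
  18. The storage medium according to claim 17, wherein the performing size transformation on the preliminary mapped image according to a preset size to obtain the preliminary region image comprises:
    acquiring first coordinates of the target region image;
    calculating second coordinates according to the first coordinates and the preset size;
    adjusting the size of the preliminary mapped image according to the second coordinates to obtain the preliminary region image.
  19. The storage medium according to any one of claims 15 to 18, wherein the performing extraction processing on the region feature vector through a topic generation model to obtain topic data comprises:
    inputting the region feature vector into the topic generation model, wherein the topic generation model comprises a recurrent layer and a hidden layer;
    performing loop iteration processing on the region feature vector through the recurrent layer to obtain the topic word vector;
    acquiring the time-step state information from the hidden layer according to the topic word vector.
  20. The storage medium according to claim 19, wherein the performing word prediction on the topic data through a word generation model to obtain description words comprises:
    acquiring an activation function of the word generation model;
    inputting the topic word vector and the time-step state information into the word generation model;
    computing on the topic word vector and the time-step state information according to the activation function to obtain at least one candidate word;
    acquiring the description word from the candidate words.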
PCT/CN2022/090723 2022-03-22 2022-04-29 Image description method and apparatus, computer device, and storage medium WO2023178801A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210283244.XA CN114677520A (zh) 2022-03-22 2022-03-22 Image description method and apparatus, computer device, and storage medium
CN202210283244.X 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023178801A1 (zh) 2023-09-28

Family

ID=82074932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090723 WO2023178801A1 (zh) 2022-03-22 2022-04-29 Image description method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN114677520A (zh)
WO (1) WO2023178801A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
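CN111444968A (zh) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
CN111753078A (zh) * 2019-07-12 2020-10-09 北京京东尚科信息技术有限公司 Image paragraph description generation method, apparatus, medium and electronic device
CN113035311A (zh) * 2021-03-30 2021-06-25 广东工业大学 Automatic medical image report generation method based on a multimodal attention mechanism
CN113468357A (zh) * 2021-07-21 2021-10-01 北京邮电大学 Image description text generation method and apparatus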

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
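CN109783666B (zh) * 2019-01-11 2023-05-23 中山大学 Image scene graph generation method based on iterative refinement
CN110717498A (zh) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 Image description generation method, apparatus and electronic device
CN114037831B (zh) * 2021-07-20 2023-08-04 星汉智能科技股份有限公司 Image deep dense description method, system and storage medium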

Also Published As

Publication number Publication date
CN114677520A (zh) 2022-06-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932858

Country of ref document: EP

Kind code of ref document: A1