WO2021052358A1 - Image processing method and apparatus, and electronic device (图像处理方法、装置及电子设备) - Google Patents

Image processing method and apparatus, and electronic device (图像处理方法、装置及电子设备)

Info

Publication number
WO2021052358A1
WO2021052358A1 · PCT/CN2020/115559 · CN2020115559W
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
word vector
network
reflection
Prior art date
Application number
PCT/CN2020/115559
Other languages
English (en)
French (fr)
Inventor
柯磊
裴文杰
李睿宇
沈小勇
戴宇荣
贾佳亚
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP20866551.3A priority Critical patent/EP3998552A4/en
Priority to JP2021564175A priority patent/JP7164252B2/ja
Publication of WO2021052358A1 publication Critical patent/WO2021052358A1/zh
Priority to US17/517,004 priority patent/US11907637B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, and in particular to an image processing method, an image description generation apparatus, and an electronic device.
  • Image description generation studies how to generate, for a picture, a natural language description that expresses its meaning, and it has broad application prospects. For example, generating a text description of a picture can help visually impaired people understand the image content quickly and accurately; in the field of preschool education, generating intuitive and accurate descriptions of children's pictures can give children better early learning.
  • At present, image description generation mainly uses a convolutional neural network to encode the image into a fixed vector, and then directly uses a recurrent neural network to decode that vector into a sentence describing the content.
  • However, the existing decoding model is relatively simple, so the model's performance drops significantly when the sentence is long or the sentence structure is complex.
  • The embodiments of the present disclosure provide an image processing method, an image processing device, and an electronic device, which can, at least to a certain extent, accurately and effectively extract the natural language information contained in an image and generate a more accurate and fluent text description.
  • According to an aspect of the embodiments of the present disclosure, an image processing method is provided, including: acquiring an input image, and encoding the objects contained in each image region of the input image to obtain a first image feature; processing the pixels in the first image feature according to a preset rule, and determining a second image feature according to the processed pixels; and, based on the second image feature and a starting word vector, decoding the region feature corresponding to each image region in the first image feature at different moments to obtain a word vector corresponding to each image region, and forming a text description corresponding to the input image according to the word vectors, where the starting word vector is the start tag of the text description.
  • According to an aspect of the embodiments of the present disclosure, an image processing device is provided, including: a feature extraction module for acquiring an input image and encoding the objects contained in each image region of the input image to obtain a first image feature; a feature conversion module for processing the pixels in the first image feature according to a preset rule and determining a second image feature according to the processed pixels; and a description generation module for, based on the second image feature and a starting word vector, decoding the region features corresponding to the image regions in the first image feature at different moments to obtain a word vector corresponding to each image region, and forming a text description corresponding to the input image according to the word vectors, where the starting word vector is the start tag of the text description.
  • The technical solution of the present disclosure decodes the image features corresponding to the input image through the decoding network model. On the one hand, it can extract the natural language information contained in the input image more accurately and effectively; on the other hand, it keeps the decoding network model applicable when sentences are longer or sentence structures are more complex, which improves the accuracy and fluency of the text description.
  • FIG. 1 is a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present disclosure can be applied;
  • FIG. 2 is a schematic flowchart of an image processing method in the related art;
  • FIG. 3 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic structural diagram of a reflection decoding network model according to an embodiment of the present disclosure;
  • FIG. 5 is a schematic structural diagram of a visual attention module according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a processing flow of a visual attention module according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic flowchart of image processing according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic diagram of a processing flow of a reflective attention module according to an embodiment of the present disclosure;
  • FIG. 9 is a schematic structural diagram of a reflective attention module according to an embodiment of the present disclosure;
  • FIG. 10 is a schematic diagram of a process of determining a position-aware loss by a reflective position module according to an embodiment of the present disclosure;
  • FIG. 11 is a block diagram of an image processing device according to an embodiment of the present disclosure;
  • FIG. 12 is a schematic structural diagram of a computer system suitable for implementing the image processing device of the embodiments of the present disclosure.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present disclosure can be applied.
  • the system architecture 100 may include a terminal device 101, a network 102, and a server 103.
  • the network 102 is used to provide a medium of a communication link between the terminal device 101 and the server 103.
  • the network 102 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to actual needs, there can be any number of terminal devices, networks and servers.
  • the server 103 may be a server cluster composed of multiple servers.
  • the terminal device 101 sends the image to the server 103 via the network 102.
  • After the server 103 obtains the input image, it can first divide the input image to form multiple image regions, and then use the coding network model to perform feature extraction on the objects in each image region to obtain the region feature corresponding to each image region; the first image feature corresponding to the input image is then obtained from the region features of the image regions. Next, the pixels in the first image feature are processed according to preset rules, and the second image feature corresponding to the input image is determined according to the processed pixels. The first image feature, the second image feature, and the starting word vector are then input into the reflection decoding network model, which decodes the first image feature to obtain a word vector corresponding to each image region; a text description corresponding to the input image is then formed according to the word vectors corresponding to the image regions.
  • The technical solution of the embodiments of the present disclosure can maintain the performance of the model when the sentence is long or the sentence structure is complex, can extract the natural language information contained in the image more accurately and effectively, and can generate a more accurate and fluent text description.
  • the image processing method provided by the embodiments of the present disclosure is generally executed by a server, and correspondingly, the image processing device is generally set in the server. However, in other embodiments of the present disclosure, the image processing method provided by the embodiments of the present disclosure may also be executed by a terminal device.
  • FIG. 2 shows a schematic flow diagram of the image processing method in the related art.
  • The image 201 is input to the coding network model 202.
  • The coding network model 202 includes a Faster R-CNN network and a ResNet-101 network: extracting features from the input image through the Faster R-CNN network yields the local feature information corresponding to each object in the input image, and extracting features from the input image through the ResNet-101 network yields the global feature information corresponding to the input image. The local feature information and the global feature information are then input to the decoding network model 203, which includes a plurality of repeated network structures based on an attention recurrent neural network. Specifically, the global feature information is input to the first-layer LSTM, which performs feature extraction on it and outputs a first hidden state; the first hidden state and the local feature information are then input to the attention mechanism network layer, which outputs a mixed feature; the mixed feature and the first hidden state are then processed together by the second-layer LSTM to output a second hidden state; finally, softmax processing is applied to the second hidden state to obtain the predicted word vector.
  • Although the image description generation algorithm shown in Figure 2 can achieve fairly good results, it still has limitations. Specifically, it can only improve the model by extracting more representative and fine-grained image features, separated down to the level of a single object, while ignoring the language model itself.
  • The decoding network model is relatively simple, which leads to a significant decrease in the model's performance when the sentence is long or the sentence structure is complex.
  • The embodiments of the present disclosure provide an image processing method that involves the field of artificial intelligence (AI).
  • Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer vision is a technology that studies how to make machines "see"; more specifically, it uses cameras and computers, instead of human eyes, to identify, track, and measure targets, and further performs graphics processing so that the processed images are better suited for human observation or for transmission to instruments for detection.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric recognition technologies such as face recognition and fingerprint recognition.
  • Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from instruction and other technologies.
  • Artificial intelligence technology has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, intelligent medical care, and intelligent customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and deliver increasingly important value.
  • the embodiment of the present disclosure first proposes an image processing method, which can be applied in fields such as early childhood education, image retrieval, and navigation for the blind.
  • Fig. 3 is a flowchart of an image processing method according to an embodiment of the present disclosure.
  • the image processing method may be executed by one or more computing devices, and the one or more computing devices may be the terminal device 101 and/or the server 103 shown in FIG. 1.
  • the image processing method includes at least step S310 to step S330.
  • In step S310, an input image is acquired, and the objects contained in each image region of the input image are encoded to obtain a first image feature.
  • The input image may be an image downloaded from the Internet, an image stored locally in the terminal device 101, or an image captured by the user through a photographing device such as a camera, a video camera, or a smart phone; after it is determined that a text description needs to be generated for the image, the image can be sent to the server 103 through the terminal device 101.
  • the terminal device 101 may be any terminal device with a display screen, such as a smart phone, a notebook computer, a desktop computer, etc., which is not specifically limited in the embodiment of the present disclosure.
  • After the input image is received, it can be divided to form multiple image regions.
  • The input image may be divided based on the number of pixels, or based on the different objects contained in the image, and so on.
  • The objects in each image region can then be encoded, that is, subjected to feature extraction.
  • For example, if the scene presented by an image is a child playing with a ball in the yard, then the objects in the image are the child, the ball, and the grass. Background elements of the image, such as the sky and birds, can be ignored, and there is no need to perform feature extraction on them.
  • Network structures such as Faster R-CNN, ResNet, and VGG can be used as the coding network model, and features can be extracted from the object in each image region through the coding network model to obtain the region feature corresponding to that image region, which is essentially a fixed vector expression of the image region. Further, the first image feature corresponding to the input image can be obtained from the region features corresponding to the image regions.
  • In step S320, the pixels in the first image feature are processed according to a preset rule, and the second image feature is determined according to the processed pixels.
  • The pixel values in the second image feature can be determined according to the pixel value of each pixel in the first image feature. Specifically, the pixel average of all pixels in the first image feature can be calculated and used as the pixel value of each pixel in the second image feature.
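  • As an illustration only, a minimal sketch of how the first and second image features might be assembled is given below, assuming PyTorch; the per-region vectors are taken as already produced by the coding network (e.g. Faster R-CNN / ResNet), and the averaging rule is read as element-wise averaging over the region features. This is a sketch under those assumptions, not the patent's reference implementation.

```python
# Illustrative sketch (not the patent's reference code): building the first image
# feature from per-region vectors and deriving the second image feature by averaging.
import torch

def build_image_features(region_vectors):
    # region_vectors: list of K per-region feature vectors, each of shape (feat_dim,),
    # produced by the coding network for the K image regions.
    first_image_feature = torch.stack(region_vectors, dim=0)   # (K, feat_dim)
    # Second image feature: each position holds the average of the corresponding
    # values across all regions (one reading of the pixel-averaging rule above).
    second_image_feature = first_image_feature.mean(dim=0)     # (feat_dim,)
    return first_image_feature, second_image_feature

# Example usage with random stand-in region features.
regions = [torch.randn(1000) for _ in range(36)]
first_feat, second_feat = build_image_features(regions)
```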
  • The second image feature can be input to the reflection decoding network model as an input feature, so that the reflection decoding network model decodes the first image feature according to the second image feature and the starting word vector, and predicts the word vector corresponding to each image region in the first image feature.
  • The starting word vector in the embodiments of the present disclosure can be any character without substantive semantics; for example, it can be a start symbol such as "#", or a start token such as "BN", and so on, which is not specifically limited in the embodiments of the present disclosure.
  • In step S330, based on the second image feature and the starting word vector, the region feature corresponding to each image region in the first image feature is decoded to obtain the word vector corresponding to each image region, and a text description corresponding to the input image is formed according to the word vectors, where the starting word vector is the start mark of the text description.
  • In step S330, the region features corresponding to the image regions in the first image feature can be decoded at different moments, and the current region feature can be decoded by using the previously decoded region features.
  • The second image feature is input into the reflection decoding network model as an input feature, and the starting word vector can also be input into the reflection decoding network model, so that the model decodes the region features corresponding to the image regions in the first image feature at different moments and obtains the word vector corresponding to each image region.
  • Fig. 4 shows a schematic structural diagram of the reflection decoding network model.
  • As shown in Fig. 4, the reflection decoding network model includes a plurality of sequentially arranged reflection decoding sub-networks, which decode the region feature corresponding to each image region to obtain the word vector corresponding to that image region.
  • For the first reflection decoding sub-network, the second image feature and the starting word vector are input as input features, and the first reflection decoding sub-network decodes the target region feature in the first image feature based on them to obtain the word vector corresponding to that target region feature. For the (M+1)-th reflection decoding sub-network, the second image feature and the word vector output by the M-th reflection decoding sub-network are input to the (M+1)-th reflection decoding sub-network, which decodes the target region feature in the first image feature to obtain the word vector corresponding to that target region feature, where M is a positive integer.
  • the reflection decoding sub-networks have the same structure, and all include a visual attention module, a reflective attention module RAM (Reflective Attention Module), and a reflective position module RPM (Reflective Position Module).
  • The visual attention module mainly focuses on the visual features output by the coding network model.
  • The reflective attention module uses a text attention mechanism to model the matching degree between the output information of the visual attention module at the current moment and at past moments, and obtains a context vector used to generate the word at the current moment, so as to capture more comprehensive historical vocabulary information.
  • The reflective position module introduces the relative position information of each word in the generated text description: while the reflection decoding network model predicts a word, it also predicts the relative position of the current word in the text description, thereby helping the reflection decoding network model perceive the syntactic structure of the sentence.
  • Figure 5 shows a schematic structural diagram of the visual attention module.
  • As shown in Fig. 5, the visual attention module 500 includes a first long short-term memory network (LSTM-1) 501, a second long short-term memory network (LSTM-2) 502, and an attention mechanism network (Attvis) 503. The first long short-term memory network 501 performs feature extraction based on the second image feature and the word vector obtained at the previous moment; the second long short-term memory network 502 performs feature extraction based on the output information of the first long short-term memory network 501 and the output information of the attention mechanism network 503; and the attention mechanism network 503 performs feature extraction based on the first image feature and the output information of the first long short-term memory network 501.
  • FIG. 6 shows a schematic diagram of the processing flow of the visual attention module.
  • The processing flow of the visual attention module in the t-th reflection decoding sub-network is taken as an example for description in the embodiments of the present disclosure, as shown in FIG. 6.
  • the processing flow of the visual attention module includes at least steps S601-S604, specifically:
  • In step S601, the word vector output by the reflection decoding sub-network at the previous moment is multiplied by the first weight matrix to obtain the target word vector.
  • FIG. 7 shows a schematic diagram of the image processing flow.
  • As shown in FIG. 7, the second image feature determined from the first image feature and the word vector output by the reflection decoding sub-network at the previous moment are the input features of LSTM-1.
  • In step S602, feature extraction is performed on the second image feature and the target word vector through the first long short-term memory network to obtain the first output information.
  • LSTM-1 processes the target word vector and the second image feature to output the first output information, which is essentially the hidden state output by LSTM-1, as shown in Figure 7.
  • In step S603, the first output information and the first image feature are input to the attention mechanism network for visual matching to obtain the target region feature.
  • the attention mechanism is similar to human vision, which can selectively focus on a part of all information while ignoring other visible information.
  • The region feature with the highest matching degree is output from the attention mechanism network as the target region feature, as shown in Figure 7.
  • In step S604, feature extraction is performed on the first output information and the target region feature through the second long short-term memory network to obtain the second output information.
  • The target region feature and the first output information are input to LSTM-2 as input features, and LSTM-2 performs feature extraction on the first output information and the target region feature; the second output information is the hidden state output by LSTM-2, as shown in Figure 7.
  • Some embodiments may determine the word vector corresponding to the target region feature through the first hidden state and the second hidden state.
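  • A minimal sketch of this visual attention flow (steps S601–S604) is given below, assuming PyTorch; the class name, dimensions, and the soft-attention form are illustrative assumptions, not the patent's reference implementation.

```python
# Illustrative sketch of the visual attention module: LSTM-1, the attention
# network Att_vis, and LSTM-2, following steps S601-S604.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttentionModule(nn.Module):
    def __init__(self, feat_dim=1000, embed_dim=1000, hidden_dim=1000, att_dim=512):
        super().__init__()
        # LSTM-1 consumes the second image feature and the previous word embedding (S601-S602).
        self.lstm1 = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        # Att_vis scores each region feature against the first output information (S603).
        self.v_proj = nn.Linear(feat_dim, att_dim)
        self.h_proj = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)
        # LSTM-2 consumes the target region feature and the first output information (S604).
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)

    def forward(self, regions, global_feat, prev_word_emb, state1=None, state2=None):
        # regions: (B, K, feat_dim) first image feature; global_feat: (B, feat_dim) second image feature.
        h1, c1 = self.lstm1(torch.cat([global_feat, prev_word_emb], dim=1), state1)
        e = self.score(torch.tanh(self.v_proj(regions) + self.h_proj(h1).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                    # visual matching weights over the K regions
        target_region = (alpha * regions).sum(dim=1)   # target region feature
        h2, c2 = self.lstm2(torch.cat([target_region, h1], dim=1), state2)
        return h1, h2, (h1, c1), (h2, c2)              # first / second output information and LSTM states
```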
  • In order to improve the decoding effect when the sentence is long or the sentence structure is complex, the embodiments of the present disclosure propose a reflective attention module that uses a text attention mechanism to match the hidden state at the current moment against the hidden states at past moments.
  • The reflective attention module RAM in the t-th reflection decoding sub-network, in addition to receiving the second output information from the corresponding LSTM-2, also receives the second output information output by LSTM-2 in the 1st to (t-1)-th reflection decoding sub-networks and the first output information output by the corresponding LSTM-1, so as to determine, based on the second output information at past moments and the first and second output information at the current moment, the third output information corresponding to the target region feature at the current moment.
  • Fig. 8 shows a schematic diagram of the processing flow of the reflective attention module. As shown in Fig. 8, the processing flow at least includes steps S801-S805, specifically:
  • In step S801, the target matrix is determined according to the second output information of all past moments and the second output information of the current moment.
  • FIG. 9 shows a schematic structural diagram of a reflective attention module.
  • As shown in FIG. 9, the column in the upper left corner represents the second output information; according to the second output information at past moments and the second output information at the current moment, a target matrix with corresponding dimensions can be formed, for example, a 1000×1 target matrix.
  • In step S802, dimensionality reduction is performed on the target matrix to obtain first feature information, and dimensionality reduction is performed on the first output information at the current moment to obtain second feature information, where the dimensions of the first feature information and the second feature information are the same.
  • the target matrix and the first output information at the current moment may be subjected to dimensionality reduction processing to obtain first feature information and second feature information having the same dimensions, respectively.
  • Specifically, the target matrix and the first output information at the current moment can each be multiplied by a 512-dimensional weight matrix, so that their dimensions are reduced from 1000 to 512, which greatly improves processing efficiency.
  • In step S803, the first feature information and the second feature information are added based on the attention mechanism to obtain third feature information.
  • Based on the attention mechanism, the first feature information and the second feature information can be processed accordingly; specifically, Attref can add the first feature information and the second feature information. Other specific processing methods can also be used to combine the feature information, which is not specifically limited in the embodiments of the present disclosure.
  • In step S804, weight processing and normalization are performed on the third feature information to obtain a second weight matrix.
  • Specifically, the third feature information can be multiplied by the reflective attention weight Wr to obtain a feature matrix. The amount of information contained in the feature matrix is the same as in the target matrix, i.e., the number of pieces of second output information, which is t. Softmax processing, i.e., normalization, can then be performed on the feature matrix to calculate the ratio of each piece of information to all the information; the second weight matrix can be determined from the ratio corresponding to each piece of second output information.
  • In step S805, the first feature information and the second weight matrix are multiplied and summed to obtain the third output information.
  • The first feature information determined from all the second output information can be multiplied by the second weight matrix and summed to obtain the third output information, as shown in the right column of Figure 9.
  • The third output information may be multiplied by the third weight matrix Ws to obtain the word vector corresponding to the target region feature, which is St as shown in Figure 7. It is worth noting that the word vector St output at time t is the input vector Ot+1 at time t+1.
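  • A minimal sketch of this reflective attention flow (steps S801–S805) is given below, assuming PyTorch; the dimensions and layer names are illustrative assumptions, not the patent's reference implementation.

```python
# Illustrative sketch of the reflective attention module (RAM): the current first
# output information attends over the second output information of all moments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReflectiveAttentionModule(nn.Module):
    def __init__(self, hidden_dim=1000, att_dim=512, out_dim=1000):
        super().__init__()
        self.reduce_h2 = nn.Linear(hidden_dim, att_dim)  # S802: reduce the target matrix from 1000 to 512
        self.reduce_h1 = nn.Linear(hidden_dim, att_dim)  # S802: reduce the current h1 from 1000 to 512
        self.w_r = nn.Linear(att_dim, 1)                 # S804: reflective attention weight W_r
        self.w_s = nn.Linear(att_dim, out_dim)           # third weight matrix W_s -> word vector S_t

    def forward(self, h2_history, h1_t):
        # h2_history: (B, t, hidden_dim) second output information at past and current moments (S801)
        # h1_t:       (B, hidden_dim)    first output information at the current moment
        first_info = self.reduce_h2(h2_history)              # (B, t, att_dim)  first feature information
        second_info = self.reduce_h1(h1_t).unsqueeze(1)       # (B, 1, att_dim)  second feature information
        third_info = torch.tanh(first_info + second_info)     # S803: additive combination
        weights = F.softmax(self.w_r(third_info), dim=1)      # S804: second weight matrix (softmax over t)
        third_output = (weights * first_info).sum(dim=1)      # S805: weighted sum -> third output information
        word_vector = self.w_s(third_output)                  # multiply by W_s to get the word vector S_t
        return third_output, word_vector
```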
  • The third output information can also be input to the reflective position module at the same time, and the reflective position module can predict, based on the third output information, the relative position in the text description of the word vector output at the current moment.
  • The reflective position module includes a fully connected layer and a compression layer. After the third output information is input to the reflective position module, it first passes through the fully connected layer, which converts the 512×1-dimensional vector into a 1×1-dimensional vector; the vector output by the fully connected layer is then compressed by the compression layer according to the corresponding compression function to obtain a relative position.
  • the output of the compression layer is a number between 0 and 1, which represents the position of the word vector in the text description.
  • For example, if the text description is a sentence containing 10 words and the number output by the compression layer is 0.6, then the position in the sentence of the word vector St output by the t-th reflection decoding sub-network is the 6th position.
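  • A minimal sketch of the reflective position module is given below, assuming PyTorch and a sigmoid as the compression function (an assumption; the text above only states that the compression layer outputs a number between 0 and 1).

```python
# Illustrative sketch of the reflective position module (RPM): a fully connected
# layer followed by a "compression" of the scalar into the (0, 1) interval.
import torch
import torch.nn as nn

class ReflectivePositionModule(nn.Module):
    def __init__(self, att_dim=512):
        super().__init__()
        self.fc = nn.Linear(att_dim, 1)   # 512-dimensional third output information -> scalar

    def forward(self, third_output):
        # Returns the predicted relative position of the current word in the description.
        return torch.sigmoid(self.fc(third_output)).squeeze(-1)

# Example: for a 10-word description, a predicted relative position of 0.6
# corresponds to roughly the 6th word (0.6 * 10), as in the example above.
```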
  • The plurality of sequentially arranged reflection decoding sub-networks in the reflection decoding network model are used to decode the region features corresponding to the image regions in the first image feature, and generation stops when the sentence-ending punctuation is encountered. After the word vectors {S1, S2, ..., ST} corresponding to the image regions are obtained, these word vectors can be concatenated in sequence to form the text description corresponding to the input image.
  • Before the reflection decoding network model is used to perform word prediction on the first image feature to generate a text description, the reflection decoding network model needs to be trained. Specifically, an image sample and a text description sample corresponding to the image sample are first obtained; the image sample is then input into the reflection decoding network model to be trained to generate a corresponding text description, and the model parameters are adjusted according to the matching degree between the generated text description and the corresponding text description sample until the loss function of the reflection decoding network model to be trained is minimized.
  • the loss function of the reflection decoding network model includes two parts: a cross-entropy loss function and a position-aware loss function.
  • The cross-entropy loss function measures the probability that the text description generated for the image sample by the reflection decoding network to be trained is correct; the position-aware loss function measures the distance, within the text description sample, between the real position and the predicted position of the word vector output at the current moment by the reflection decoding network model to be trained.
  • In order to minimize the loss function of the reflection decoding network model, the probability measured by the cross-entropy loss function must be maximized and the position-aware loss function must be minimized.
  • the cross-entropy loss function can be determined according to formula (1), specifically:
  • I is the input image
  • the parameters of the reflection decoding network model include the weight matrices such as We, Ws, and Wr in the above embodiment
  • S is the correct text description, of variable length, corresponding to the input image, and can represent any sentence.
  • The chain rule can be used to model the joint probability distribution over the word vectors S1, S2, ..., ST that compose the sentence. Furthermore, based on formula (1), the cross-entropy loss function Lxe can be determined as shown in formula (2):
  • N is the number of words contained in the generated text description
  • St represents the word vector generated at time t.
  • (S, I) is the training image sentence pair, and the sum of logarithmic probability in formula (2) can be optimized by stochastic gradient descent (SGD).
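  • As a point of reference only, a standard maximum-likelihood formulation consistent with the surrounding description — an assumption, not a quotation of formulas (1) and (2) — can be written as:

```latex
% Hedged reconstruction of formulas (1)-(2), assuming the usual captioning objective:
% the model parameters maximize the log-probability of the correct description S given image I,
\theta^{*} = \arg\max_{\theta} \sum_{(S, I)} \log p(S \mid I; \theta) \qquad (1)
% and, by the chain rule over the word vectors S_1, ..., S_N,
L_{xe}(\theta) = \sum_{t=1}^{N} \log p\bigl(S_t \mid S_1, \ldots, S_{t-1}, I; \theta\bigr) \qquad (2)
```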
  • The position-aware loss can be determined by the reflective position module.
  • FIG. 10 shows a schematic diagram of the process of determining the position-aware loss by the reflective position module. As shown in FIG. 10, the fully connected layer fully connects the third output information output by the reflective attention module to generate fully connected information, which can be a 1×1 vector; the fully connected information is then compressed according to the preset compression function corresponding to the compression layer to obtain the predicted position of the word vector corresponding to the third output information, that is, the predicted relative position of the word vector in the text description.
  • The position-aware loss is determined from the predicted position and the real position, in the text description sample, of the word vector corresponding to the third output information. The real position of a word in the sentence can be obtained from the number of words contained in the text description sample and the position, within the text description sample, of the word corresponding to the target region feature. From the real position and the predicted relative position, the position-aware loss Lpos can be determined; the specific calculation is shown in formula (3):
  • the size of the loss function corresponding to the reflection decoding network model can be determined according to formula (4), which is specifically as follows:
  • the parameter ⁇ is used to balance the role of the loss function in the optimization process of the entire reflection decoding network model, and it can be set according to actual needs, which is not specifically limited in the embodiment of the present disclosure.
  • A visually impaired person can wear a smart device, which can be smart glasses, a portable smart camera, and so on. While the visually impaired person is moving, the device can capture real-time images of the road ahead, and the images are then analyzed by the image description apparatus in the smart device to obtain the corresponding text description. The text description can be output through a corresponding voice output device so that the visually impaired person keeps abreast of road conditions and avoids obstacles. For example, a visually impaired person walks to an intersection just as the red light turns on.
  • the image acquisition unit of the smart device can acquire images containing signal lights, zebra crossings, and vehicle traffic conditions.
  • The reflection decoding sub-networks in the reflection decoding network model are used to predict text for the signal lights, zebra crossing, and vehicles in the image. For example, the words "signal light, red light" can be output for the signal light, and information such as "zebra crossing, vehicles, no pedestrians" can be output for the zebra crossing. Finally, based on the word vectors corresponding to the image regions, a text description can be generated: "The signal light is red, there are vehicles on the zebra crossing, and pedestrians cannot pass." The text description can be sent to the visually impaired person in real time to remind them to wait until the green light turns on before crossing.
  • The image processing unit can segment the picture and encode the objects in each image region to obtain the first image feature; all pixels in the first image feature are averaged, and the pixel values of all pixels are replaced with the pixel average to form the second image feature. The first image feature, the second image feature, and the starting word vector are then input to the reflection decoding network model, which generates the word at the current moment according to the context vector and predicts the relative position of that word in the sentence. For example, the reflection decoding network model can generate word vectors in turn, such as "a", "lamb", "hillside", "grazing", from which the final text description can be obtained: "A lamb is grazing on the hillside."
  • The text description can then be played through the voice output unit to help children understand the content of the picture and increase their cognition of things.
  • The image processing method in the present disclosure decodes, through the reflection decoding network model, the first image feature encoded by the coding network model; it matches the hidden state at the current moment against the hidden states at past moments through the reflective attention module and obtains a context vector used to generate the word vector at the current moment, and it predicts the relative position of the current word vector in the text description through the reflective position module. This enhances the correlation and temporal logic between the earlier and later parts of the sentence, further improves the decoding ability of the language model, ensures the stability of the model's performance for longer or more complex sentences, and generates a more natural and accurate textual description of the image.
  • Although the embodiments of the present disclosure mainly focus on the decoding input part of the long short-term memory sequence modules, which is improved by introducing the reflective attention module and the reflective position module, other techniques such as reinforcement learning, graph convolutional neural networks, and generative adversarial networks can also be improved by using the reflective attention module and the reflective position module of the present disclosure to further improve the quality of image description generation.
  • Fig. 11 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
  • the image processing apparatus 1100 includes: a feature extraction module 1101, a feature conversion module 1102, and a description generation module 1103.
  • The feature extraction module 1101 is used to obtain an input image and encode the objects contained in each image region of the input image to obtain the first image feature; the feature conversion module 1102 is used to process the pixels in the first image feature according to preset rules and determine the second image feature according to the processed pixels; the description generation module 1103 is configured to, based on the second image feature and the starting word vector, decode at different moments the region features corresponding to the image regions in the first image feature to obtain a word vector corresponding to each image region, and to form a text description corresponding to the input image according to the word vectors, where the starting word vector is the start mark of the text description.
  • The feature extraction module 1101 is configured to: divide the input image to form a plurality of image regions; perform feature extraction on the objects in each image region through the coding network model to obtain the region feature corresponding to that image region; and form the first image feature according to the region features.
  • The feature conversion module 1102 is configured to obtain the pixel average of all pixels in the first image feature, and use the pixel average as the pixel value of each pixel to form the second image feature.
  • The description generation module 1103 is configured to, through the reflection decoding network model and based on the second image feature and the starting word vector, decode at different moments the region feature corresponding to each image region in the first image feature to obtain the word vector corresponding to each image region.
  • the reflection decoding network model includes a plurality of reflection decoding sub-networks arranged in sequence; the description generation module 1103 is configured to: output the second image feature and the Mth reflection decoding sub-network The word vector of is input to the M+1th reflection decoding sub-network; through the M+1th reflection decoding sub-network, the target region feature in the first image feature is decoded to obtain the corresponding target region feature Word vector; where M is a positive integer.
  • the description generation module 1103 is configured to: input the second image feature and the starting word vector to a first reflection decoding sub-network, and use the first reflection decoding sub-network The target area feature in the first image feature is decoded to obtain a word vector corresponding to the target area feature.
  • the reflection decoding sub-network includes a visual attention module, a reflection type attention module, and a reflection type position module, wherein the reflection type position module is used to predict the output of the reflection decoding sub-network at the current moment The relative position of the word vector in the text description.
  • the visual attention module includes a first long short-term memory network, a second long short-term memory network, and an attention mechanism network;
  • The image processing device 1100 is configured to: multiply the word vector output by the reflection decoding sub-network by the first weight matrix to obtain the target word vector; perform feature extraction on the second image feature and the target word vector through the first long short-term memory network to obtain the first output information; input the first output information and the first image feature to the attention mechanism network for visual matching to obtain the target region feature; and perform feature extraction on the first output information and the target region feature through the second long short-term memory network to obtain the second output information.
  • The image processing device 1100 further includes a word vector generation module, configured to use the reflective attention module to determine, based on the second output information at past moments and the first output information and second output information at the current moment, the third output information corresponding to the target region feature at the current moment.
  • the word vector generation module is configured to: determine a target matrix according to the second output information of all the past moments and the second output information of the current moment; and reduce the target matrix. Dimension processing to obtain first feature information, and perform dimensionality reduction processing on the first output information at the current moment to obtain second feature information, wherein the dimensions of the first feature information and the second feature information are the same ; Based on the attention mechanism, the first feature information and the second feature information are added to obtain third feature information; the third feature information is weighted and normalized to obtain the second weight Matrix; the first feature information and the second weight matrix are multiplied and summed to obtain the third output information.
  • the description generation module 1103 is configured to: multiply the third output information by a third weight matrix to obtain a word vector corresponding to the target region feature.
  • The image processing device 1100 further includes: a sample acquisition module for acquiring an image sample and a text description sample corresponding to the image sample; and a model training module for training the reflective decoding network model to be trained based on the image sample and the text description sample until the loss function corresponding to the reflective decoding network model to be trained is minimized, where the loss function includes a cross-entropy loss function and a position-aware loss function.
  • The cross-entropy loss function is the probability that the text description generated for the image sample by the reflective decoding network to be trained is correct; the position-aware loss function is the distance, within the text description sample, between the real position and the predicted position of the word vector output at the current moment by the reflective decoding network to be trained.
  • The position-aware loss corresponding to the position-aware loss function is determined by the reflective position module. The image processing device 1100 is configured to: fully connect, through a fully connected layer, the feature output by the reflective attention module to generate fully connected information; compress the fully connected information according to a preset compression function to obtain the predicted position information of the word vector corresponding to the feature output by the reflective attention module; and determine the position-aware loss according to the predicted position information and the real position information, in the text description sample, of the word vector corresponding to the feature output by the reflective attention module.
  • Fig. 12 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure.
  • The computer system 1200 includes a central processing unit (CPU) 1201, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage part 1208 into a random access memory (RAM) 1203, so as to implement the image processing method described in the foregoing embodiments.
  • Various programs and data required for system operation are also stored in the RAM 1203.
  • the CPU 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204.
  • An input/output (Input/Output, I/O) interface 1205 is also connected to the bus 1204.
  • The following components are connected to the I/O interface 1205: an input part 1206 including a keyboard, a mouse, etc.; an output part 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), speakers, etc.; a storage part 1208 including a hard disk, etc.; and a communication part 1209 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like. The communication part 1209 performs communication processing via a network such as the Internet.
  • A drive 1210 is also connected to the I/O interface 1205 as needed.
  • a removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1210 as needed, so that the computer program read therefrom is installed into the storage portion 1208 as needed.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication part 1209, and/or installed from the removable medium 1211.
  • the computer-readable medium shown in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of the code, and the above-mentioned module, program segment, or part of the code contains one or more executable instructions for realizing the specified logical function.
  • It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram or flowchart, and the combination of blocks in the block diagram or flowchart can be implemented by a dedicated hardware-based system that performs the specified function or operation, or can be implemented by It is realized by a combination of dedicated hardware and computer instructions.
  • the units described in the embodiments of the present disclosure may be implemented in software or hardware, and the described units may also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.
  • the present disclosure also provides a computer-readable medium.
  • the computer-readable medium may be included in the image processing apparatus described in the above-mentioned embodiments; or it may exist alone without being integrated into the electronic device.
  • the foregoing computer-readable medium carries one or more programs, and when the foregoing one or more programs are executed by an electronic device, the electronic device realizes the method described in the foregoing embodiment.
  • Although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • The example embodiments described here can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on the network, and includes several instructions to make a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) execute the method according to the embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

本公开提供一种图像处理方法、装置及电子设备,涉及人工智能领域。该方法包括:获取输入图像,提取输入图像中各图像区域的区域特征,以获取第一图像特征;根据预设规则对第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;基于第二图像特征和针对所述输入图像已确定的至少一个词向量,确定与第一图像特征中各图像区域对应的区域特征对应的词向量,预测所述词向量在文本描述中的位置,并根据词向量和所述位置形成与输入图像对应的文本描述。

Description

图像处理方法、装置及电子设备
本申请要求于2019年09月16日提交中国专利局、申请号为201910872478.6、发明名称为“图像描述生成方法、装置及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及人工智能技术领域,具体而言,涉及一种图像处理方法、图像描述生成装置及电子设备。
发明背景
图像生成描述是为一张图片生成能表达其含义的自然语言描述的分析研究,具有广泛的应用前景。比如,通过对一张图片生成文本描述,可以帮助视障人士快速准确地理解图像内容;在幼教领域中对少儿图片生成直观准确地描述,可以给予幼儿更好的启蒙学习等等。
启发于神经网络在图像识别与机器翻译中的成功应用,许多现有方法都在神经网络模型的基础上去生成图像文本描述。目前,图像生成描述主要是使用卷积神经网络将图像编码用一个固定向量表达,然后直接使用循环神经网络将其解码成一个描述内容的句子。但是现有的解码模型较为简单,导致模型在句子较长或句式结构较为复杂时的效果明显下降。
需要说明的是,在上述背景技术部分公开的信息仅用于加强对本公开的背景的理解,因此可以包括不构成对本领域普通技术人员已知的现有技术的信息。
发明内容
本公开的实施例提供了一种图像处理方法、图像处理装置及电子设备,进而至少在一定程度上可以准确有效地提取图像中包含的自然语言信息,并生成更为准确、流畅的文本描述。
本公开的其他特性和优点将通过下面的详细描述变得显然,或部分地通过本公开的实践而习得。
根据本公开实施例的一个方面,提供了一种图像处理方法,包括:获取输入图像,对所述输入图像中各图像区域所包含的对象进行编码,以获取第一图像特征;根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第 二图像特征;基于所述第二图像特征和起始词向量,在不同时刻对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,以获取与各所述图像区域对应的词向量,并根据所述词向量形成与所述输入图像对应的文本描述,其中所述起始词向量为所述文本描述的起始标记。
根据本公开实施例的一个方面,提供了一种图像处理装置,包括:特征提取模块,用于获取输入图像,对所述输入图像中各图像区域所包含的对象进行编码,以获取第一图像特征;特征转换模块,用于根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;描述生成模块,用于基于所述第二图像特征和起始词向量,在不同时刻对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,以获取与各所述图像区域对应的词向量,并根据所述词向量形成与所述输入图像对应的文本描述,其中所述起始词向量为所述文本描述的起始标记。
本公开的技术方案通过解码网络模型对输入图像对应的图像特征进行解码,一方面能够更准确有效地提取输入图像中包含的自然语言信息;另一方面能够使解码网络模型在句子较长或句式结构较复杂时也能适用,提高了文本描述的准确性和流畅度。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本公开。
附图简要说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。显而易见地,下面描述中的附图仅仅是本公开的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。在附图中:
图1为可以应用本公开实施例的技术方案的示例性系统架构的示意图;
图2为相关技术中图像处理方法的流程示意图;
图3为根据本公开的一个实施例的图像处理方法的流程示意图;
图4为根据本公开的一个实施例的反射解码网络模型的结构示意图;
图5为根据本公开的一个实施例的视觉注意力模块的结构示意图;
图6为根据本公开的一个实施例的视觉注意力模块的处理流程示意图;
图7为根据本公开的一个实施例的图像处理的流程示意图;
图8为根据本公开的一个实施例的反射式注意模块的处理流程示意图;
图9为根据本公开的一个实施例的反射式注意模块的结构示意图;
图10为根据本公开的一个实施例的反射位置模块确定位置感知损失的流程示意图;
图11为根据本公开的一个实施例的图像处理装置的框图;
图12为适于用来实现本公开实施例的图像处理装置的计算机系统的结构示意图。
实施本发明的方式
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本公开将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本公开的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本公开的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本公开的各方面。
附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
图1示出了可以应用本公开实施例的技术方案的示例性系统架构的示意图。
如图1所示,系统架构100可以包括终端设备101、网络102和服务器103。网络102用以在终端设备101和服务器103之间提供通信链路的介质。网络102可以包括各种连接类型,例如有线通信链路、无线通信链路等等。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实 际需要,可以具有任意数目的终端设备、网络和服务器。比如服务器103可以是多个服务器组成的服务器集群等。
在本公开的一个实施例中,终端设备101将图像通过网络102发送至服务器103,服务器103获取输入图像后,首先可以对输入图像进行划分形成多个图像区域,并通过编码网络模型对各图像区域中的对象进行特征提取,以获取与各图像区域对应的区域特征,进而根据与各图像区域对应的区域特征获得与输入图像对应的第一图像特征;接着根据预设规则对第一图像特征中的像素进行处理,并根据处理后的像素确定与输入图像对应的第二图像特征;然后将第一图像特征、第二图像特征及起始词向量输入反射解码网络模型,通过反射解码网络模型对第一图像特征进行解码,以获取与各图像区域对应的词向量,进而根据各图像区域对应的词向量形成与输入图像对应的文本描述。本公开实施例的技术方案能够保证模型在句子较长或句式结构较为复杂时的性能,进而能够更准确有效地提取图像中包含的自然语言信息,并生成更准确、流畅的文本描述。
需要说明的是,本公开实施例所提供的图像处理方法一般由服务器执行,相应地,图像处理装置一般设置于服务器中。但是,在本公开的其它实施例中,也可以由终端设备执行本公开实施例所提供的图像处理方法。
在本领域的相关技术中,主要通过编解码框架生成图像的文本描述,图2示出了相关技术中图像处理方法的流程示意图,如图2所示,将图像201输入至编码网络模型202,该编码网络模型202包括Faster R-CNN网络和ResNet-101网络,通过Faster R-CNN网络对输入图像提取特征能够获得输入图像中各对象对应的局部特征信息,通过ResNet-101网络对输入图像提取特征能够获取输入图像对应的全局特征信息;接着将局部特征信息和全局特征信息输入至解码网络模型203,该解码网络模型203包括多个重复的网络结构,该网络结构为基于注意力的循环神经网络,具体地,将该全局特征信息输入至第一层LSTM,通过第一层LSTM对全局特征信息进行特征提取,以输出第一隐藏状态;接着该第一隐藏状态和局部特征信息输入至注意力机制网络层,通过注意力机制网络层可以输出一混合特征;然后通过第二层LSTM对该混合特征和第一隐藏状态共同进行处理,以输出第二隐藏状态;最后对第二隐藏状态进行softmax处理,以获得预测的词向量。
虽然图2所示的图像描述生成算法可以取得较好的效果,但是仍然具有局限性。 具体地,提升模型效果的方式只能通过提取更具有代表性的细粒度分隔到单个物体层面的图像特征,而忽略了对语言模型本身的关注。解码网络模型较为简单,导致模型在句子较长或句式结构较为复杂时的效果明显下降。
本公开实施例提供了一种图像处理方法,该图像处理方法涉及人工智能领域,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)是一门研究如何使机器“看”的技术,更进一步的说,就是指用摄影机和电脑对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习(Learning from instruction)等技术。
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
本公开实施例提供的方案涉及人工智能的图像语义理解技术,具体通过如下实施例进行说明:
本公开实施例首先提出了一种图像处理方法,该图像处理方法可以应用在如儿童早期教育、图像检索及盲人导航等领域,以下对本公开实施例的技术方案的实现细节进行详细阐述:
图3为根据本公开的一个实施例的图像处理方法的流程图。该图像处理方法可以由一个或多个计算设备来执行,该一个或多个计算设备可以是图1中所示的终端设备101和/或服务器103。参照图3所示,该图像处理方法至少包括步骤S310至步骤S330。
在步骤S310中,获取输入图像,对所述输入图像中各图像区域所包含的对象进行编码,以获取第一图像特征。
在本公开的一个实施例中,该输入图像可以是从网络上下载的图像,也可以是存储于终端设备101本地的图像,还可以是用户通过拍摄装置,如照相机、摄像机、智能手机等具有拍摄单元的终端,获取的图像,等等,在确定需要生成文本描述的图像后,可以通过终端设备101将其发送至服务器103。进一步地,该终端设备101可以是任意的具有显示屏幕的终端设备,如智能手机、笔记本电脑、台式机等等,本公开实施例对此不做具体限定。
在本公开的一个实施例中,接收到输入图像后,可以对该输入图像进行划分,以形成多个图像区域,其中划分输入图像的方法可以是根据像素数量进行划分,也可以是根据图像中的不同对象进行划分等等。在对输入图像划分形成多个图像区域后,可以对各图像区域中的对象进行编码,也就是特征提取,例如一幅图像所呈现的场景是一个小孩在院子里拍皮球,那么该图像中的对象就是小孩、皮球和草地,至于图像中如天空、小鸟等背景都可以忽略,不用针对背景进行特征提取。在对各图像区域中的对象进行编码时,可以采用诸如Faster R-CNN、ResNet、VGG等网络结构作为编码网络模型,通过该编码网络模型对各图像区域中的对象进行特征提取, 以获取与各图像区域对应的区域特征,该区域特征实质为与图像区域对应的固定向量表达。进一步地,根据各图像区域对应的区域特征可以获取与输入图像对应的第一图像特征。
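To make the encoding step concrete, the following Python/PyTorch sketch is added for illustration only: the patent names Faster R-CNN, ResNet and VGG as possible encoding networks, whereas the tiny convolutional backbone, the 1000-dimensional feature size and all identifiers below are assumptions of this sketch, not the patent's implementation.

import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    # Stand-in for the encoding network; a real system would use Faster R-CNN / ResNet / VGG.
    def __init__(self, feature_dim=1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # one pooled vector per region crop
        )
        self.proj = nn.Linear(128, feature_dim)      # region feature r_i

    def forward(self, region_crops):                 # (k, 3, H, W): k regions cut from the input image
        x = self.backbone(region_crops).flatten(1)   # (k, 128)
        return self.proj(x)                          # (k, feature_dim): first image feature {r_i}

regions = torch.randn(36, 3, 64, 64)                 # e.g. 36 hypothetical region crops
first_image_feature = RegionEncoder()(regions)       # shape (36, 1000)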
在步骤S320中,根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征。
在本公开的一个实施例中,在对输入图像中各图像区域进行特征提取获取第一图像特征后,可以根据第一图像特征中各像素的像素值确定第二图像特征中的像素值。具体地,可以计算第一图像特征中所有像素的像素均值,并将该像素均值作为第二图像特征中每个像素的像素值。该第二图像特征可以作为输入特征输入至反射解码网络模型中,以使反射解码网络模型根据第二图像特征和起始词向量对第一图像特征进行解码,预测与第一图像特征中各图像区域对应的词向量。值得说明的是,本公开实施例中的起始词向量可以是任意的不具有实质语义的字符,例如可以是一个起始标记符,如#,也可以是一个起始标记词,如BN,等等,本公开实施例对此不做具体限定。
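The preset rule above can be read literally as filling the second image feature with the mean of all pixels of the first image feature. Below is a minimal sketch under that reading; a per-dimension mean over regions would be an equally plausible variant, and the tensor shapes are assumptions.

import torch

first_image_feature = torch.randn(36, 1000)                 # {r_i}: one row per image region
pixel_mean = first_image_feature.mean()                      # mean over all "pixels" (elements)
second_image_feature = torch.full_like(first_image_feature[0], pixel_mean.item())
# second_image_feature: a 1000-d vector whose entries all equal the pixel mean; it is fed to the
# decoder together with the start word vector.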
在步骤S330中,基于所述第二图像特征和起始词向量,对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,以获取与各所述图像区域对应的词向量,并根据所述词向量形成与所述输入图像对应的文本描述,其中所述起始词向量为所述文本描述的起始标记。
步骤S330中,可以在不同时刻对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,并利用先前已解码的区域特征对当前的区域特征进行解码。
在本公开的一个实施例中,在获取第二图像特征后,将该第二图像特征作为输入特征输入至反射解码网络模型中,同时还可以将起始词向量输入至该反射解码网络模型中,以使其在不同时刻对第一图像特征中与各图像区域对应的区域特征进行解码,获取与各图像区域对应的词向量。
图4示出了反射解码网络模型的结构示意图,如图4所示,反射解码网络模型包括多个依次排列的反射解码子网络,其中各反射解码子网络在不同时刻分别对第一图像特征中与各图像区域对应的区域特征进行解码,以获取与各图像区域对应的词向量。对于第一反射解码子网络而言,可以将第二图像特征和起始词向量作为输入特征进行输入,通过第一反射解码子网络基于第二图像特征和起始词向量对第一 图像特征中的目标区域特征进行解码,以获取与目标区域特征对应的词向量;对于第M+1反射解码子网络而言,可以将第二图像特征和第M反射解码子网络输出的词向量输入至第M+1反射解码子网络,通过第M+1反射解码子网络对第一图像特征中的目标区域特征进行解码,以获取与目标区域特征对应的词向量,其中M为正整数。
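The chaining of sub-networks can be pictured with the loop below. This is a rough sketch: decode_step is a hypothetical callable standing in for one reflective decoding sub-network, and the stopping rule follows the sentence-final punctuation described later in the text.

def generate_caption(decode_step, first_feat, second_feat, start_vec, end_token, max_len=20):
    # decode_step(first_feat, second_feat, prev_word_vec) -> (word_vec, word_id)   [assumed interface]
    words, prev = [], start_vec
    for _ in range(max_len):
        word_vec, word_id = decode_step(first_feat, second_feat, prev)
        if word_id == end_token:          # stop once sentence-final punctuation is produced
            break
        words.append(word_id)
        prev = word_vec                   # the word vector from sub-network M feeds sub-network M+1
    return words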
本公开的实施例的方法可以包括:
获取输入图像,提取所述输入图像中各图像区域的区域特征,以获取第一图像特征;
根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;
基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,在不同时刻,确定与所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,预测所述词向量在文本描述中的位置,并根据所述词向量和所述位置形成与所述输入图像对应的文本描述。
在本公开的一个实施例中,各反射解码子网络的结构相同,都包含视觉注意力模块、反射式注意模块RAM(Reflective Attention Module)和反射式位置模块RPM(Reflective Position Module)三部分。视觉注意力模块主要关注编码网络模型的视觉特征。反射式注意模块在视觉注意力模块的输出信息的基础上,利用文本注意力机制建模当前时刻和过去时刻该视觉注意力模块输出信息的匹配程度,得到上下文向量,以生成当前时刻的词语,从而能够捕捉到更多综合的历史词汇信息。反射式位置模块能够引入生成的文本描述中每个单词的相对位置信息,在反射解码网络模型预测词汇的同时,预测当前词汇在文本描述中的相对位置,从而帮助反射解码网络模型感知句子的句法结构。
图5示出了视觉注意力模块的结构示意图,如图5所示,视觉注意力模块500包括第一长短期记忆网络(LSTM-1)501、第二长短期记忆网络(LSTM-2)502和注意力机制网络(Attvis)503,其中第一长短期记忆网络501用于根据第二图像特征和前一时刻获得的词向量进行特征提取,第二长短期记忆网络502用于根据第一长短期记忆网络501的输出信息和注意力机制网络503的输出信息进行特征提取,注意力机制网络503用于根据第一图像特征和第一长短期记忆网络501的输出信息 进行特征提取。
进一步地,图6示出了视觉注意力模块的处理流程示意图,为了便于理解,本公开实施例中以第t个反射解码子网络中的视觉注意力模块的处理流程为例进行说明,如图6所示,视觉注意力模块的处理流程至少包括步骤S601-S604,具体为:
在步骤S601中,将前一时刻反射解码子网络输出的词向量与第一权重矩阵相乘,以获取目标词向量。
在本公开的一个实施例中，图7示出了图像处理的流程示意图。如图7所示，对LSTM-1而言，根据第一图像特征确定的第二图像特征和前一时刻反射解码子网络输出的词向量为LSTM-1的输入特征。为了保证输入的词向量的维度与LSTM-1处理的数据维度相同，可以对各反射解码子网络的输入词向量特征进行维度调整，具体可以将输入的词向量特征Ot(t=1,…,T)与第一权重矩阵We相乘，以获取目标词向量，实现输入特征Ot维度的改变。值得注意的是，该第一权重矩阵We对每个输入特征Ot是共用的，因此在模型训练的时候针对第一权重矩阵We只需要训练一个参数即可。
在步骤S602中,通过第一长短期记忆网络对第二图像特征和目标词向量进行特征提取,以获取第一输出信息。
在本公开的一个实施例中，将第二图像特征和目标词向量输入至LSTM-1后，LSTM-1对目标词向量和第二图像特征进行处理，以输出第一输出信息，该第一输出信息实质上是LSTM-1在当前时刻输出的隐藏状态（Hidden state），即图7中所示的第一隐藏状态。
在步骤S603中,将第一输出信息和第一图像特征输入至注意力机制网络进行视觉匹配,以获取目标区域特征。
在本公开的一个实施例中，注意力机制类似于人类视觉，可以选择性地关注所有信息的一部分，同时忽略其它可见的信息。在采用反射解码网络模型进行解码之前，可以通过Faster R-CNN等卷积神经网络对输入图像进行特征提取，以获得第一图像特征{ri}(i=1,…,k)，然后当获取LSTM-1输出的第一输出信息后，可以将第一输出信息和第一图像特征同时输入至注意力机制网络，通过注意力机制网络Attvis对第一输出信息和第一图像特征进行视觉匹配，以确定第一图像特征中各区域特征与第一输出信息之间的匹配度，最后将匹配度最高的区域特征作为目标区域特征从注意力机制网络输出（如图7所示）。
在步骤S604中,通过第二长短期记忆网络对第一输出信息和目标区域特征进行特征提取,以获取第二输出信息。
在本公开的一个实施例中，获得目标区域特征后，该目标区域特征和第一输出信息将作为输入特征输入至LSTM-2，LSTM-2可以对第一输出信息和目标区域特征进行特征提取，以获取与目标区域特征对应的第二输出信息，该第二输出信息即为LSTM-2在当前时刻输出的隐藏状态（如图7所示）。
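Steps S601–S604 can be summarised in one module. The sketch below uses standard soft attention (a weighted sum over region features) as one common realisation of the visual matching step, whereas the text above speaks of selecting the best-matching region; the 1000-d hidden size, the 512-d attention space and every identifier are assumptions, not the patent's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttentionModule(nn.Module):
    def __init__(self, vocab_size, embed_dim=1000, feat_dim=1000, hidden_dim=1000, att_dim=512):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, embed_dim)      # first weight matrix, shared over all steps
        self.lstm1 = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.att_r = nn.Linear(feat_dim, att_dim)           # Att_vis: project region features
        self.att_h = nn.Linear(hidden_dim, att_dim)         # Att_vis: project the first hidden state
        self.att_v = nn.Linear(att_dim, 1)
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)

    def forward(self, prev_word, second_feat, first_feat, state1, state2):
        # prev_word: (B,) word indices; second_feat: (B, feat_dim); first_feat: (B, k, feat_dim)
        w = self.W_e(prev_word)                                             # S601: target word vector
        h1, c1 = self.lstm1(torch.cat([second_feat, w], dim=1), state1)     # S602: first output information
        scores = self.att_v(torch.tanh(self.att_r(first_feat) + self.att_h(h1).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                                    # S603: visual matching weights
        target_region = (alpha * first_feat).sum(dim=1)                     # S603: target region feature
        h2, c2 = self.lstm2(torch.cat([target_region, h1], dim=1), state2)  # S604: second output information
        return h1, h2, (h1, c1), (h2, c2)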
值得说明的是,还可以采用其它的循环神经网络替换本公开实施例中的LSTM,并且进一步地,可以采用不同类型的循环神经网络替换本公开实施例中的LSTM-1和LSTM-2,但是由于长短期记忆网络(LSTM,Long Short-Term Memory),是一种时间递归神经网络,适合于处理和预测时间序列中间隔和延迟相对较长的重要事件,因此为了更精准的预测词汇、形成连贯的文本描述,本公开实施例中的图像处理方法主要采用LSTM进行词汇预测。
之后,一些实施例可以通过所述第一隐藏状态和所述第二隐藏状态确定所述目标区域特征对应的词向量。
在本公开的一个实施例中,在句子较长或句式结构较为复杂时,为了提升解码效果,本公开实施例首先提出了采用反射式注意模块利用文本注意力机制对当前时刻的隐藏状态和过去时刻的隐藏状态进行匹配。如图7所示,对于第t个反射解码子网络中的反射式注意模块RAM而言,除了接收与其对应的LSTM-2输出的第二输出信息,还接收第1~(t-1)个反射解码子网络中的LSTM-2输出的第二输出信息及与其对应的LSTM-1输出的第一输出信息,以根据过去时刻的第二输出信息和当前时刻的第一输出信息及第二输出信息确定当前时刻与目标区域特征对应的第三输出信息。
图8示出了反射式注意模块的处理流程示意图,如图8所示,该处理流程至少包括步骤S801-S805,具体为:
在步骤S801中,根据所有过去时刻的第二输出信息和当前时刻的第二输出信息确定目标矩阵。
在本公开的一个实施例中，图9示出了反射式注意模块的结构示意图。如图9所示，左上角的柱体代表第二输出信息，根据过去时刻的第二输出信息和当前时刻的第二输出信息可以组成具有相应维度的目标矩阵，例如可以是1000×1的目标矩阵。
在步骤S802中,对目标矩阵进行降维处理,以获取第一特征信息,并对当前时刻的第一输出信息进行降维处理,以获取第二特征信息,其中第一特征信息和第二特征信息的维度相同。
在本公开的一个实施例中,为了提高计算效率,可以对目标矩阵和当前时刻的第一输出信息进行降维处理,以分别获取具有相同维度的第一特征信息和第二特征信息。如图9所示,目标矩阵、当前时刻的第一输出信息可以分别与一个512维的权重矩阵相乘,使得目标矩阵的维度和第一输出信息的维度均从1000维降低至512维,大大提高了处理效率。
在步骤S803中,基于注意力机制将第一特征信息和第二特征信息相加,以获取第三特征信息。
在本公开的一个实施例中，基于文本注意力机制，可以对第一特征信息和第二特征信息进行相应的处理，如图9所示的Attref，具体可以是将第一特征信息和第二特征信息相加，当然也可以采用其它的具体处理方式，本公开实施例对此不做具体限定。在将第一特征信息和第二特征信息相加后，即可获得融合了过去时刻的隐藏状态及当前时刻的隐藏状态的第三特征信息。
在步骤S804中,对第三特征信息进行加权处理和归一化处理,以获取第二权重矩阵。
在本公开的一个实施例中,获取第三特征信息后,可以将该第三特征信息与反射注意力权重Wr相乘,以获取一特征矩阵,该特征矩阵所包含的信息的数量与目标矩阵中第二输出信息的数量相同,都为t个;接着可以对特征矩阵进行softmax处理,即归一化处理,计算每一个信息相对于所有信息的比值,根据与每个第二输出信息对应的比值可以确定第二权重矩阵。
在步骤S805中,将第一特征信息与第二权重矩阵相乘并求和,以获取第三输出信息。
在本公开的一个实施例中，获取包含与所有第二输出信息对应的第二权重矩阵后，可以将根据所有第二输出信息确定的第一特征信息与该第二权重矩阵相乘并求和，以获取第三输出信息（如图9中所示的右侧柱体）。
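Steps S801–S805 amount to a small text-attention block over the stored hidden states. A hedged sketch follows; the 1000→512 dimensionality reduction mirrors the description above, while stacking the history into one tensor and the layer names are assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReflectiveAttentionModule(nn.Module):
    def __init__(self, hidden_dim=1000, att_dim=512):
        super().__init__()
        self.red_h2 = nn.Linear(hidden_dim, att_dim)   # reduces the target matrix (all h2 so far)
        self.red_h1 = nn.Linear(hidden_dim, att_dim)   # reduces the current first output information
        self.W_r = nn.Linear(att_dim, 1)               # reflective attention weights

    def forward(self, h2_history, h1_t):
        # h2_history: list of (B, hidden_dim) second output information, past steps plus current step
        target = torch.stack(h2_history, dim=1)        # S801: (B, t, hidden_dim) target matrix
        f1 = self.red_h2(target)                       # S802: first feature information  (B, t, 512)
        f2 = self.red_h1(h1_t).unsqueeze(1)            # S802: second feature information (B, 1, 512)
        f3 = f1 + f2                                   # S803: third feature information  (B, t, 512)
        weights = F.softmax(self.W_r(f3), dim=1)       # S804: second weight matrix       (B, t, 1)
        return (weights * f1).sum(dim=1)               # S805: third output information   (B, 512)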
在本公开的一个实施例中，获取反射式注意模块输出的第三输出信息后，可以将第三输出信息与第三权重矩阵Ws相乘，以获取与目标区域特征对应的词向量，如图7所示的St。值得说明的是，t时刻输出的词向量St为t+1时刻的输入向量Ot+1。
在本公开的一个实施例中，如图7所示，当反射式注意模块输出第三输出信息后，该第三输出信息同时可以输入至反射位置模块，该反射位置模块能够根据第三输出信息预测当前时刻输出的词向量在文本描述中的相对位置。具体地，反射位置模块中包含一全连接层和压缩层，第三输出信息输入至反射位置模块后，首先通过全连接层进行全连接，将512×1维的第三输出信息转换为1×1维的向量，接着通过压缩层根据相应的压缩函数对全连接层输出的向量进行压缩，以获取一相对位置。该压缩层输出的结果为一个介于0和1之间的数字，其代表了词向量在文本描述中的位置，例如文本描述是一句包含10个单词的句子，压缩层输出的数字是0.6，那么第t个反射解码子网络输出的词向量St在该句子中的位置为第6位。
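A minimal sketch of the reflective position module described above, assuming the squashing function is a sigmoid (any monotone map onto (0, 1) would fit the description); the class and argument names are illustrative only.

import torch
import torch.nn as nn

class ReflectivePositionModule(nn.Module):
    def __init__(self, in_dim=512):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)                       # fully connected layer: 512 -> 1

    def forward(self, third_output):                         # (B, 512) third output information
        return torch.sigmoid(self.fc(third_output)).squeeze(-1)   # relative position in (0, 1)

# e.g. an output of 0.6 for a 10-word caption places the current word at position 6.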
在本公开的一个实施例中,通过反射解码网络模型中多个依次排列的反射解码子网络对第一图像特征中与各图像区域对应的区域特征进行解码,当遇到句末标点后停止生成词向量,在获得与各图像区域对应的词向量{S1,S2,…,ST}后,可以将该些词向量依次串接,形成与输入图像对应的文本描述。
在本公开的一个实施例中,在使用反射解码网络模型对第一图像特征进行词汇预测以生成文本描述之前,还需要对反射解码网络模型进行训练。具体地,首先获取图像样本和与图像样本对应的文本描述样本,接着将图像样本输入至待训练的反射解码网络模型以生成相应地文本描述,根据生成的文本描述和对应的文本描述样本的匹配程度调节模型参数,直至待训练反射式解码网络模型的损失函数最小。在本公开实施例中,反射解码网络模型的损失函数包括交叉熵损失函数和位置感知损失函数两部分,其中交叉熵损失函数为待训练反射式解码网络生成的与图像样本对应的文本描述的正确概率;位置感知损失函数为当前时刻待训练反射式解码网络模型输出的词向量在文本描述样本中的真实位置和预测位置之间的距离。
在本公开的一个实施例中,要使反射解码网络模型的损失函数最小,必须保证交叉熵损失函数最大且位置感知损失函数最小,其中交叉熵损失函数可以根据公式 (1)确定,具体为:
θ* = argmax_θ ∑_{(I,S)} log p(S|I; θ)      (1)
其中,I是输入图像;θ是反射解码网络模型的参数,包括上述实施例中的We、Ws、Wr等权重矩阵;S是输入图像对应的正确的长度不固定的文本描述,可代表任何句子。
由于文本描述S中的任何一个词向量依赖与其相邻的前一词向量,因此可以应用链式法则对句子组成词向量S1,S2,…,ST上的联合概率分布做建模表示。进而基于公式(1)可以确定交叉熵损失函数Lxe如公式(2)所示:
L_xe = ∑_{t=1}^{N} log p(S_t | I, S_1, …, S_{t-1}; θ)      (2)
其中,N是生成的文本描述所包含的词汇数,St表示t时刻生成的词向量。
在训练阶段,(S,I)是训练的图像语句对,可以通过随机梯度下降(SGD)的方法优化公式(2)中对数概率的和。
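A small sketch of how the sum of log-probabilities in formula (2) could be evaluated for a batch during SGD training; the (B, N, V) logits layout and the helper name are assumptions, not part of the patent.

import torch
import torch.nn.functional as F

def xe_objective(logits, target_ids):
    # logits: (B, N, V) word scores per time step; target_ids: (B, N) ground-truth word indices
    log_probs = F.log_softmax(logits, dim=-1)
    picked = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)   # log p(S_t | I, S_<t)
    return picked.sum(dim=1).mean()      # maximise this sum (equivalently, minimise its negative)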
在本公开的一个实施例中，位置感知损失（Position-Perceptive Loss）可以由反射位置模块确定。图10示出了反射位置模块确定位置感知损失的流程示意图。如图10所示，通过全连接层对反射式注意模块输出的第三输出信息进行全连接，以生成全连接信息，该全连接信息可以是1×1的向量；然后根据压缩层对应的预设压缩函数对全连接信息进行压缩处理，以获取与第三输出信息对应的词向量的预测位置，即预测的词向量在文本描述中的相对位置 l̂_t；最后根据预测位置和与第三输出信息对应的词向量在文本描述样本中的真实位置 l*_t 确定位置感知损失，其中词汇在句子中的真实位置 l*_t 可以根据文本描述样本中所包含的词汇数量和与目标区域特征对应的词汇在文本描述样本中的位置获得。进而根据真实位置 l*_t 和预测的相对位置 l̂_t 可以确定位置感知损失Lpos，具体的计算方式如公式(3)所示：
L_pos = ∑_{t=1}^{N} | l*_t − l̂_t |      (3)
其中，l*_t 和 l̂_t 分别表示当前时刻词向量在句子中真实和预测的相对位置，通过最小化Lpos来缩小两者间的距离。
进一步地,在获取交叉熵损失和位置感知损失后,可以根据公式(4)确定反射解码网络模型对应的损失函数的大小,具体如下:
L = L_xe + λ·L_pos      (4)
其中参数λ用于平衡损失函数在整个反射解码网络模型优化过程中的作用,其可以根据实际需要进行设定,本公开实施例对此不做具体限定。
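Combining the two terms of formula (4) in code could look as follows. This is only a sketch: the absolute-difference form of the position term, the default λ, and the function name are assumptions, and the cross-entropy term is written as the negative log-likelihood that is actually minimised during training.

import torch
import torch.nn.functional as F

def total_loss(logits, target_ids, pred_pos, true_pos, lam=1.0):
    # logits: (B, N, V); target_ids: (B, N); pred_pos / true_pos: (B, N) relative positions in (0, 1)
    l_xe = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    l_pos = (pred_pos - true_pos).abs().mean()          # position-perceptive term
    return l_xe + lam * l_pos                           # L = L_xe + λ·L_pos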
接下来,以盲人导航为例对本公开实施例的技术方案进行说明,视障人士可以佩戴一智能设备,该智能设备具体可以是智能眼镜、便携式智能相机等等,在视障人士运动的过程中,可以实时拍摄前方的道路图像,接着通过智能设备中搭载的图像描述装置对图像进行分析,以获取对应的文本描述,进一步地可以通过相应的语音输出设备将该文本描述输出,以使视障人士及时了解路况,躲避障碍物。例如,当视障人士行走至十字路口时,红灯亮起了,这时智能设备的图像采集单元能够获取包含信号灯、斑马线、车辆通行状况的图像,通过对该图像中的信号灯、斑马线、车辆进行编码,以获取第一图像特征;接着根据第一图像特征中所有像素的像素均值确定第二图像特征;然后将第一图像特征、第二图像特征和起始词向量输入至反射解码网络模型,通过反射解码网络模型中的反射解码子网络依次对图像中的信号灯、斑马线、车辆进行文本预测,比如根据信号灯能够输出文本“信号灯、红灯”、根据斑马线能够输出“斑马线、有车辆、无行人”等信息,最终根据与各图像区域对应的词向量可以生成文本描述“信号灯为红灯,斑马线上有车辆,行人无法通过”,该文本描述可以实时发送给视障人士,提醒其等待绿灯亮起再通行。
以儿童早期教育为例,幼儿在翻阅故事书的时候,会被形形色色的图案吸引,当幼儿观看一幅图画的时候,书本所携带的拍摄装置可以获取该幅图画,并将其输入至图像处理单元以获取对应的文本描述。除此之外,还可以提前将故事书中每页的图画存储起来,当幼儿观看某一页的图画时,该页的图画会被输入至图像处理单元以获取对应的文本描述。例如故事书中有一页图画是一只小羊在山坡上吃草,那么图像处理单元可以对该幅图画进行分割,对各图像区域中的对象进行编码,以获取第一图像特征;接着对第一图像特征中的所有像素求均值,并将所有像素的像素值替换为像素均值,以形成第二图像特征;然后将第一图像特征、第二图像特征和起始词向量输入至反射解码网络模型,通过反射解码网络模型根据上下文向量生成当前时刻的词语,并预测当前时刻的词语在句子中的相对位置,例如通过反射式注 意模型能够依次生成词向量:一只、小绵羊、山坡、吃草,根据这些词向量能够获得最终的文本描述:一只小绵羊在山坡上吃草,在幼儿观看图画的时候,该文本描述可以通过语音输出单元播放,帮助幼儿理解图画内容,增加对事物的认知。
本公开中的图像处理方法通过反射解码网络模型对编码网络模型编码的第一图像特征进行解码,通过反射式注意模块对当前时刻的隐藏状态和过去时刻的隐藏状态进行匹配,得到上下文向量以生成当前时刻的词向量,并通过反射位置模块对当前时刻的词向量在文本描述中的相对位置进行预测,增强了句子前后的关联和时序逻辑,进一步提高了语言模型的解码能力,保证了在较长或复杂句子的情况下,模型性能的稳定性,从而能生成更加自然准确的图像文本描述。
值得说明的是,虽然本公开实施例中主要针对长短期时序模块的解码输入部分,通过引入反射式注意力模块和反射位置模块进行改进,但是对于其它增强学习、图卷积神经网络和生成对抗网络技术也可以采用本公开中的反射式注意力模块和反射位置模块进行改进,进一步提高图像描述的生成质量。
以下介绍本公开的装置实施例,可以用于执行本公开上述实施例中的图像处理方法。对于本公开装置实施例中未披露的细节,请参照本公开上述的图像处理方法的实施例。
图11为根据本公开的一个实施例的图像处理装置的框图。
参照图11所示,根据本公开的一个实施例的图像处理装置1100,包括:特征提取模块1101、特征转换模块1102和描述生成模块1103。
其中,特征提取模块1101,用于获取输入图像,对所述输入图像中各图像区域所包含的对象进行编码,以获取第一图像特征;特征转换模块1102,用于根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;描述生成模块1103,用于基于所述第二图像特征和起始词向量,在不同时刻对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,以获取与各所述图像区域对应的词向量,并根据所述词向量形成与所述输入图像对应的文本描述,其中所述起始词向量为所述文本描述的起始标记。
在本公开的一个实施例中,所述特征提取模块1101配置为:对所述输入图像进行划分,以形成多个所述图像区域;通过编码网络模型对所述图像区域中的对象进行特征提取,以获取与所述图像区域对应的区域特征;根据所述区域特征形成所 述第一图像特征。
在本公开的一个实施例中,所述特征转换模块1102配置为:获取所述第一图像特征中所有像素的像素均值,并将所述像素均值作为各所述像素的像素值,以形成所述第二图像特征。
在本公开的一个实施例中,所述描述生成模块1103配置为:通过反射解码网络模型基于所述第二图像特征和起始词向量,在不同时刻对所述第一图像特征中与各所述图像区域对应的区域特征进行解码,以获取与各所述图像区域对应的词向量。
在本公开的一个实施例中,所述反射解码网络模型包括多个依次排列的反射解码子网络;所述描述生成模块1103配置为:将所述第二图像特征和第M反射解码子网络输出的词向量输入至第M+1反射解码子网络;通过所述第M+1反射解码子网络对所述第一图像特征中的目标区域特征进行解码,以获取与所述目标区域特征对应的词向量;其中,M为正整数。
在本公开的一个实施例中,所述描述生成模块1103配置为:将所述第二图像特征和所述起始词向量输入至第一反射解码子网络,通过所述第一反射解码子网络对所述第一图像特征中的目标区域特征进行解码,以获取与所述目标区域特征对应的词向量。
在本公开的一个实施例中,所述反射解码子网络包括视觉注意力模块、反射式注意模块和反射式位置模块,其中所述反射式位置模块用于预测当前时刻所述反射解码子网络输出的词向量在所述文本描述中的相对位置。
在本公开的一个实施例中,所述视觉注意力模块包括第一长短期记忆网络、第二长短期记忆网络和注意力机制网络;所述图像处理装置1100配置为:将前一时刻所述反射解码子网络输出的词向量与第一权重矩阵相乘,以获取目标词向量;通过所述第一长短期记忆网络对所述第二图像特征和所述目标词向量进行特征提取,以获取第一输出信息;将所述第一输出信息和所述第一图像特征输入至所述注意力机制网络进行视觉匹配,以获取目标区域特征;通过所述第二长短期记忆网络对所述第一输出信息和所述目标区域特征进行特征提取,以获取第二输出信息。
在本公开的一个实施例中,所述图像处理装置1100还包括:词向量生成模块,用于通过所述反射式注意模块根据过去时刻的所述第二输出信息和当前时刻的所述第一输出信息及所述第二输出信息确定当前时刻与目标区域特征对应的第三输出信 息。
在本公开的一个实施例中,所述词向量生成模块配置为:根据所有所述过去时刻的第二输出信息和所述当前时刻的第二输出信息确定目标矩阵;对所述目标矩阵进行降维处理,以获取第一特征信息,并对所述当前时刻的第一输出信息进行降维处理,以获取第二特征信息,其中所述第一特征信息和所述第二特征信息的维度相同;基于注意力机制将所述第一特征信息和所述第二特征信息相加,以获取第三特征信息;对所述第三特征信息进行加权处理和归一化处理,以获取第二权重矩阵;将所述第一特征信息与所述第二权重矩阵相乘并求和,以获取所述第三输出信息。
在本公开的一个实施例中,所述描述生成模块1103配置为:将所述第三输出信息与第三权重矩阵相乘,以获取与所述目标区域特征对应的词向量。
在本公开的一个实施例中,所述图像处理装置1100还包括:样本获取模块,用于获取图像样本和与所述图像样本对应的文本描述样本;模型训练模块,用于根据所述图像样本和所述文本描述样本对待训练反射式解码网络模型进行训练,直至所述待训练反射式解码网络模型对应的损失函数最小;其中所述损失函数包括交叉熵损失函数和位置感知损失函数。
在本公开的一个实施例中,所述交叉熵损失函数为所述待训练反射式解码网络生成的与所述图像样本对应的文本描述的正确概率;所述位置感知损失函数为当前时刻所述待训练反射式解码网络输出的词向量在文本描述样本中的真实位置和预测位置之间的距离。
在本公开的一个实施例中,所述位置感知损失函数对应的位置感知损失由所述反射式位置模块确定;所述图像处理装置1100配置为:通过全连接层对所述反射式注意模块输出的特征进行全连接,以生成全连接信息;根据预设压缩函数对所述全连接信息进行压缩,以获取与所述反射式注意模块输出的特征所对应的词向量的预测位置信息;根据所述预测位置信息和与所述反射式注意模块输出的特征所对应的词向量在所述文本描述样本中的真实位置信息确定所述位置感知损失。
图12示出了适于用来实现本公开实施例的电子设备的计算机系统的结构示意图。
需要说明的是,图12示出的电子设备的计算机系统1200仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图12所示,计算机系统1200包括中央处理单元(Central Processing Unit,CPU)1201,其可以根据存储在只读存储器(Read-Only Memory,ROM)1202中的程序或者从存储部分1208加载到随机访问存储器(Random Access Memory,RAM)1203中的程序而执行各种适当的动作和处理,实现上述实施例中所述的图像处理方法。在RAM 1203中,还存储有系统操作所需的各种程序和数据。CPU 1201、ROM 1202以及RAM 1203通过总线1204彼此相连。输入/输出(Input/Output,I/O)接口1205也连接至总线1204。
以下部件连接至I/O接口1205:包括键盘、鼠标等的输入部分1206;包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分1207;包括硬盘等的存储部分1208;以及包括诸如LAN(Local Area Network,局域网)卡、调制解调器等的网络接口卡的通信部分1209。通信部分1209经由诸如因特网的网络执行通信处理。驱动器1210也根据需要连接至I/O接口1205。可拆卸介质1211,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1210上,以便于从其上读出的计算机程序根据需要被安装入存储部分1208。
特别地,根据本公开的实施例,下文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1209从网络上被下载和安装,和/或从可拆卸介质1211被安装。在该计算机程序被中央处理单元(CPU)1201执行时,执行本公开的系统中限定的各种功能。
需要说明的是,本公开实施例所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、闪存、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的 组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、有线等等,或者上述的任意合适的组合。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。
作为另一方面,本公开还提供了一种计算机可读介质,该计算机可读介质可以是上述实施例中描述的图像处理装置中所包含的;也可以是单独存在,而未装配入该电子设备中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被一个该电子设备执行时,使得该电子设备实现上述实施例中所述的方法。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元 来具体化。
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本公开实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、触控终端、或者网络设备等)执行根据本公开实施方式的方法。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。

Claims (15)

  1. 一种图像处理方法,由一个或多个计算设备执行,包括:
    获取输入图像,提取所述输入图像中各图像区域的区域特征,以获取第一图像特征;
    根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;
    基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,确定与所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,预测所述词向量在文本描述中的位置,并根据所述词向量和所述位置形成与所述输入图像对应的文本描述。
  2. 根据权利要求1所述的方法,其中,提取所述输入图像中各图像区域的区域特征,以获取第一图像特征,包括:
    对所述输入图像进行划分,以形成多个所述图像区域;
    通过编码网络模型对所述图像区域中的对象进行特征提取,以获取与所述图像区域中的对象对应的区域特征;
    根据所述区域特征形成所述第一图像特征。
  3. 根据权利要求1所述的方法,其中,所述基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,在不同时刻,确定所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,包括:
    通过反射解码网络模型基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,在不同时刻,确定所述第一图像特征中各所述图像区域对应的区域特征对应的词向量。
  4. 根据权利要求3所述的方法,其中,所述反射解码网络模型包括多个依次排列的反射解码子网络;
    所述通过反射解码网络模型基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,在不同时刻,确定所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,包括:
    将所述第二图像特征和第M反射解码子网络输出的词向量输入至第M+1反射解码子网络;
    通过所述第M+1反射解码子网络确定所述第一图像特征中的目标区域特征对应的词向量;其中,M为正整数。
  5. 根据权利要求3所述的方法,其中,所述通过反射解码网络模型基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,在不同时刻,确定所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,包括:
    将所述第二图像特征和起始词向量输入至第一反射解码子网络,通过所述第一反射解码子网络基于所述第二图像特征和所述起始词向量确定所述第一图像特征中的目标区域特征对应的词向量,其中所述起始词向量为所述文本描述的起始标记。
  6. 根据权利要求4所述的方法,其中,所述反射解码子网络包括反射式位置模块,其中所述反射式位置模块用于预测当前时刻所述反射解码子网络输出的词向量在所述文本描述中的相对位置。
  7. 根据权利要求4所述的方法,其中,所述反射解码子网络进一步包括视觉注意力模块,所述视觉注意力模块包括第一长短期记忆网络、第二长短期记忆网络和注意力机制网络;
    其中,通过所述第M+1反射解码子网络确定所述第一图像特征中的目标区域特征对应的词向量包括:
    将前一时刻所述第M+1反射解码子网络输出的词向量与第一权重矩阵相乘,以获取目标词向量;
    通过所述第一长短期记忆网络对所述第二图像特征和所述目标词向量进行特征提取,以确定所述第一长短期记忆网络的第一隐藏状态;
    将所述第一隐藏状态和所述第一图像特征输入至所述注意力机制网络进行视觉匹配,以获取目标区域特征;
    通过所述第二长短期记忆网络对所述第一隐藏状态和所述目标区域特征进行特征提取,以确定所述第二长短期记忆网络的第二隐藏状态;
    通过所述第一隐藏状态和所述第二隐藏状态确定所述目标区域特征对应的词向量。
  8. 根据权利要求7所述的方法,其中,所述反射解码子网络进一步包括反射式注意模块;
    其中,通过所述第一隐藏状态和所述第二隐藏状态确定所述目标区域特征对应 的词向量包括:
    通过所述反射式注意模块根据过去时刻的第二隐藏状态和当前时刻的所述第一隐藏状态及所述第二隐藏状态确定当前时刻与目标区域特征对应的第三输出信息;
    将所述第三输出信息与第三权重矩阵相乘,以获取与所述目标区域特征对应的词向量。
  9. 根据权利要求8所述的方法,其中,确定当前时刻与目标区域特征对应的第三输出信息包括:
    根据所有所述过去时刻的第二隐藏状态和所述当前时刻的第二隐藏状态确定目标矩阵;
    对所述目标矩阵进行降维处理,以获取第一特征信息,并对所述当前时刻的第一隐藏状态进行降维处理,以获取第二特征信息,其中所述第一特征信息和所述第二特征信息的维度相同;
    基于注意力机制将所述第一特征信息和所述第二特征信息相加,以获取第三特征信息;
    对所述第三特征信息进行加权处理和归一化处理,以获取第二权重矩阵;
    将所述第一特征信息与所述第二权重矩阵相乘并求和,以获取所述第三输出信息。
  10. 根据权利要求3所述的方法,其中,在获取输入图像之前,所述方法还包括:
    获取图像样本和与所述图像样本对应的文本描述样本;
    根据所述图像样本和所述文本描述样本对待训练反射解码网络模型进行训练,直至所述待训练反射解码网络模型对应的损失函数最小;
    其中所述损失函数包括交叉熵损失函数和位置感知损失函数。
  11. 根据权利要求10所述的方法,其特征在于,所述交叉熵损失函数为所述待训练反射式解码网络生成的与所述图像样本对应的文本描述的正确概率;所述位置感知损失函数为当前时刻所述待训练反射式解码网络输出的词向量在文本描述样本中的真实位置和预测位置之间的距离。
  12. 根据权利要求11所述的方法,其中,所述位置感知损失函数对应的位置感知损失由所述反射式位置模块确定;
    所述方法还包括:
    通过全连接层对所述反射式注意模块输出的特征进行全连接,以生成全连接信息;
    根据预设压缩函数对所述全连接信息进行压缩,以获取与所述反射式注意模块输出特征所对应的词向量的预测位置;
    根据所述预测位置和与所述反射式注意模块输出特征所对应的词向量在所述文本描述样本中的真实位置确定所述位置感知损失。
  13. 一种图像处理装置,包括:
    特征提取模块,用于获取输入图像,提取所述输入图像中各图像区域的区域特征,以获取第一图像特征;
    特征转换模块,用于根据预设规则对所述第一图像特征中的像素进行处理,并根据处理后的像素确定第二图像特征;
    描述生成模块,用于基于所述第二图像特征和针对所述输入图像已确定的至少一个词向量,确定所述第一图像特征中各所述图像区域对应的区域特征对应的词向量,预测所述词向量在文本描述中的位置,并根据所述词向量和所述位置形成与所述输入图像对应的文本描述。
  14. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,用于存储一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器实现如权利要求1至12中任一项所述的图像处理方法。
  15. 一种计算机可读存储介质,存储一个或多个计算机可读指令,其特征在于,包括:所述指令可由一个或多个处理器执行,用于实现如权利要求1至12中任一项所述的图像处理方法。
PCT/CN2020/115559 2019-09-16 2020-09-16 图像处理方法、装置及电子设备 WO2021052358A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP20866551.3A EP3998552A4 (en) 2019-09-16 2020-09-16 METHOD AND APPARATUS FOR IMAGE PROCESSING AND ELECTRONIC DEVICE
JP2021564175A JP7164252B2 (ja) 2019-09-16 2020-09-16 画像処理方法、装置、電子機器及びコンピュータプログラム
US17/517,004 US11907637B2 (en) 2019-09-16 2021-11-02 Image processing method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910872478.6A CN110717498A (zh) 2019-09-16 2019-09-16 图像描述生成方法、装置及电子设备
CN201910872478.6 2019-09-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/517,004 Continuation US11907637B2 (en) 2019-09-16 2021-11-02 Image processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2021052358A1 true WO2021052358A1 (zh) 2021-03-25

Family

ID=69210507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115559 WO2021052358A1 (zh) 2019-09-16 2020-09-16 图像处理方法、装置及电子设备

Country Status (5)

Country Link
US (1) US11907637B2 (zh)
EP (1) EP3998552A4 (zh)
JP (1) JP7164252B2 (zh)
CN (1) CN110717498A (zh)
WO (1) WO2021052358A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343664A (zh) * 2021-06-29 2021-09-03 京东数科海益信息科技有限公司 图像文本之间的匹配度的确定方法及装置
CN113435357A (zh) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 语音播报方法、装置、设备及存储介质
CN113468357A (zh) * 2021-07-21 2021-10-01 北京邮电大学 一种图像描述文本生成方法及装置
CN114495101A (zh) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 文本检测方法、文本检测网络的训练方法及装置
CN114663650A (zh) * 2022-03-22 2022-06-24 平安科技(深圳)有限公司 图像描述生成方法及装置、电子设备、可读存储介质
CN115376137A (zh) * 2022-08-02 2022-11-22 北京百度网讯科技有限公司 一种光学字符识别处理、文本识别模型训练方法及装置
CN116453120A (zh) * 2023-04-19 2023-07-18 浪潮智慧科技有限公司 基于时序场景图注意力机制的图像描述方法、设备及介质

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717498A (zh) 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 图像描述生成方法、装置及电子设备
CN111259672A (zh) * 2020-02-12 2020-06-09 新疆大学 基于图卷积神经网络的中文旅游领域命名实体识别方法
CN111324752B (zh) * 2020-02-20 2023-06-16 中国科学技术大学 基于图神经网络结构建模的图像与文本检索方法
CN111597819B (zh) * 2020-05-08 2021-01-26 河海大学 一种基于关键词的大坝缺陷图像描述文本生成方法
CN111916050A (zh) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 语音合成方法、装置、存储介质和电子设备
CN112016493A (zh) * 2020-09-03 2020-12-01 科大讯飞股份有限公司 图像描述方法、装置、电子设备及存储介质
CN112232149B (zh) * 2020-09-28 2024-04-16 北京易道博识科技有限公司 一种文档多模信息和关系提取方法及系统
US11625925B2 (en) * 2021-01-05 2023-04-11 GM Global Technology Operations LLC Remote segmentation under limited computational resources and rate constraints
CN113569068B (zh) * 2021-01-19 2023-09-29 腾讯科技(深圳)有限公司 描述内容生成方法、视觉内容的编码、解码方法、装置
CN112819012B (zh) * 2021-01-29 2022-05-03 厦门大学 一种基于多源协同特征的图像描述生成方法
CN113221872B (zh) * 2021-05-28 2022-09-20 北京理工大学 生成对抗网络与多模态融合的假新闻检测方法
CN113377986B (zh) * 2021-06-23 2023-11-07 泰康保险集团股份有限公司 图像检索方法和装置
CN113515951B (zh) * 2021-07-19 2022-07-05 同济大学 基于知识增强注意力网络和组级语义的故事描述生成方法
TWI779815B (zh) * 2021-09-03 2022-10-01 瑞昱半導體股份有限公司 基於知識蒸餾實現的具備臉部校正效果的臉部辨識網路模型
CN114677520A (zh) * 2022-03-22 2022-06-28 平安科技(深圳)有限公司 图像描述方法和装置、计算机设备、存储介质
CN114782702A (zh) * 2022-03-23 2022-07-22 成都瑞数猛兽科技有限公司 一种基于三层lstm推敲网络的图像语义理解算法
CN114972774A (zh) * 2022-04-20 2022-08-30 平安科技(深圳)有限公司 特定区域的图像描述生成方法、装置、设备及存储介质
CN114862666B (zh) * 2022-06-22 2022-10-04 阿里巴巴达摩院(杭州)科技有限公司 图像变换系统、方法、存储介质及电子设备
CN115273810A (zh) * 2022-07-04 2022-11-01 成都理工大学 基于深度学习的多模态图像语音解读方法和系统
CN115359323B (zh) * 2022-08-31 2023-04-25 北京百度网讯科技有限公司 图像的文本信息生成方法和深度学习模型的训练方法
CN115187996B (zh) * 2022-09-09 2023-01-06 中电科新型智慧城市研究院有限公司 语义识别方法、装置、终端设备和存储介质
CN115953590B (zh) * 2022-12-12 2024-01-30 之江实验室 一种分段式细粒度的商品图像描述生成方法、装置和介质
CN116597454A (zh) * 2023-05-24 2023-08-15 北京百度网讯科技有限公司 图像处理方法、图像处理模型的训练方法和装置
CN116912629B (zh) * 2023-09-04 2023-12-29 小舟科技有限公司 基于多任务学习的通用图像文字描述生成方法及相关装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109190619A (zh) * 2018-08-23 2019-01-11 重庆大学 一种基于目标掩膜的图像描述方法
CN110111399A (zh) * 2019-04-24 2019-08-09 上海理工大学 一种基于视觉注意力的图像文本生成方法
CN110110145A (zh) * 2018-01-29 2019-08-09 腾讯科技(深圳)有限公司 描述文本生成方法及装置
CN110119754A (zh) * 2019-02-27 2019-08-13 北京邮电大学 图像生成描述方法、装置及模型
CN110210499A (zh) * 2019-06-03 2019-09-06 中国矿业大学 一种图像语义描述的自适应生成系统
CN110717498A (zh) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 图像描述生成方法、装置及电子设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006113776A (ja) 2004-10-14 2006-04-27 Konica Minolta Photo Imaging Inc 画像処理システムおよび画像処理プログラム
JP5521881B2 (ja) 2010-08-12 2014-06-18 富士ゼロックス株式会社 画像識別情報付与プログラム及び画像識別情報付与装置
JP2013021482A (ja) 2011-07-11 2013-01-31 Nippon Telegr & Teleph Corp <Ntt> 映像アノテーション付与装置およびその動作方法
EP3049975B1 (en) 2013-09-25 2018-11-07 HeartFlow, Inc. Systems and methods for validating and correcting automated medical image annotations
US10192129B2 (en) * 2015-11-18 2019-01-29 Adobe Systems Incorporated Utilizing interactive deep learning to select objects in digital visual media
JP6867153B2 (ja) 2016-12-21 2021-04-28 ホーチキ株式会社 異常監視システム
JP2019135636A (ja) 2018-02-05 2019-08-15 株式会社あのころコミュニケーションズ 自分史作成支援システム

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN110110145A (zh) * 2018-01-29 2019-08-09 腾讯科技(深圳)有限公司 描述文本生成方法及装置
CN109190619A (zh) * 2018-08-23 2019-01-11 重庆大学 一种基于目标掩膜的图像描述方法
CN110119754A (zh) * 2019-02-27 2019-08-13 北京邮电大学 图像生成描述方法、装置及模型
CN110111399A (zh) * 2019-04-24 2019-08-09 上海理工大学 一种基于视觉注意力的图像文本生成方法
CN110210499A (zh) * 2019-06-03 2019-09-06 中国矿业大学 一种图像语义描述的自适应生成系统
CN110717498A (zh) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 图像描述生成方法、装置及电子设备

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU YU: "Design and Implementation of Image Captioning Model Based on Deep Learning", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 15 January 2019 (2019-01-15), pages 1 - 61, XP055793544 *
See also references of EP3998552A4

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343664A (zh) * 2021-06-29 2021-09-03 京东数科海益信息科技有限公司 图像文本之间的匹配度的确定方法及装置
CN113343664B (zh) * 2021-06-29 2023-08-08 京东科技信息技术有限公司 图像文本之间的匹配度的确定方法及装置
CN113435357A (zh) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 语音播报方法、装置、设备及存储介质
CN113435357B (zh) * 2021-06-30 2022-09-02 平安科技(深圳)有限公司 语音播报方法、装置、设备及存储介质
CN113468357A (zh) * 2021-07-21 2021-10-01 北京邮电大学 一种图像描述文本生成方法及装置
CN113468357B (zh) * 2021-07-21 2023-07-11 北京邮电大学 一种图像描述文本生成方法及装置
CN114495101A (zh) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 文本检测方法、文本检测网络的训练方法及装置
CN114663650A (zh) * 2022-03-22 2022-06-24 平安科技(深圳)有限公司 图像描述生成方法及装置、电子设备、可读存储介质
CN115376137A (zh) * 2022-08-02 2022-11-22 北京百度网讯科技有限公司 一种光学字符识别处理、文本识别模型训练方法及装置
CN115376137B (zh) * 2022-08-02 2023-09-26 北京百度网讯科技有限公司 一种光学字符识别处理、文本识别模型训练方法及装置
CN116453120A (zh) * 2023-04-19 2023-07-18 浪潮智慧科技有限公司 基于时序场景图注意力机制的图像描述方法、设备及介质
CN116453120B (zh) * 2023-04-19 2024-04-05 浪潮智慧科技有限公司 基于时序场景图注意力机制的图像描述方法、设备及介质

Also Published As

Publication number Publication date
JP7164252B2 (ja) 2022-11-01
US11907637B2 (en) 2024-02-20
JP2022530785A (ja) 2022-07-01
EP3998552A1 (en) 2022-05-18
US20220058332A1 (en) 2022-02-24
EP3998552A4 (en) 2022-11-02
CN110717498A (zh) 2020-01-21

Similar Documents

Publication Publication Date Title
WO2021052358A1 (zh) 图像处理方法、装置及电子设备
CN111930992B (zh) 神经网络训练方法、装置及电子设备
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN111898696A (zh) 伪标签及标签预测模型的生成方法、装置、介质及设备
CN113297370B (zh) 基于多交互注意力的端到端多模态问答方法及系统
Le et al. An overview of deep learning in industry
CN112580720A (zh) 一种模型训练方法及装置
CN113505193A (zh) 一种数据处理方法及相关设备
WO2024083121A1 (zh) 一种数据处理方法及其装置
Wang et al. (2+ 1) D-SLR: an efficient network for video sign language recognition
WO2022222854A1 (zh) 一种数据处理方法及相关设备
CN117033609B (zh) 文本视觉问答方法、装置、计算机设备和存储介质
Cao et al. Visual question answering research on multi-layer attention mechanism based on image target features
CN116980541A (zh) 视频编辑方法、装置、电子设备以及存储介质
Guo Analysis of artificial intelligence technology and its application in improving the effectiveness of physical education teaching
Liu et al. Digital twins by physical education teaching practice in visual sensing training system
Rawat et al. Indian sign language recognition system for interrogative words using deep learning
Xiao et al. Gaze prediction based on long short-term memory convolution with associated features of video frames
Ma et al. Multimodal data processing framework for smart city: A positional-attention based deep learning approach
Li Expression Recognition of Classroom Children’s Game Video Based on Improved Convolutional Neural Network
Zhu Image-Based Storytelling for Tourist Using Deep Learning
CN117034133A (zh) 一种数据处理方法、装置、设备和介质
Chen et al. Adversarial Training for Image Captioning Incorporating Relation Attention
Yong DCC-net network model for motion data management based on infrared light sensor
Liu et al. Research on Automatic Recognition Technology of Print Robot in Human-Computer Interaction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20866551

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021564175

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE