CN114037831B - Image depth dense description method, system and storage medium - Google Patents

Image depth dense description method, system and storage medium

Info

Publication number
CN114037831B
CN114037831B
Authority
CN
China
Prior art keywords
long short-term memory network
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111344143.0A
Other languages
Chinese (zh)
Other versions
CN114037831A (en)
Inventor
孔锐
谢玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinghan Intelligent Technology Co., Ltd.
Original Assignee
Xinghan Intelligent Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinghan Intelligent Technology Co., Ltd.
Publication of CN114037831A
Application granted
Publication of CN114037831B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/25 Fusion techniques
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an image depth dense description method, system and storage medium, relating to the field of artificial intelligence. The image depth dense description method comprises the following steps: receiving an input image; extracting a target region of the input image; generating a local feature vector and a context feature vector of the target region; performing initialization and generation operations using the local feature vector and the context feature vector; and generating a bounding box of the target region and a corresponding semantic description sentence. The image depth dense description method, system and storage medium can effectively solve the problems of locating and identifying targets in dense image description, and can efficiently and accurately locate and semantically describe target regions in images.

Description

Image depth dense description method, system and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an image depth dense description method, system and storage medium.
Background
At present, most image description methods generate only a single high-level sentence to describe the semantic content of an input image, which places an important limitation on the image semantic information that can be expressed: a lack of detail. In recent years, researchers have proposed ROI (Region of Interest)-oriented image description tasks that overcome this semantic limitation; such a task aims to automatically locate the ROIs in an image and generate natural language phrases or sentences to characterize the semantic information in each ROI. Many challenges remain in dense image description: for example, dense and highly overlapping ROIs make it difficult to locate the target region, and visually obscured ROIs make it difficult to identify the target.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. The invention therefore provides an image depth dense description method, system and storage medium, which can effectively solve the problems of locating and identifying targets in dense image description and achieve efficient and accurate locating and semantic description of target regions in images.
An image depth dense description method according to an embodiment of the first aspect of the present invention includes: receiving an input image; extracting a target region of the input image; generating a local feature vector and a context feature vector of the target region; performing initialization and generation operations using the local feature vector and the context feature vector; and generating a bounding box of the target region and a corresponding semantic description sentence.
The image depth dense description method provided by the embodiment of the invention has at least the following beneficial effects: the method extracts a target region from the input image, demarcates a bounding box for the target region, and generates a corresponding semantic description sentence, thereby achieving efficient and accurate locating of the target region together with an accurate semantic description of it.
According to some embodiments of the invention, extracting the target region of the input image comprises the steps of: processing the input image through a deep CNN to generate an image feature map; and performing candidate region prediction on the image feature map to obtain the target region.
According to some embodiments of the invention, the image feature map is used as the input to the RPN, which predicts the candidate regions through a set of translation-invariant anchor regression offsets; each candidate region is assigned a confidence score, and the target region is obtained by sampling from the candidate regions.
According to some embodiments of the invention, generating the local feature vector and the context feature vector of the target region comprises the steps of:
mapping the obtained target region onto a convolution feature map; converting the corresponding region on the convolution feature map into the local feature vector; and generating the context feature vector by concatenating the converted local feature vectors.
According to some embodiments of the invention, the conversion into the local feature vector specifically comprises: converting the corresponding region on the convolution feature map into the local feature vector using an ROI pooling layer.
According to some embodiments of the invention, the initialization and generation operations using the local feature vector and the context feature vector include the following steps:
initializing hidden states of a first long short-term memory network and a second long short-term memory network with the local feature vector; inputting the start token of a semantic description sentence into the second long short-term memory network; judging whether the end token of the sentence has been generated; and if not, feeding the embedded feature vector of the predicted word back to the second long short-term memory network to predict the next word of the semantic description sentence, while encoding the embedded feature vector into the first long short-term memory network to understand the semantic information of the target region.
According to some embodiments of the invention, the initialization and generation operations using the local feature vector and the context feature vector further include the following steps:
initializing hidden states of the second long short-term memory network and a third long short-term memory network with the local feature vector and the context feature vector, respectively; inputting the start token of the semantic description sentence into the second and third long short-term memory networks; judging whether the end token of the sentence has been generated; and if not, feeding the embedded feature vector of the predicted word back to the second and third long short-term memory networks, and combining the hidden states of the two networks through additive fusion to predict the next word of the semantic description sentence.
An image depth dense description system according to an embodiment of the second aspect of the present invention comprises a region detector and a localization and description network, the region detector comprising a deep CNN for generating an image feature map, an RPN for acquiring candidate regions, and an ROI pooling layer for generating, by conversion, the local feature vectors and the context feature vector;
the localization and description network comprises a joint localization module and a context reasoning module, both of which receive the local feature vector and the context feature vector output by the region detector; the joint localization module is used for locating a target region and generating a bounding box of the target region, and the context reasoning module is used for generating a semantic description sentence.
The image depth dense description system provided by the embodiment of the invention has at least the following beneficial effects: the local feature vector and the context feature vector of the input image are obtained through the region detector, and the extracted feature vectors are used to initialize the localization and description network, which generates a bounding box of the target region and a corresponding semantic description sentence, thereby achieving accurate locating and identification of the target region.
According to some embodiments of the invention, the joint localization module comprises the first and second long short-term memory networks, the context reasoning module comprises the second and third long short-term memory networks, and the two modules share the second long short-term memory network; the first long short-term memory network is used for understanding the semantic information of the target region and improving the locating of the target region, and the hidden states of the second and third long short-term memory networks are combined by an additive fusion unit to predict the next word in the semantic description sentence.
A computer-readable storage medium according to an embodiment of the third aspect of the present invention stores computer-executable instructions for performing the image depth dense description method of the embodiment of the first aspect described above.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of steps of an image depth dense description method according to an embodiment of the invention;
FIG. 2 is a flowchart illustrating a step of generating feature vectors according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the steps of the initialization and generation operation according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another step of the initialization and generation operation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an image depth dense description system according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a localization and description network according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; terms such as "greater than", "less than" and "exceeding" are understood to exclude the stated number, while "above", "below" and "within" are understood to include it. Descriptions of "first" and "second" serve only to distinguish technical features and are not to be construed as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating their precedence.
In the following description, suffixes such as "module", "component" or "unit" used to represent elements serve only to facilitate the description of the present invention and have no particular meaning in themselves; thus "module", "component" and "unit" may be used interchangeably.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Embodiments of the present invention are further described below with reference to the accompanying drawings.
In some embodiments of the present invention, referring to fig. 1, a step flow of an image depth dense description method is shown, wherein the image depth dense description method includes, but is not limited to, the following steps:
step S110, an input image is received.
Step S120 extracts a target area of the input image.
Step S130, generating a local feature vector and a context feature vector of the target region.
In step S140, the local feature vector and the context feature vector are used for initialization and generation operations.
Step S150, generating a boundary box of the target area and corresponding semantic description sentences.
It can be understood that, for a received input image, the features of the image are extracted to obtain a target region; the local feature vector and the context feature vector of the target region are then generated, and the system model is initialized with these vectors to run the generation operation, correspondingly producing the bounding box and the semantic description sentence of the target region. This effectively solves the locating and identification of targets in dense image description and achieves efficient and accurate locating and semantic description of target regions in the image.
In some embodiments of the present invention, the input image is processed through a deep CNN (Convolutional Neural Network) to generate an image feature map, and candidate region prediction is performed on the image feature map to obtain the target region.
In some embodiments of the present invention, for the prediction of candidate regions, the image feature map is taken as the input to the RPN (Region Proposal Network), and the candidate regions can be predicted through a set of translation-invariant anchor regression offsets. Each candidate region is assigned a confidence score representing how well that candidate is predicted. The target region is obtained by sampling from the candidate regions, for example by selecting the candidate regions with high confidence scores.
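As an illustration of this candidate-region step, the following is a minimal sketch of an anchor-based proposal head in PyTorch. The class name RPNHead, the channel counts, the anchor count and the top-k sampling rule are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Slides a 3x3 conv over the image feature map and, for each anchor,
    predicts a confidence score and four box-regression offsets."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.score = nn.Conv2d(512, num_anchors, kernel_size=1)       # objectness score
        self.offset = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.score(h), self.offset(h)

# Score every anchor position, then sample the highest-confidence candidates
# as target regions (the patent samples target regions from the candidates).
feature_map = torch.randn(1, 512, 38, 50)        # output of the deep CNN (assumed shape)
scores, offsets = RPNHead()(feature_map)
top_idx = scores.flatten(1).topk(k=128).indices  # indices of the best candidates
```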
In some embodiments of the present invention, referring to FIG. 2, the steps for generating the local feature vectors and the context feature vector are shown, including but not limited to:
Step S210: the target region is mapped onto the convolution feature map.
Step S220: the corresponding region on the convolution feature map is converted into a local feature vector.
Step S230: the context feature vector is generated by concatenating the converted local feature vectors.
It should be understood that the corresponding region refers to the mapped region on the convolution feature map that corresponds to the target region. There may be several local feature vectors, their number determined by the number of target regions, and the context feature vector is generated by connecting the local feature vectors in series.
In some embodiments of the present invention, specifically, an ROI pooling layer is employed to convert the corresponding region on the convolution feature map into a fixed-length local feature vector.
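The conversion and concatenation can be sketched as follows, assuming torchvision's roi_pool as a stand-in for the patent's ROI pooling layer; the feature-map shape, the box coordinates and the 7x7 output size are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_pool

conv_feature_map = torch.randn(1, 512, 38, 50)
# Two sampled target regions already mapped onto the feature map,
# given as (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0., 4., 6., 20., 30.],
                      [0., 10., 2., 34., 18.]])
# Each region becomes a fixed-length local feature vector (512 * 7 * 7 = 25088).
pooled = roi_pool(conv_feature_map, boxes, output_size=(7, 7))
local_vectors = pooled.flatten(1)          # shape: (num_regions, 25088)
# The context feature vector is the concatenation of the local feature vectors.
context_vector = local_vectors.flatten()   # shape: (num_regions * 25088,)
```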
In some embodiments of the present invention, referring to FIG. 3, the initialization and generation operations using the local feature vector and the context feature vector include the following steps:
Step S310: the hidden states of the first long short-term memory network and the second long short-term memory network are initialized with the local feature vector.
Step S320: the start token of the semantic description sentence is input into the second long short-term memory network.
Step S330: whether the end token of the sentence has been generated is judged.
Step S340: if not, the embedded feature vector of the predicted word is fed back to the second long short-term memory network, and the embedded feature vector is encoded into the first long short-term memory network.
It will be understood that the hidden states of the first long short-term memory network (LSTM), e.g. the L-LSTM, and of the second LSTM, e.g. the C-LSTM, are initialized by the local feature vector; the start token of the semantic description sentence is input into the C-LSTM; the embedded feature vector of the predicted word is fed back into the C-LSTM and also encoded into the L-LSTM; and as long as the sentence has not ended, i.e. the sentence end token has not been generated, step S340 is repeated.
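The loop of steps S310 to S340 could look like the following sketch, assuming PyTorch LSTM cells for the L-LSTM and the C-LSTM; the vocabulary size, the dimensions, the token ids and the projection used for initialization are all assumptions.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 256, 512, 10000
SOS, EOS = 1, 2                              # assumed token ids for <SOS> / <EOS>

embed = nn.Embedding(vocab_size, embed_dim)
l_lstm = nn.LSTMCell(embed_dim, hidden_dim)  # L-LSTM: understands region semantics
c_lstm = nn.LSTMCell(embed_dim, hidden_dim)  # C-LSTM: predicts the next word
word_head = nn.Linear(hidden_dim, vocab_size)
init_proj = nn.Linear(25088, hidden_dim)     # local feature vector -> hidden state

local_vec = torch.randn(1, 25088)            # from the ROI pooling layer
h0 = init_proj(local_vec)                    # S310: initialize both hidden states
hl, cl = h0.clone(), torch.zeros_like(h0)
hc, cc = h0.clone(), torch.zeros_like(h0)

word, sentence = torch.tensor([SOS]), []     # S320: start token into the C-LSTM
for _ in range(20):                          # cap on sentence length
    x = embed(word)                          # embedded feature vector of the word
    hc, cc = c_lstm(x, (hc, cc))             # S340: feed back into the C-LSTM
    hl, cl = l_lstm(x, (hl, cl))             # S340: encode into the L-LSTM
    word = word_head(hc).argmax(dim=1)       # predict the next word
    if word.item() == EOS:                   # S330: stop at the end token
        break
    sentence.append(word.item())
```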
In some embodiments of the present invention, referring to FIG. 4, the initialization and generation operations using the local feature vector and the context feature vector may instead include the following steps:
Step S410: the hidden states of the second long short-term memory network and the third long short-term memory network are initialized with the local feature vector and the context feature vector, respectively.
Step S420: the start token of the semantic description sentence is input into the second and third long short-term memory networks.
Step S430: whether the end token of the sentence has been generated is judged.
Step S440: if not, the embedded feature vector of the predicted word is fed back to the second and third long short-term memory networks, and the hidden states of the two networks are combined through additive fusion.
It will be appreciated that the initialization and generation operations using the local feature vector and the context feature vector may also proceed as follows: the hidden states of the second and third long short-term memory networks, e.g. the C-LSTM and the G-LSTM, are initialized with the local feature vector and the context feature vector, respectively; the start token of the semantic description sentence is input into the C-LSTM and the G-LSTM; the embedded feature vector of the predicted word is fed back to the C-LSTM and the G-LSTM, and their hidden states are combined through additive fusion so as to improve the semantic description sentence; and as long as the sentence has not ended, i.e. the sentence end token has not been generated, step S440 is repeated.
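Steps S410 to S440 can be sketched the same way. The block below continues the previous sketch (embed, c_lstm, word_head, init_proj, local_vec, SOS, EOS, embed_dim and hidden_dim are defined there), adds the G-LSTM, and takes the additive fusion to be the element-wise sum of the two hidden states; the context-vector size (two concatenated local vectors) is again an assumption.

```python
# Continues the sketch above (see the previous code block for the shared names).
g_lstm = nn.LSTMCell(embed_dim, hidden_dim)  # G-LSTM: carries global context
ctx_proj = nn.Linear(2 * 25088, hidden_dim)  # context feature vector -> hidden state

context_vec = torch.randn(1, 2 * 25088)
hc, cc = init_proj(local_vec), torch.zeros(1, hidden_dim)   # S410: from local vector
hg, cg = ctx_proj(context_vec), torch.zeros(1, hidden_dim)  # S410: from context vector

word = torch.tensor([SOS])                   # S420: start token into both networks
for _ in range(20):
    x = embed(word)
    hc, cc = c_lstm(x, (hc, cc))             # S440: feed back into the C-LSTM
    hg, cg = g_lstm(x, (hg, cg))             # S440: feed back into the G-LSTM
    fused = hc + hg                          # additive fusion of the hidden states
    word = word_head(fused).argmax(dim=1)    # predict the next word
    if word.item() == EOS:                   # S430: stop at the end token
        break
```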
Referring to FIGS. 5 and 6, in some embodiments of the present invention, an image depth dense description system is presented that includes a Region Detector and a Localization and Captioning Network. The region detector includes a deep CNN for generating an image feature map, an RPN for acquiring candidate regions, and an ROI pooling layer for generating, by conversion, the local feature vectors and the context feature vector. For example, an input image enters the region detector; first, the deep CNN processes the input image to generate an image feature map; the image feature map is then taken as the input to the RPN, and candidate regions are predicted through a set of translation-invariant anchor regression offsets; the target regions are obtained by sampling the predicted candidate regions and are mapped onto a convolution feature map; the corresponding regions of the convolution feature map are converted into fixed-length local feature vectors by the ROI pooling layer; and the context feature vector can be generated by connecting the local feature vectors in series.
The localization and description network includes a joint localization module (Joint Localization Module) and a context reasoning module (Contextual Reasoning Module), each of which receives the local feature vectors and the context feature vector from the region detector; the joint localization module is configured to locate the target region and generate its bounding box, and the context reasoning module is configured to generate the semantic description sentence.
The joint localization module comprises two long short-term memory networks, namely a first and a second long short-term memory network, e.g. the L-LSTM and the C-LSTM. The local feature vector is used to initialize the hidden states of the L-LSTM and the C-LSTM; the start token of the sentence (e.g. <SOS>) is input into the C-LSTM; at each time step, the embedded feature vector of the predicted word is fed back to the C-LSTM to predict the next word and is also encoded into the L-LSTM to understand the semantic information of the target region, thereby improving the locating of the target region; and the process repeats until the end token of the sentence (e.g. <EOS>) is generated.
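The patent states that the joint localization module generates the bounding box of the target region but does not spell out how the box is computed from the L-LSTM; the short sketch below, continuing the earlier blocks, is purely an assumed realization: a small regression head on the final L-LSTM hidden state that refines the sampled candidate box.

```python
# Continues the sketches above; hl is the final L-LSTM hidden state.
# Hypothetical box-refinement head -- an assumption, not specified by the patent.
box_head = nn.Linear(hidden_dim, 4)  # predicts (dx, dy, dw, dh) offsets
box_offsets = box_head(hl)           # applied to the sampled candidate box
```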
The context reasoning module likewise comprises two long short-term memory networks, namely the second and a third long short-term memory network, e.g. the C-LSTM and the G-LSTM; the context reasoning module thus shares one network, the C-LSTM, with the joint localization module. The local feature vector and the context feature vector are used to initialize the hidden states of the C-LSTM and the G-LSTM, respectively; the start token of the sentence (e.g. <SOS>) is input into the C-LSTM and the G-LSTM; at each time step, the embedded feature vector of the predicted word is fed back to the C-LSTM and the G-LSTM to predict the next word; and the process repeats until the end token of the sentence (e.g. <EOS>) is generated. The context reasoning module combines the hidden states of the C-LSTM and the G-LSTM through an additive fusion unit so as to improve the semantic description sentence of the target region.
In some embodiments of the present invention, there is also provided a computer-readable storage medium storing computer-executable instructions for performing the image depth dense description method of the above embodiments.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The device embodiments described above are merely illustrative; the elements described as separate components may or may not be physically separate, i.e., they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present invention.

Claims (5)

1. An image depth dense description method, comprising:
receiving an input image;
extracting a target region of the input image;
generating a local feature vector and a context feature vector of the target region, which specifically comprises: mapping the obtained target region onto a convolution feature map; converting the corresponding region on the convolution feature map into the local feature vector; and generating the context feature vector by concatenating the converted local feature vectors; wherein the conversion into the local feature vector specifically comprises the following step: converting the corresponding region on the convolution feature map into the local feature vector of fixed length using an ROI pooling layer;
performing initialization and generation operations using the local feature vector and the context feature vector, which specifically comprises: initializing hidden states of a first long short-term memory network and a second long short-term memory network with the local feature vector; inputting the start token of a semantic description sentence into the second long short-term memory network; judging whether the end token of the sentence has been generated; and if not, feeding the embedded feature vector of the predicted word back to the second long short-term memory network to predict the next word of the semantic description sentence, while encoding the embedded feature vector into the first long short-term memory network to understand the semantic information of the target region and improve the locating of the target region; and which further comprises: initializing hidden states of the second long short-term memory network and a third long short-term memory network with the local feature vector and the context feature vector, respectively; inputting the start token of the semantic description sentence into the second and third long short-term memory networks; judging whether the end token of the sentence has been generated; and if not, feeding the embedded feature vector of the predicted word back to the second and third long short-term memory networks, and combining the hidden states of the two networks through additive fusion to predict the next word of the semantic description sentence; and
generating a bounding box of the target region and a corresponding semantic description sentence.
2. The image depth dense description method of claim 1, wherein extracting the target region of the input image comprises the steps of:
processing the input image through a deep CNN to generate an image feature map; and performing candidate region prediction on the image feature map to obtain the target region.
3. The image depth dense description method of claim 2, wherein the image feature map is used as the input to an RPN, which predicts the candidate regions through a set of translation-invariant anchor regression offsets; each candidate region is assigned a confidence score, and the target region is obtained by sampling from the candidate regions.
4. An image depth dense description system, comprising:
a region detector for receiving an input image, the region detector comprising a deep CNN for generating an image feature map, an RPN for acquiring candidate regions, and an ROI pooling layer for generating, by conversion, fixed-length local feature vectors and a context feature vector, wherein: a target region is obtained by sampling the predicted candidate regions and is mapped onto a convolution feature map, the corresponding region of the convolution feature map is converted into a fixed-length local feature vector by the ROI pooling layer, and the context feature vector is generated by concatenating the local feature vectors; and
a localization and description network comprising a joint localization module and a context reasoning module, both of which receive the local feature vector and the context feature vector output by the region detector, the joint localization module being used for locating a target region and generating a bounding box of the target region, and the context reasoning module being used for generating a semantic description sentence; specifically:
the joint localization module comprises a first long short-term memory network and a second long short-term memory network, and the hidden states of the first and second long short-term memory networks are initialized with the local feature vector; the start token of a semantic description sentence is input into the second long short-term memory network; whether the end token of the sentence has been generated is judged; and if not, the embedded feature vector of the predicted word is fed back to the second long short-term memory network to predict the next word of the semantic description sentence, while the embedded feature vector is encoded into the first long short-term memory network to understand the semantic information of the target region and improve the locating of the target region; and
the context reasoning module comprises the second long short-term memory network and a third long short-term memory network, and the hidden states of the second and third long short-term memory networks are initialized with the local feature vector and the context feature vector, respectively; the start token of the semantic description sentence is input into the second and third long short-term memory networks; whether the end token of the sentence has been generated is judged; and if not, the embedded feature vector of the predicted word is fed back to the second and third long short-term memory networks, and the hidden states of the two networks are combined through additive fusion to predict the next word of the semantic description sentence.
5. A computer-readable storage medium storing computer-executable instructions for performing the image depth dense description method of any one of claims 1 to 3.
CN202111344143.0A 2021-07-20 2021-11-12 Image depth dense description method, system and storage medium Active CN114037831B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021108203447 2021-07-20
CN202110820344 2021-07-20

Publications (2)

Publication Number Publication Date
CN114037831A (en) 2022-02-11
CN114037831B (en) 2023-08-04

Family

ID=80137574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111344143.0A Active CN114037831B (en) 2021-07-20 2021-11-12 Image depth dense description method, system and storage medium

Country Status (1)

Country Link
CN (1) CN114037831B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299067A (en) * 2022-03-04 2022-04-08 西安华创马科智能控制系统有限公司 Underground coal wall caving early warning method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10198671B1 (en) * 2016-11-10 2019-02-05 Snap Inc. Dense captioning with joint interference and visual context
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN110458282B (en) * 2019-08-06 2022-05-13 齐鲁工业大学 Multi-angle multi-mode fused image description generation method and system
CN112528989B (en) * 2020-12-01 2022-10-18 重庆邮电大学 Description generation method for semantic fine granularity of image

Also Published As

Publication number Publication date
CN114037831A (en) 2022-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant