EP3696729A1 - Method, apparatus, device and readable storage medium for image-based data processing - Google Patents

Method, apparatus, device and readable storage medium for image-based data processing

Info

Publication number
EP3696729A1
Authority
EP
European Patent Office
Prior art keywords
feature
text
image
matching
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP19210677.1A
Other languages
German (de)
English (en)
French (fr)
Inventor
Jianhui Huang
Pingping Huang
Min Qiao
Ying Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Publication of EP3696729A1

Classifications

    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06F16/583 Information retrieval of still image data; retrieval characterised by metadata automatically derived from the content
    • G06F18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06F40/30 Handling natural language data; semantic analysis
    • G06T11/60 2D image generation; editing figures and text; combining figures or text
    • G06T7/11 Image analysis; region-based segmentation
    • G06V10/40 Extraction of image or video features
    • G06V30/19173 Character recognition; classification techniques
    • G06V30/224 Character recognition of printed characters having additional code marks or containing code marks
    • G06T2207/20081 Indexing scheme for image analysis; training, learning
    • G06T2210/12 Indexing scheme for image generation; bounding box
    • G06V2201/10 Recognition assisted with metadata

Definitions

  • Embodiments of the present disclosure relate to the computer vision technology, and in particular to, a method, apparatus, device, and readable storage medium for image-based data processing.
  • Visual Question Answering (VQA) is one of the leading-edge applications of multimodal data mining, intended for natural-language question answering on visual images; as a research direction of visual understanding, it connects vision with language. VQA needs to process specific text questions based on an understanding of the images.
  • The current method for image-based data processing first extracts low-level features of an image and of a text using two different underlying representation systems, learns high-level features of the image and of the text, associates these high-level features through an associated learning module, and then processes the text.
  • The current method for image-based data processing needs to learn an association relationship between the text and each object in the image from a feature of the image and a feature of the text, so the learned association relationship has low accuracy, resulting in incorrect text processing.
  • Embodiments of the present disclosure provide a method, apparatus, device, and readable storage medium for image-based data processing, to accurately learn an association relationship between a text and each object in an image, and improve the processing accuracy.
  • an embodiment of the present disclosure provides a method for image-based data processing, including:
  • an embodiment of the present disclosure further provides an apparatus for image-based data processing, including:
  • an embodiment of the present disclosure further provides an electronic device, including:
  • an embodiment of the present disclosure further provides a computer readable storage medium, storing a computer program thereon, where the program, when executed by a processor, implements the method for image-based data processing according to any one embodiment of the present disclosure.
  • The embodiments of the present disclosure acquire an image and a to-be-processed text, extract features of a plurality of objects in the image, extract a feature of the text, and fuse the features of the plurality of objects into a fused feature of the image based on a matching degree between the feature of the text and the feature of each object. This makes full use of the prior knowledge that the feature of the text is associated with the features of particular objects, and adjusts the feature of the image based on the matching degree, such that the fused feature pays more attention to the parts strongly associated with the text and diffuse, dispersed attention is avoided. Processing accuracy can thereby be improved, based on the fused feature strongly associated with the text together with the feature of the text.
  • Fig. 1a is a flowchart of a method for image-based data processing according to Embodiment I of the present disclosure.
  • the present embodiment may be applicable to a case of identifying an image and processing a text.
  • the method may be executed by an apparatus for image-based data processing.
  • the apparatus may be composed of hardware and/or software, is generally integrated into an electronic device, and specifically includes the following operations S110 to S140.
  • S110 acquiring an image and a to-be-processed text.
  • the image may be a photo, a screenshot, a video frame, and the like.
  • the to-be-processed text is a text including free-form and open natural language related to the image.
  • Processing of the to-be-processed text includes understanding the text, such as true-or-false determination and text content interpretation.
  • The to-be-processed text further includes natural-language questions. The types of questions presented by the text include, but are not limited to, fine-grained identification (e.g., is this lady a white person?), object identification (e.g., how many bananas are there in the figure?), behavior identification (e.g., is this lady crying?), and understanding of texts included in the questions.
  • S120 extracting features of a plurality of objects in the image, and extracting a feature of the text.
  • The image is inputted into a target detection model or a classification model to extract the features of a plurality of objects in the image. Coordinates of a bounding box of each object are also extracted.
  • the target detection model or the classification model may be a target detection model or a classification model based on deep learning, e.g., R-CNN, and Fast R-CNN.
  • Fig. 1b is a schematic diagram of a bounding box of each object according to Embodiment I of the present disclosure.
  • Fig. 1b shows two objects, which are respectively a bear body and a bear paw.
  • the bounding box where the bear body is located is represented by a thick solid line, while the bounding box where the bear paw is located is represented by a thin solid line.
  • The feature of the text is extracted by a bag-of-words model or a Recurrent Neural Network (RNN).
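As a concrete illustration of the bag-of-words option, the sketch below builds a count vector over a fixed vocabulary. The vocabulary and the simple tokenizer are hypothetical illustrations, not the patent's actual text encoder (which may instead be an RNN).

```python
# Minimal bag-of-words text feature: a count vector over a fixed vocabulary.
def bag_of_words(text, vocabulary):
    """Return a count vector of `vocabulary` terms appearing in `text`."""
    tokens = text.lower().replace("?", " ?").split()
    return [tokens.count(term) for term in vocabulary]

vocab = ["can", "you", "see", "bear", "paw", "?"]
feature = bag_of_words("Can you see the bear paw?", vocab)
# Each position counts one vocabulary term's occurrences in the question.
```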
  • S130 fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects.
  • Regarding the attention mechanism: when observing an image, people actually do not look at every pixel of the whole image at once, but mostly focus attention on a specific part of the image as needed, such as a person's face. Furthermore, based on images observed before, people learn which positions attention should be focused on when observing images in the future. Similarly, for a given text, the attention paid to each object in the image also differs. For example, for "can you see the bear paw?", the attention should be focused on the bear paw in Fig. 1b. For another example, for "what is the bear's expression?", the attention should be focused on the bear head in Fig. 1b.
  • the text can be more accurately processed based on a feature of an object on which the text pays more attention.
  • the attention of the text on each object is represented using the matching degree between the feature of the text and the feature of each object.
  • The feature of each object is adjusted based on the matching degree between the feature of the text and the feature of each object. For example, a feature of an object having a great matching degree with the feature of the text is strengthened, while a feature of an object having a small matching degree with the feature of the text is weakened; the adjusted features of the objects are then fused into a new feature of the image. For ease of description and distinction, this fused new feature of the image is referred to as the fused feature of the image.
  • Fig. 1c is a schematic diagram of an image corresponding to a fused feature according to Embodiment I of the present disclosure.
  • the to-be-processed text is "can you see the bear paw?".
  • A matching degree between the feature of the text and a feature of the bear paw object is 90%, a matching degree between the feature of the text and a feature of a bear leg is 50%, and matching degrees between the feature of the text and features of other objects (e.g., a tree trunk object and a grass cluster object) are low.
  • a feature of a corresponding object is adjusted using a matching degree, and a fused feature of the image is obtained by fusion.
  • A feature of an object whose matching degree with the feature of the text is greater than or equal to a matching degree threshold is retained or strengthened; a feature of an object whose matching degree with the feature of the text is smaller than the threshold is deleted or weakened; the retained features are then fused to obtain the fused feature of the image.
  • the feature of the bear paw is strengthened, the feature of the bear leg is not changed, and the features of other objects are weakened.
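The threshold rule above can be sketched as follows. The gain factors (1.5 / 1.0 / 0.5) and the 2-dimensional placeholder feature vectors are illustrative assumptions, not values from the patent.

```python
# Sketch: strengthen features at high matching degree, keep features at the
# threshold unchanged, weaken features below the threshold, then fuse by sum.
def adjust_features(object_features, matching_degrees, threshold=0.5,
                    strengthen=1.5, weaken=0.5):
    """Scale each object feature vector by its attention gain, then sum."""
    fused = [0.0] * len(next(iter(object_features.values())))
    for name, feat in object_features.items():
        m = matching_degrees[name]
        if m > threshold:
            gain = strengthen      # strongly matching object: strengthened
        elif m == threshold:
            gain = 1.0             # at the threshold: retained unchanged
        else:
            gain = weaken          # weakly matching object: weakened
        fused = [f + gain * x for f, x in zip(fused, feat)]
    return fused

features = {"bear_paw": [1.0, 0.0], "bear_leg": [0.0, 1.0], "grass": [1.0, 1.0]}
degrees = {"bear_paw": 0.9, "bear_leg": 0.5, "grass": 0.1}
fused = adjust_features(features, degrees)
# bear paw scaled by 1.5, bear leg kept as-is, grass scaled by 0.5
```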
  • the method further includes calculating the matching degree between the feature of the text and the feature of each object.
  • the features of the plurality of objects in the image are extracted, and a category of each object is acquired, e.g., the bear paw, the bear leg, the tree trunk, the grass cluster, and the like.
  • the category of each object is searched for in the text, and the matching degree between the feature of the text and the feature of each object is determined based on a searching result.
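The category-search alternative above can be sketched as a simple substring lookup. The 0.9 / 0.1 matching-degree values assigned to found and not-found categories are illustrative assumptions.

```python
# Sketch: look up each object's category string in the question text and
# assign a matching degree from the search result.
def matching_by_category(text, categories, found=0.9, not_found=0.1):
    """Return a matching degree per object category based on a text search."""
    text_lower = text.lower()
    return {c: (found if c.lower() in text_lower else not_found)
            for c in categories}

degrees = matching_by_category("can you see the bear paw?",
                               ["bear paw", "bear leg", "tree trunk", "grass cluster"])
# "bear paw" appears in the question, so it receives the high matching degree
```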
  • S140 processing the text based on the fused feature of the image and the feature of the text.
  • The text processing includes, but is not limited to, understanding of the text (such as true-or-false determination and text content interpretation) and answering the text.
  • the fused feature of the image and the feature of the text are inputted into a visual question answer (VQA) system, to obtain an answer outputted from the VQA system.
  • The VQA system includes a combination of models such as the Deeper LSTM Q+norm I model, the VIS+LSTM model, 2-VIS+BLSTM, and IMG+BOW.
  • The embodiments of the present disclosure acquire an image and a to-be-processed text, extract features of a plurality of objects in the image, extract a feature of the text, and fuse the features of the plurality of objects into a fused feature of the image based on a matching degree between the feature of the text and the feature of each object. This makes full use of the prior knowledge that the feature of the text is associated with the features of particular objects, and adjusts the feature of the image based on the matching degree, such that the fused feature pays more attention to the parts strongly associated with the text and diffuse, dispersed attention is avoided. Processing accuracy can thereby be improved, based on the fused feature strongly associated with the text together with the feature of the text.
  • the present embodiment is further optimized on the basis of various alternative implementations of the above embodiment.
  • The step of "inputting an image portion within a bounding box corresponding to each object and the text into a matching model successively, to obtain respective matching degrees, outputted from the matching model, between the feature of each object and the features of words in the text; and obtaining the matching degree between the feature of the text and the feature of each object based on the respective matching degrees between the feature of each object and the features of the words in the text" is added.
  • Fig. 2a is a flowchart of a method for image-based data processing according to Embodiment II of the present disclosure. The method according to the present embodiment includes steps S210 to S260.
  • S210 acquiring an image and a to-be-processed text.
  • S220 extracting features of a plurality of objects in the image, and extracting a feature of the text.
  • S230 inputting an image portion within a bounding box corresponding to each object and a text into a matching model successively, to obtain respective matching degrees, outputted from the matching model, between a feature of each object and features of words in the text.
  • The image is inputted into a target detection model or a classification model to extract the features of the plurality of objects in the image and the coordinates of the bounding box of each object.
  • An image portion within the bounding box corresponding to each object is cropped from the image based on the coordinates of that bounding box, and the image portions are inputted into the matching model successively.
  • The text should also be inputted. The text may be inputted only once, without being inputted again for subsequent image portions; alternatively, the text may be inputted each time an image portion is inputted.
  • Fig. 2b is a schematic flowchart of matching performed by a matching model according to Embodiment II of the present disclosure.
  • the matching model includes: a step of extracting an image feature, a step of extracting a text feature, a step of converting an image feature dimension, a step of converting a text feature dimension, and a step of matching.
  • the step of extracting an image feature is used for extracting the feature of each object from the image portion within the bounding box corresponding to each object; the step of converting an image feature dimension is used for converting a dimension of the feature of each object into a preset dimension; the step of extracting a text feature is used for extracting the feature of each word in the text; the step of converting a text feature dimension is used for converting a dimension of the feature of each word in the text into the preset dimension; the step of matching is used for calculating the respective matching degrees between the feature of each object with the preset dimension and the features of the words with the preset dimension.
  • The image generally contains more information than the text, and the dimension of the feature of each object differs from the dimension of the feature of each word in the text.
  • For example, the dimension of the feature of each object is 1024 while the dimension of the feature of each word is 300; both need to be converted into the preset dimension to compute the matching degree between the feature of each object and the feature of each word. The extracted features are converted by matrix transformation to obtain features of a same dimension, e.g., a 600-dimensional feature.
  • the matching degree between the feature of each object with the preset dimension and the feature of each word with the preset dimension is calculated.
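The dimension-conversion step can be sketched as a matrix-vector transformation mapping a 1024-dimensional object feature and a 300-dimensional word feature into a shared 600-dimensional space, matching the example dimensions above. The random projection matrices and placeholder features stand in for learned parameters and real extracted features.

```python
# Sketch: matrix transformation into the preset dimension (600 here).
import random

def project(feature, matrix):
    """Matrix-vector product: map `feature` into the matrix's output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in matrix]

random.seed(0)
# Stand-ins for the learned transformation matrices of the matching model.
W_image = [[random.gauss(0, 0.01) for _ in range(1024)] for _ in range(600)]
W_text = [[random.gauss(0, 0.01) for _ in range(300)] for _ in range(600)]

object_feature = [1.0] * 1024   # placeholder 1024-d object feature
word_feature = [1.0] * 300      # placeholder 300-d word feature

object_600 = project(object_feature, W_image)
word_600 = project(word_feature, W_text)
# Both features now live in the shared 600-dimensional space.
```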
  • The step of matching is specifically used for calculating a respective distance between the feature of each object and the feature of each word in the text, a respective cosine similarity between them, or a combination of the distance and the cosine similarity, to obtain the matching degree between the feature of each object and the feature of each word in the text.
  • the distance includes a Euclidean distance, a Mahalanobis distance, and the like.
  • The maximum, the minimum, or the average of the distances and the cosine similarities between the feature of each object and the feature of each word in the text may be selected to obtain the matching degree between the feature of each object and the feature of each word in the text.
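One such combination can be sketched as below: a cosine term averaged with a Euclidean-distance term. Converting the distance to a similarity via 1/(1+d), and weighting the two terms equally, are illustrative choices not specified by the text.

```python
# Sketch: matching degree from cosine similarity and Euclidean distance.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matching_degree(a, b):
    """Average a cosine term and a distance-derived similarity term."""
    return 0.5 * cosine_similarity(a, b) + 0.5 / (1.0 + euclidean_distance(a, b))

# Identical vectors: cosine 1.0 and distance 0.0 give matching degree 1.0.
degree = matching_degree([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```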
  • Before inputting the image portion within the bounding box corresponding to each object and the text into the matching model successively, the method further includes training the matching model.
  • Fig. 2c is a flowchart of training the matching model according to Embodiment II of the present disclosure.
  • The training process generally includes steps I to III.
  • Step I acquiring an image portion within a bounding box corresponding to a positive sample object for training the matching model, an image portion within a bounding box corresponding to a negative sample object for training the matching model, and a label of the positive sample object.
  • the label of the positive sample object is a category of the positive sample object.
  • Annotation information of each image in the VG data set includes each object in the image, relationships, and attributes, as well as the coordinates of the bounding boxes of the objects and attributes in the image.
  • the object is strongly associated with the image portion within the corresponding bounding box.
  • the image and the label are acquired using the existing VG (Visual Genome) data set.
  • Suppose the preset positive sample object is S; then the label corresponding to the positive sample object is also S, and the negative sample object is any non-S object.
  • The image portion within the bounding box corresponding to the positive sample object is cropped based on the coordinates of the bounding box of the positive sample object S in the image, and the image portion within the bounding box corresponding to the negative sample object is cropped based on the coordinates of the bounding box of the negative sample object (non-S) in the image.
  • the positive sample object is the bear paw
  • the negative sample object is the bear body
  • the image portion within the corresponding bounding box is framed by a solid line
  • the label of the positive sample object is the bear paw.
  • Step II inputting the image portion within the bounding box corresponding to the positive sample object, the image portion within the bounding box corresponding to the negative sample object, and the label into the matching model, to obtain a first matching degree between a feature of the positive sample object and a feature of the label, and a second matching degree between a feature of the negative sample object and the feature of the label.
  • The step of extracting an image feature in the matching model extracts the feature of the positive sample object from the image portion within the bounding box corresponding to the positive sample object, and extracts the feature of the negative sample object from the image portion within the bounding box corresponding to the negative sample object.
  • the step of extracting a text feature extracts the feature of the label. Then, the step of converting an image feature dimension converts the dimension of the feature of the positive sample object and the dimension of the feature of the negative sample object into the preset dimension, and the step of converting a text feature dimension converts the dimension of the feature of the label into the preset dimension.
  • the step of matching calculates the first matching degree between the feature of the positive sample object and the feature of the label, and the second matching degree between the feature of the negative sample object and the feature of the label.
  • the first matching degree is at least one of the distance or the cosine similarity between the feature of the positive sample object and the feature of the label
  • the second matching degree is at least one of the distance or the cosine similarity between the feature of the negative sample object and the feature of the label.
  • Step III training the matching model with the goal of maximizing the first matching degree and minimizing the second matching degree, or with the goal of making the difference between the first matching degree and the second matching degree greater than a preset threshold.
  • A target function is established based on maximizing the first matching degree and minimizing the second matching degree, or based on the difference between the first matching degree and the second matching degree being greater than the preset threshold; the parameters of the matching model are then iterated based on the target function.
  • Parameters of all or a part of the steps of the matching model may be iterated. For example, the parameters of the steps of extracting an image feature, extracting a text feature, converting an image feature dimension, and converting a text feature dimension may use empirical values without iteration, while only the parameters of the step of matching are iterated.
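The objective in Step III can be sketched as a margin-based (hinge) loss that becomes zero once the first matching degree exceeds the second by at least the preset threshold. The margin value 0.2 is an illustrative assumption; an actual training loop would backpropagate this loss through the matching model.

```python
# Sketch: hinge loss pushing (first - second) above the preset threshold.
def margin_loss(first_matching_degree, second_matching_degree, margin=0.2):
    """Zero once positive-vs-label beats negative-vs-label by the margin."""
    return max(0.0, margin - (first_matching_degree - second_matching_degree))

# Positive pair already well separated from the negative pair: zero loss.
loss_good = margin_loss(0.9, 0.1)
# Negative sample matches the label too well: positive loss drives training.
loss_bad = margin_loss(0.4, 0.5)
```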
  • S240 obtaining the matching degree between the feature of the text and the feature of each object based on the respective matching degrees between the feature of each object and the features of words in the text.
  • A maximum matching degree or an average matching degree over the respective matching degrees between the feature of each object and the features of the words in the text is calculated, for use as the matching degree between the feature of the text and the feature of each object.
  • the matching degrees between the feature of "you” and the feature of the bear paw, between the feature of "can” and the feature of the bear paw, between the feature of "see” and the feature of the bear paw, between the feature of "bear paw” and the feature of the bear paw, and between the feature of "?” and the feature of the bear paw are 10%, 10%, 10%, 90%, and 10% respectively, and then the matching degree between the text and the feature of the bear paw is: the maximum matching degree 90%, or the average matching degree 26%.
  • the matching degrees between the feature of "you” and the cluster glass, between the feature of "can” and the cluster glass, between the feature of "see” and the cluster glass, between the feature of "bear paw” and the cluster glass, and between the feature of "?” and the cluster glass are 15%, 10%, 10%, 10%, and 10% respectively, and then the matching degree between the text and the feature of the grass cluster is: the maximum matching degree 15%, or the average matching degree 11%.
  • S250 fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects.
  • S260 processing the text based on the fused feature of the image and the feature of the text.
  • Fig. 2d is a flowchart of a method for image-based data processing using a matching model according to Embodiment II of the present disclosure.
  • An input of an apparatus for image-based data processing is the text "can you see the bear paw?", and the image is shown in Fig. 1b .
  • The apparatus for image-based data processing not only extracts the feature of the text, but also obtains the matching degree via the matching model, fuses the features of the plurality of objects into the fused feature of the image based on the matching degree, performs fusion and reclassification on the feature of the text and the fused feature, and then processes the text.
  • The image portion within the bounding box corresponding to each object and the text are inputted into the matching model successively, to obtain the respective matching degrees, outputted from the matching model, between the feature of each object and the features of words in the text, thus directly obtaining the matching degree between the feature of the object and the feature of each word based on a pre-trained matching model. This reflects, from the perspective of the image, which words in the text the local features correspond to, and obtains, from the perspective of the text, the local information in the image corresponding to the words.
  • The matching degree between the object and the text is refined into the matching degree between the object and each word, so fine-grained, accurate associations between the local features of the image and the words are pre-learned.
  • The matching degree between the feature of the text and the feature of each object is obtained based on the respective matching degrees between the feature of each object and the features of the words in the text, such that the matching degree between the text and each object is obtained comprehensively from the per-word matching degrees, improving the accuracy of the matching degree and further improving the text processing accuracy.
  • The present embodiment trains the matching model using positive and negative samples, reducing the distance between the matching positive sample object and its label and increasing the distance between the unmatched negative sample object and the label, which can effectively improve the model training accuracy.
  • The samples for pre-training the matching model include only the image portions within the bounding boxes and the labels. Compared with VQA data comprising images, questions, and answers, such samples can be acquired through extensive channels, so the application scenarios are wide and the method is easily extended.
  • The method for image-based data processing using a matching model according to the present embodiment is a universal and easily extensible multimodal learning method with wide application scenarios and low application costs. As long as the computing flow of the original task is not changed greatly, the matching model can be applied to almost all multimodal tasks.
  • The present embodiment makes full use of the strong correlation between the label of the object and the object, and between the label and the text, and helps the apparatus for image-based data processing to better learn the association between the object and the text.
  • Fig. 3 is a flowchart of a method for image-based data processing according to Embodiment III of the present disclosure.
  • the embodiment of the present disclosure specifies operations on the basis of the technical solutions of the above embodiments.
  • the "fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects" is specific to "performing a weighted summation of the features of the objects based on the matching degree between the feature of the text and the feature of each object, to obtain the fused feature of the image.”
  • the method for image-based data processing as shown in Fig. 3 includes steps S310 to S340.
  • S310 acquiring an image and a to-be-processed text.
  • S320 extracting features of a plurality of objects in the image, and extracting a feature of the text.
  • S330 performing a weighted summation of features of objects based on a matching degree between the feature of the text and the feature of each object, to obtain the fused feature of the image.
  • the matching degree between the feature of the text and the feature of each object may be obtained using the following alternative embodiments.
  • In the first alternative embodiment, the image portion within the bounding box corresponding to each object and the text are inputted into the matching model successively, to obtain the respective matching degrees, outputted from the matching model, between the feature of each object and the features of the words in the text; the matching degree between the feature of the text and the feature of each object is then obtained based on these respective matching degrees.
  • in the second alternative embodiment, a category of each object is acquired; the category of each object is searched for in the text, and the matching degree between the feature of the text and the feature of each object is determined based on the search result.
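A minimal sketch of this second alternative in Python; the binary 1.0/0.0 matching degrees and the case-insensitive substring search are illustrative assumptions, since the patent does not fix specific values or a search method:

```python
def category_match_degree(category: str, text: str) -> float:
    """Search for an object's category name in the text and derive a
    matching degree from the search result. The 1.0/0.0 values are
    illustrative; the patent does not prescribe specific numbers."""
    return 1.0 if category.lower() in text.lower() else 0.0

text = "can you see the bear paw?"
degrees = {c: category_match_degree(c, text)
           for c in ["bear paw", "grass cluster", "tree trunk"]}
print(degrees)  # {'bear paw': 1.0, 'grass cluster': 0.0, 'tree trunk': 0.0}
```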
  • the matching degree between the feature of the text and the feature of each object is used as the weight of that object, and a weighted summation of the features of the objects is performed using these weights, to obtain the fused feature of the image.
  • the matching degree between the feature of the text "can you see the bear paw?" and the feature of the bear paw is 90%
  • the matching degree between the feature of the text "can you see the bear paw?" and the feature of the grass cluster is 10%
  • the matching degree between the feature of the text "can you see the bear paw?" and the feature of the tree trunk is 10%
  • the matching degree between the feature of the text "can you see the bear paw?" and the feature of the bear leg is 50%
  • the fused feature of the image is 90% × the feature of the bear paw + 10% × the feature of the grass cluster + 10% × the feature of the tree trunk + 50% × the feature of the bear leg.
  • the features of the plurality of objects used here are the features without dimension conversion, i.e., the feature of each object as extracted from the image portion within the bounding box corresponding to that object.
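The weighted summation of S330 can be sketched as follows, reusing the matching degrees from the bear-paw example; the one-hot feature vectors are placeholders for the real object features, not values from the patent:

```python
import numpy as np

def fuse_features(object_features, matching_degrees):
    """Fused image feature = sum over objects of (matching degree x object feature)."""
    weights = np.asarray(matching_degrees, dtype=float)
    features = np.asarray(object_features, dtype=float)
    return weights @ features  # sum_i w_i * f_i

# Placeholder 4-d features for: bear paw, grass cluster, tree trunk, bear leg
object_features = np.eye(4)
matching_degrees = [0.90, 0.10, 0.10, 0.50]
fused = fuse_features(object_features, matching_degrees)
print(fused)  # [0.9 0.1 0.1 0.5]
```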
  • S340 processing the text based on the fused feature of the image and the feature of the text.
  • the present embodiment replaces high-level features of the image with the fused feature of the image.
  • the fused feature carries prior knowledge of the matching degree between local features of the image and the text, and thus helps to improve the text processing accuracy. For example, since there is a high matching degree between the feature of the "bear paw" in the text and the feature of the bear paw object, the apparatus for image-based data processing can accurately find the area corresponding to the "bear paw" in the image, and then obtain the correct answer "yes" after analysis.
  • Fig. 4 is a schematic structural diagram of an apparatus for image-based data processing according to Embodiment IV of the present disclosure.
  • the embodiment of the present disclosure is applicable to a case of identifying an image and processing a text.
  • the apparatus for image-based data processing includes: an acquiring module 410, an extracting module 420, a fusing module 430, and a processing module 440.
  • the acquiring module 410 is configured to acquire an image and a to-be-processed text.
  • the extracting module 420 is configured to extract features of a plurality of objects in the image, and extract a feature of the text.
  • the fusing module 430 is configured to fuse the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects.
  • the processing module 440 is configured to process the text based on the fused feature of the image obtained by the fusing module 430 and the feature of the text extracted by the extracting module 420.
  • the embodiments of the present disclosure acquire an image and a to-be-processed text, extract features of a plurality of objects in the image, extract a feature of the text, and fuse the features of the plurality of objects into a fused feature of the image based on a matching degree between the feature of the text and the feature of each object. The embodiments thereby make full use of the prior knowledge that the feature of the text is associated with the features of the objects, and adjust the feature of the image based on the matching degree, such that the fused feature pays more attention to the parts strongly associated with the text and attention is not dispersed; processing accuracy can then be improved based on the fused feature strongly associated with the text and the feature of the text.
  • the apparatus further includes a first matching degree acquiring module configured to, before the fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects, input an image portion within a bounding box corresponding to each object and a text into a matching model successively, to obtain respective matching degrees, outputted from the matching model, between a feature of each object and features of words in the text; and obtain the matching degree between the feature of the text and the feature of each object based on the respective matching degrees between the feature of each object and the features of words in the text.
  • the matching model includes: a step of extracting an image feature, a step of extracting a text feature, a step of converting an image feature dimension, a step of converting a text feature dimension, and a step of matching.
  • the step of extracting an image feature is used for extracting the feature of each object from the image portion within the bounding box corresponding to each object; the step of converting an image feature dimension is used for converting a dimension of the feature of each object into a preset dimension; the step of extracting a text feature is used for extracting the feature of each word in the text; the step of converting a text feature dimension is used for converting a dimension of the feature of each word in the text into the preset dimension; and the step of matching is used for calculating the matching degree between the feature of each object with the preset dimension and the feature of each word with the preset dimension.
  • the step of matching is specifically used for: calculating respective distances and/or cosine similarities between the feature of each object with the preset dimension and the features of the words in the text with the preset dimension, to obtain the matching degrees between the feature of each object and the features of the words in the text.
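As an illustration, the dimension-conversion and matching steps can be sketched with simple linear projections and cosine similarity; the 2048/300/64 dimensions and the random projection matrices are assumptions for the example, not values from the patent:

```python
import numpy as np

def match_object_to_words(obj_feat, word_feats, W_img, W_txt):
    """Convert the object feature and the word features to a shared preset
    dimension, then return the cosine similarity between the object and
    each word as the per-word matching degrees."""
    obj = W_img @ obj_feat            # image feature dimension conversion
    words = word_feats @ W_txt.T      # text feature dimension conversion
    obj = obj / np.linalg.norm(obj)
    words = words / np.linalg.norm(words, axis=1, keepdims=True)
    return words @ obj                # one cosine similarity per word

rng = np.random.default_rng(0)
W_img = rng.standard_normal((64, 2048))  # e.g. 2048-d object feature -> 64-d
W_txt = rng.standard_normal((64, 300))   # e.g. 300-d word embedding -> 64-d
scores = match_object_to_words(rng.standard_normal(2048),
                               rng.standard_normal((5, 300)), W_img, W_txt)
print(scores.shape)  # (5,)
```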
  • the apparatus further includes a model training module configured to: before the image portion within the bounding box corresponding to each object and the text are inputted into the matching model, acquire an image portion within a bounding box corresponding to a positive sample object for training the matching model, an image portion within a bounding box corresponding to a negative sample object for training the matching model, and a label of the positive sample object; input the image portion within the bounding box corresponding to the positive sample object, the image portion within the bounding box corresponding to the negative sample object, and the label into the matching model, to obtain a first matching degree between a feature of the positive sample object and a feature of the label, and a second matching degree between a feature of the negative sample object and the feature of the label; and train the matching model with the goal of maximizing the first matching degree and minimizing the second matching degree, or of making the difference between the first matching degree and the second matching degree greater than a preset threshold.
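The training objective described above amounts to a margin criterion: drive the positive-sample matching degree above the negative-sample one by at least the preset threshold. A hinge-style sketch, where the margin value 0.2 is an assumed example rather than a value from the patent:

```python
def margin_loss(pos_degree: float, neg_degree: float, margin: float = 0.2) -> float:
    """Zero loss once the first (positive) matching degree exceeds the
    second (negative) one by at least `margin`; otherwise a positive loss
    that training would minimize."""
    return max(0.0, margin - (pos_degree - neg_degree))

print(margin_loss(0.9, 0.2))   # 0.0 -> difference 0.7 already exceeds the threshold
loss = margin_loss(0.5, 0.45)  # ~0.15 -> difference too small, loss pushes the degrees apart
```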
  • the first matching degree acquiring module is specifically configured to, when obtaining the matching degree between the feature of the text and the feature of each object based on the respective matching degrees between the feature of each object and the features of the words in the text: calculate the maximum matching degree or the average matching degree over the respective matching degrees between the feature of each object and the features of the words in the text, for use as the matching degree between the feature of the text and the feature of each object.
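The max-or-average reduction described here can be sketched as follows; the per-word degrees are made-up example numbers:

```python
def text_level_degree(word_degrees, mode="max"):
    """Collapse the per-word matching degrees of one object into a single
    text-level matching degree, via the maximum or the average."""
    if mode == "max":
        return max(word_degrees)
    if mode == "average":
        return sum(word_degrees) / len(word_degrees)
    raise ValueError("mode must be 'max' or 'average'")

word_degrees = [0.25, 0.5, 0.75, 0.5]  # made-up degrees for one object vs. each word
print(text_level_degree(word_degrees, "max"))      # 0.75
print(text_level_degree(word_degrees, "average"))  # 0.5
```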
  • the apparatus further includes a second matching degree acquiring module configured to, before the fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects, acquire a category of each object; search for the category of each object in the text, and determine the matching degree between the feature of the text and the feature of each object based on a search result.
  • the fusing module 430 is specifically configured to, when fusing the features of the plurality of objects as a fused feature of the image based on a matching degree between the feature of the text and a feature of each object of the plurality of objects: perform a weighted summation of the features of objects based on the matching degree between the feature of the text and the feature of each object, to obtain the fused feature of the image.
  • the apparatus for image-based data processing according to the embodiment of the present disclosure may execute the method for image-based data processing according to any embodiment of the present disclosure, and has corresponding function modules for executing the method and beneficial effects.
  • Fig. 5 is a schematic structural diagram of an electronic device according to Embodiment V of the present disclosure.
  • Fig. 5 shows a block diagram of an example electronic device 12 adapted to implement the embodiments of the present disclosure.
  • the electronic device 12 shown in Fig. 5 is merely an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 12 is expressed in the form of a general-purpose computing device.
  • Components of the electronic device 12 may include, but are not limited to: one or more processors or a processing unit 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • the bus 18 represents one or more of several types of bus structures, including a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures.
  • such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • the electronic device 12 typically includes a plurality of computer system readable media. These media may be any available medium that can be accessed by the electronic device, including volatile media, non-volatile media, removable media and non-removable media.
  • the system memory 28 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32.
  • the electronic device 12 may further include other removable/non-removable, and volatile/non-volatile computer system storage media.
  • a storage system 34 may be used for reading from and writing to non-removable, non-volatile magnetic media (not shown in Fig. 5 , generally known as a "hard drive").
  • although not shown in Fig. 5 , a disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM, or other optical media) may also be provided.
  • each drive may be connected to the bus 18 through one or more data media interfaces.
  • the memory 28 may include at least one program product having a set of (e.g., at least one) program modules configured to execute the functions of the embodiments of the present disclosure.
  • a program/utility 40 with a set of (at least one) program modules 42 may be stored in, e.g., the memory 28.
  • the program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the program module 42 generally executes the functions and/or methods in the embodiments according to the present disclosure.
  • the electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, and a camera), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card or a modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface 22.
  • the electronic device 12 may further communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20.
  • the network adapter 20 communicates with other modules of the electronic device 12 through the bus 18.
  • other hardware and/or software modules may be used in combination with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, though these modules are not shown in the figure.
  • the processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, such as implementing the method for image-based data processing according to the embodiments of the present disclosure.
  • Embodiment VI of the present disclosure further provides a computer readable storage medium, storing a computer program thereon, where the program, when executed by a processor, implements the method for image-based data processing according to any one embodiment of the present disclosure.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • An example of the computer readable storage medium may include, but is not limited to: electric, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, elements, or a combination of any of the above.
  • a more specific (non-exhaustive) list of examples of the computer readable storage medium includes: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM or flash memory), an optical fiber, a portable compact disk read only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the above.
  • the computer readable storage medium may be any tangible medium containing or storing programs which may be used by, or used in combination with, a command execution system, apparatus, or element.
  • the computer readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which computer readable program code is carried.
  • the propagating data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • the computer readable signal medium may also be any computer readable medium except for the computer readable storage medium.
  • the computer readable medium is capable of transmitting, propagating or transferring programs for use by, or used in combination with, a command execution system, apparatus or element.
  • the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • a computer program code for executing operations in the present disclosure may be compiled using one or more programming languages or combinations thereof.
  • the programming languages include object-oriented programming languages, such as Java, Smalltalk or C++, and also include conventional procedural programming languages, such as "C" language or similar programming languages.
  • the program code may be completely executed on a user's computer, partially executed on a user's computer, executed as a separate software package, partially executed on a user's computer and partially executed on a remote computer, or completely executed on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
EP19210677.1A 2019-02-12 2019-11-21 Method, apparatus, device and readable storage medium for image-based data processing Pending EP3696729A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910111412.5A CN109858555B (zh) 2019-02-12 2019-02-12 Image-based data processing method, apparatus, device and readable storage medium

Publications (1)

Publication Number Publication Date
EP3696729A1 true EP3696729A1 (en) 2020-08-19

Family

ID=66897798

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19210677.1A Pending EP3696729A1 (en) 2019-02-12 2019-11-21 Method, apparatus, device and readable storage medium for image-based data processing

Country Status (5)

Country Link
US (1) US11151406B2 (zh)
EP (1) EP3696729A1 (zh)
JP (1) JP6893233B2 (zh)
KR (1) KR102266529B1 (zh)
CN (1) CN109858555B (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163465A (zh) * 2020-09-11 2021-01-01 South China University of Technology Fine-grained image classification method, system, computer device and storage medium
CN113597614A (zh) * 2020-12-31 2021-11-02 Sensetime International Pte Ltd Image processing method and apparatus, electronic device and storage medium

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334749B (zh) * 2019-06-20 2021-08-03 Zhejiang University of Technology Adversarial attack defense model based on attention mechanism, construction method and application thereof
CN111125422B (zh) * 2019-12-13 2024-04-02 Beijing Dajia Internet Information Technology Co Ltd Image classification method and apparatus, electronic device and storage medium
US11645505B2 (en) * 2020-01-17 2023-05-09 Servicenow Canada Inc. Method and system for generating a vector representation of an image
CN111611420B (zh) * 2020-05-26 2024-01-23 Beijing Bytedance Network Technology Co Ltd Method and apparatus for generating image description information
CN111782838B (zh) * 2020-06-30 2024-04-05 Beijing Baidu Netcom Science and Technology Co Ltd Image question answering method, apparatus, computer device and medium
CN114969417B (zh) * 2020-09-23 2023-04-11 Huawei Technologies Co Ltd Image re-ranking method, related device and computer readable storage medium
CN112100358A (zh) * 2020-09-27 2020-12-18 Sichuan Changhong Electric Co Ltd Visual question answering method and system based on a matching algorithm
CN113903432B (zh) * 2020-11-18 2024-08-27 Suzhou Ludian Information Technology Co Ltd Image resolution improvement method and apparatus, electronic device and storage medium
CN113516143B (zh) * 2020-11-26 2024-08-27 Tencent Technology (Shenzhen) Co Ltd Text-image matching method and apparatus, computer device and storage medium
CN112417132B (zh) * 2020-12-17 2023-11-17 Nanjing University New-intent recognition method that filters negative samples using predicate-object information
CN112541475B (zh) * 2020-12-24 2024-01-19 Beijing Baidu Netcom Science and Technology Co Ltd Perception data detection method and apparatus
CN112580620A (zh) * 2020-12-25 2021-03-30 Beijing Baidu Netcom Science and Technology Co Ltd Sign picture processing method, apparatus, device and medium
CN112613293B (zh) * 2020-12-29 2024-05-24 Beijing Zhongke Wenge Technology Co Ltd Abstract generation method and apparatus, electronic device and storage medium
KR102533775B1 (ko) * 2020-12-31 2023-05-19 Chung-Ang University Industry-Academic Cooperation Foundation Apparatus and method for data classification using integrated data analysis learning
CN112926586A (zh) * 2021-02-19 2021-06-08 Beijing Dami Future Technology Co Ltd Text recognition method and apparatus, readable storage medium and electronic device
CN113033307B (zh) * 2021-02-22 2024-04-02 Zhejiang Dahua Technology Co Ltd Object matching method and apparatus, storage medium and electronic apparatus
KR102279797B1 (ko) * 2021-03-05 2021-07-21 Chonnam National University Industry-Academic Cooperation Foundation Multimodal data fusion system and method
CN113222026B (zh) * 2021-05-18 2022-11-11 Hefei University of Technology Visual question answering method, system and server for locomotive depot scenes
CN113342995B (zh) * 2021-07-05 2022-12-02 Chengdu University of Information Technology Negative sample extraction method based on path semantics and feature extraction
CN113609279B (zh) * 2021-08-05 2023-12-08 Hunan Teneng Boshi Technology Co Ltd Material model extraction method and apparatus, and computer device
CN113709548B (zh) * 2021-08-09 2023-08-25 Beijing Dajia Internet Information Technology Co Ltd Image-based multimedia data synthesis method, apparatus, device and storage medium
CN114329068B (zh) * 2021-08-11 2024-05-31 Tencent Technology (Shenzhen) Co Ltd Data processing method and apparatus, electronic device, and storage medium
CN113792617B (zh) * 2021-08-26 2023-04-18 University of Electronic Science and Technology of China Image interpretation method combining image information and text information
JPWO2023157265A1 (zh) * 2022-02-18 2023-08-24
CN114626455A (zh) * 2022-03-11 2022-06-14 Beijing Baidu Netcom Science and Technology Co Ltd Financial information processing method, apparatus, device, storage medium and product
CN115690149B (zh) * 2022-09-27 2023-10-20 Jiangsu Shengli Intelligent Technology Co Ltd Image fusion processing system and method for a display
CN115456176B (zh) * 2022-10-10 2023-07-21 Yanbian University Knowledge-enhanced text matching method and system
CN115661727B (zh) * 2022-12-27 2023-04-28 Suzhou Inspur Intelligent Technology Co Ltd Video action localization method and apparatus, electronic device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3166049A1 (en) * 2015-11-03 2017-05-10 Baidu USA LLC Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9384408B2 (en) * 2011-01-12 2016-07-05 Yahoo! Inc. Image analysis system and method using image recognition and text search
US10489448B2 (en) 2016-06-02 2019-11-26 Baidu Usa Llc Method and system for dynamically ranking images to be matched with content in response to a search query
US20170364492A1 (en) * 2016-06-20 2017-12-21 Machine Learning Works, LLC Web content enrichment based on matching images to text
US10198671B1 (en) * 2016-11-10 2019-02-05 Snap Inc. Dense captioning with joint interference and visual context
CN110532571B (zh) 2017-09-12 2022-11-18 Tencent Technology (Shenzhen) Co Ltd Text processing method and related apparatus
CN108228703B (zh) 2017-10-31 2020-05-08 Beijing Sensetime Technology Development Co Ltd Image question answering method, apparatus, system and storage medium
CN108446404B (zh) * 2018-03-30 2021-01-05 Institute of Automation, Chinese Academy of Sciences Retrieval method and system for pointing questions in unconstrained visual question answering
CN108920587B (zh) * 2018-06-26 2021-09-24 Tsinghua University Open-domain visual question answering method and apparatus fusing external knowledge
CN108898185A (zh) * 2018-07-03 2018-11-27 Beijing Bytedance Network Technology Co Ltd Method and apparatus for generating an image recognition model
CN109241267B (zh) * 2018-09-27 2022-07-01 Beijing Baidu Netcom Science and Technology Co Ltd Method, apparatus, device and medium for generating training data for a VQA system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3166049A1 (en) * 2015-11-03 2017-05-10 Baidu USA LLC Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HASSAN AKBARI ET AL: "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 November 2018 (2018-11-28), XP080940085 *
ZHOU TAO ET AL: "Attention-Based Natural Language Person Retrieval", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 21 July 2017 (2017-07-21), pages 27 - 34, XP033145753, DOI: 10.1109/CVPRW.2017.10 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163465A (zh) * 2020-09-11 2021-01-01 South China University of Technology Fine-grained image classification method, system, computer device and storage medium
CN112163465B (zh) * 2020-09-11 2022-04-22 South China University of Technology Fine-grained image classification method, system, computer device and storage medium
CN113597614A (zh) * 2020-12-31 2021-11-02 Sensetime International Pte Ltd Image processing method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
JP2020135852A (ja) 2020-08-31
KR20200098379A (ko) 2020-08-20
US11151406B2 (en) 2021-10-19
CN109858555B (zh) 2022-05-17
CN109858555A (zh) 2019-06-07
JP6893233B2 (ja) 2021-06-23
US20200257922A1 (en) 2020-08-13
KR102266529B1 (ko) 2021-06-17

Similar Documents

Publication Publication Date Title
US11151406B2 (en) Method, apparatus, device and readable storage medium for image-based data processing
CN110232340B (zh) 建立视频分类模型以及视频分类的方法、装置
CN114298121B (zh) 基于多模态的文本生成方法、模型训练方法和装置
CN110598603A (zh) 人脸识别模型获取方法、装置、设备和介质
CN115526259A (zh) 一种多模态预训练模型的训练方法和装置
US11645478B2 (en) Multi-lingual tagging for digital images
CN110263218B (zh) 视频描述文本生成方法、装置、设备和介质
CN112085120A (zh) 多媒体数据的处理方法、装置、电子设备及存储介质
CN117011581A (zh) 图像识别方法、介质、装置和计算设备
CN117093687A (zh) 问题应答方法和装置、电子设备、存储介质
CN109408175B (zh) 通用高性能深度学习计算引擎中的实时交互方法及系统
CN111125550B (zh) 兴趣点分类方法、装置、设备及存储介质
CN118350464A (zh) 基于任意粒度文本输入的对话式目标定位方法及装置
CN117235605B (zh) 一种基于多模态注意力融合的敏感信息分类方法及装置
CN112926700A (zh) 针对目标图像的类别识别方法和装置
CN112446360A (zh) 目标行为检测方法、装置及电子设备
CN116186255A (zh) 训练未知意图检测模型的方法、未知意图检测方法及装置
CN113177479B (zh) 图像分类方法、装置、电子设备及存储介质
CN114842261A (zh) 图像处理方法、装置、电子设备及存储介质
CN115221298A (zh) 问答匹配方法、装置、电子设备及存储介质
CN115510457A (zh) 数据识别方法、装置、设备及计算机程序产品
Kwon et al. An introduction to face-recognition methods and its implementation in software applications
Padmanand et al. Malaysian Sign Language Recognition Using 3D Hand Pose Estimation
CN117611845B (zh) 多模态数据的关联识别方法、装置、设备及存储介质
CN117557871B (zh) 三维模型标注方法、装置、设备及存储介质

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20191121

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20220301