CN109614613B - Image description statement positioning method and device, electronic equipment and storage medium - Google Patents

Image description statement positioning method and device, electronic equipment and storage medium

Info

Publication number
CN109614613B
CN109614613B
Authority
CN
China
Prior art keywords
image
sample
analyzed
sentence
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811459428.7A
Other languages
Chinese (zh)
Other versions
CN109614613A
Inventor
刘希慧
邵婧
王子豪
李鸿升
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201811459428.7A priority Critical patent/CN109614613B/en
Publication of CN109614613A publication Critical patent/CN109614613A/en
Priority to KR1020207008623A priority patent/KR102454930B1/en
Priority to PCT/CN2019/086274 priority patent/WO2020107813A1/en
Priority to JP2020517564A priority patent/JP6968270B2/en
Priority to SG11202003836YA priority patent/SG11202003836YA/en
Priority to TW108142397A priority patent/TWI728564B/en
Priority to US16/828,226 priority patent/US11455788B2/en
Application granted granted Critical
Publication of CN109614613B publication Critical patent/CN109614613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/30Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1916Validation; Performance evaluation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method and an apparatus for locating a descriptive sentence in an image, an electronic device, and a storage medium. The method includes the following steps: analyzing a descriptive sentence to be analyzed and an image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed; obtaining a plurality of first matching scores according to the sentence attention weights and the subject feature, position feature, and relationship feature of the image to be analyzed; obtaining a second matching score between the descriptive sentence to be analyzed and the image to be analyzed according to the first matching scores and the image attention weights; and determining a positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score. Embodiments of the present disclosure can improve the accuracy of locating a descriptive sentence in an image.

Description

Image description statement positioning method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for locating descriptive sentences of an image, an electronic device, and a storage medium.
Background
Phrase localization is an important problem at the intersection of computer vision and natural language processing: for example, a machine may be required to locate, in an image, the object (a person, an object, etc.) described by a given phrase or sentence. In the related art, combined modular networks composed of a localization module, a relationship module, and the like have been proposed for identifying objects and their relationships; however, these models may rely excessively on specific words or visual concepts and favor frequently observed evidence, resulting in poor correspondence between sentences and images.
Disclosure of Invention
The present disclosure provides a technical solution for positioning descriptive sentences of an image.
According to an aspect of the present disclosure, there is provided a descriptive sentence positioning method for an image, including: analyzing the descriptive sentence to be analyzed and the image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed; obtaining a plurality of first matching scores according to the plurality of sentence attention weights and a main feature, a position feature and a relation feature of an image to be analyzed, wherein the image to be analyzed comprises a plurality of objects, the main object is an object with the highest attention weight in the plurality of objects, the main feature is a feature of the main object, the position feature is a position feature of the plurality of objects, and the relation feature is a relation feature among the plurality of objects; obtaining a second matching score between the descriptive statement to be analyzed and the image to be analyzed according to the first matching scores and the image attention weights; and determining a positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score.
In a possible implementation manner, analyzing a descriptive sentence to be analyzed and an image to be analyzed respectively to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed, includes: extracting the features of the image to be analyzed to obtain an image feature vector of the image to be analyzed; performing feature extraction on the descriptive statement to be analyzed to obtain participle embedded vectors of a plurality of participles of the descriptive statement to be analyzed; and obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the plurality of participles.
In one possible implementation, the method further includes: and acquiring a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed through a neural network.
In one possible implementation, the plurality of sentence attention weights includes a sentence subject weight, a sentence position weight, and a sentence relationship weight, the neural network includes an image attention network, the image attention network includes a subject network, a position network, and a relationship network, the plurality of first matching scores includes a subject matching score, a position matching score, and a relationship matching score;
Obtaining the plurality of first matching scores according to the plurality of sentence attention weights and the subject feature, the position feature, and the relationship feature of the image to be analyzed includes: inputting the sentence subject weight and the subject feature into the subject network for processing to obtain the subject matching score; inputting the sentence position weight and the position feature into the position network for processing to obtain the position matching score; and inputting the sentence relation weight and the relationship feature into the relationship network for processing to obtain the relationship matching score.
In a possible implementation manner, the obtaining a second matching score between the statement to be analyzed and the image to be analyzed according to the plurality of first matching scores and the plurality of image attention weights includes:
and carrying out weighted average on the subject matching score, the position matching score and the relation matching score according to the subject object weight, the object position weight and the object relation weight, and determining the second matching score.
In one possible implementation, the method further includes: and inputting the image to be analyzed into a feature extraction network for processing to obtain the main body feature, the position feature and the relation feature.
In a possible implementation manner, determining a positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score includes: and determining the image area of the main object as the positioning position of the descriptive statement to be analyzed when the second matching score is greater than or equal to a preset threshold value.
In a possible implementation manner, before obtaining, by a neural network, a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed, the method further includes: training the neural network by using a sample set, wherein the sample set comprises a plurality of positive sample pairs and a plurality of negative sample pairs, each positive sample pair comprises a first sample image and a first sample description sentence thereof, and each negative sample pair comprises the first sample image and a second sample description sentence with a word segmentation removed from the first sample description sentence, or the first sample description sentence and the second sample image with a region removed from the first sample image.
In one possible implementation, the neural network further includes a language attention network, the method further comprising: inputting a first sample description sentence and a first sample image of the positive sample pair into the language attention network to obtain attention weights of a plurality of word segments of the first sample description sentence; replacing the participle with the highest attention weight in the first sample description sentence by a preset mark to obtain a second sample description sentence; and taking the first sample image and the second sample description sentence as a negative sample pair.
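As a non-authoritative illustration of this step, the following Python sketch replaces the participle with the highest attention weight by a placeholder mark; the token name "<unk>" and the list-based sentence representation are assumptions chosen only for the example.

```python
def make_negative_description(tokens, word_attention, unk_token="<unk>"):
    """Sketch: build a second sample description sentence by replacing the
    participle with the highest attention weight with a preset mark."""
    idx = max(range(len(tokens)), key=lambda i: word_attention[i])
    negative = list(tokens)
    negative[idx] = unk_token  # the preset mark; "<unk>" is an assumed choice
    return negative

# Example: make_negative_description(["a", "brown", "horse"], [0.1, 0.2, 0.7])
# -> ["a", "brown", "<unk>"]
```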
In one possible implementation, the method further includes: inputting a first sample description sentence and a first sample image of the positive sample pair into the image attention network to obtain the attention weight of the first sample image; removing an image area with the highest attention weight in the first sample image to obtain a second sample image; and taking the second sample image and the first sample description sentence as a negative sample pair.
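A corresponding sketch for the image branch is given below; it assumes the first sample image is a (C, H, W) tensor, that attention weights are available per candidate region, and that "removing" a region simply means filling it with a constant. None of these choices is prescribed by the patent text.

```python
import torch

def make_negative_image(image, region_boxes, region_attention, fill=0.0):
    """Sketch: build a second sample image by erasing the image region with
    the highest attention weight. image: (C, H, W) tensor; box: (x1, y1, x2, y2)."""
    idx = max(range(len(region_attention)), key=lambda i: region_attention[i])
    x1, y1, x2, y2 = region_boxes[idx]
    erased = image.clone()
    erased[:, y1:y2, x1:x2] = fill  # remove the most-attended region
    return erased
```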
In one possible implementation, training the neural network with a sample set includes: determining an overall loss of the neural network from the first loss and the second loss of the neural network; training the neural network according to the overall loss.
In one possible implementation, before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the first loss; the step of obtaining the first loss comprises: inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score; inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score; a first loss is obtained based on the plurality of first training scores and the plurality of second training scores.
In one possible implementation, before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the second loss; the step of obtaining the second loss comprises: inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score; inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score; inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score; inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score; and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
In one possible implementation, determining an overall loss of the neural network from the first loss and the second loss of the neural network comprises: and performing weighted superposition on the first loss and the second loss to obtain the total loss of the neural network.
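The patent text does not spell out the functional form of the first and second losses; the sketch below assumes a hinge-style ranking formulation (scores of matched pairs should exceed scores of mismatched pairs by a margin) and a weighted superposition, with the margin and the weights alpha and beta chosen arbitrarily for illustration.

```python
import torch.nn.functional as F

def ranking_loss(same_pair_scores, diff_pair_scores, margin=0.1):
    """Assumed hinge-style ranking loss: matched-pair scores should exceed
    mismatched-pair scores by at least `margin` (margin is an assumption)."""
    return F.relu(margin + diff_pair_scores - same_pair_scores).mean()

def overall_loss(s1_same, s2_diff,      # first/second training scores (positive pairs)
                 s3_same, s4_diff,      # third/fourth training scores (erased image)
                 s5_same, s6_diff,      # fifth/sixth training scores (erased sentence)
                 alpha=1.0, beta=1.0):
    """Sketch: weighted superposition of the first loss and the second loss.
    alpha and beta are illustrative weights, not values from the patent."""
    first_loss = ranking_loss(s1_same, s2_diff)
    second_loss = ranking_loss(s3_same, s4_diff) + ranking_loss(s5_same, s6_diff)
    return alpha * first_loss + beta * second_loss
```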
According to an aspect of the present disclosure, there is provided an image descriptive sentence positioning apparatus including: the first weight obtaining module is used for analyzing and processing the descriptive sentence to be analyzed and the image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed; a first score obtaining module, configured to obtain a plurality of first matching scores according to the plurality of sentence attention weights and a subject feature, a position feature, and a relationship feature of an image to be analyzed, where the image to be analyzed includes a plurality of objects, a subject object is an object with a highest attention weight among the plurality of objects, the subject feature is a feature of the subject object, the position feature is a position feature of the plurality of objects, and the relationship feature is a relationship feature between the plurality of objects; a second score obtaining module, configured to obtain a second matching score between the to-be-analyzed descriptive sentence and the to-be-analyzed image according to the plurality of first matching scores and the plurality of image attention weights; and the result determining module is used for determining the positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score.
In one possible implementation manner, the first weight obtaining module includes: the image feature extraction submodule is used for extracting features of the image to be analyzed to obtain an image feature vector of the image to be analyzed; the word segmentation feature extraction submodule is used for extracting features of the descriptive sentence to be analyzed to obtain word segmentation embedded vectors of a plurality of word segmentations of the descriptive sentence to be analyzed; and the first weight obtaining submodule is used for obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the participles.
In one possible implementation, the apparatus further includes: and the second weight obtaining module is used for obtaining a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed through a neural network.
In one possible implementation, the plurality of sentence attention weights includes a sentence subject weight, a sentence position weight, and a sentence relationship weight, the neural network includes an image attention network, the image attention network includes a subject network, a position network, and a relationship network, the plurality of first matching scores includes a subject matching score, a position matching score, and a relationship matching score, and the first score obtaining module includes:
the first score obtaining sub-module is used for inputting the subject weight and the subject characteristics of the statement into the subject network for processing to obtain the subject matching score; the second score obtaining submodule is used for inputting the sentence position weight and the position characteristics into the position network for processing to obtain the position matching score; and the third score obtaining submodule is used for inputting the statement relation weight and the relation characteristics into the relation network for processing to obtain the relation matching score.
In one possible implementation manner, the plurality of image attention weights include a subject object weight, an object position weight, and an object relationship weight, and the second score obtaining module includes: and the fourth score obtaining sub-module is used for carrying out weighted average on the subject matching score, the position matching score and the relation matching score according to the subject object weight, the object position weight and the object relation weight so as to determine the second matching score.
In one possible implementation, the apparatus further includes: and the third weight obtaining module is used for inputting the image to be analyzed into a feature extraction network for processing to obtain the main body feature, the position feature and the relation feature.
In one possible implementation, the result determination module includes: and the position determining submodule is used for determining the image area of the main object as the positioning position of the descriptive statement to be analyzed under the condition that the second matching score is greater than or equal to a preset threshold value.
In a possible implementation manner, the apparatus further includes: a training module, configured to train the neural network by using a sample set, wherein the sample set includes a plurality of positive sample pairs and a plurality of negative sample pairs, each positive sample pair includes a first sample image and a first sample description sentence thereof, and each negative sample pair includes the first sample image and a second sample description sentence obtained by removing a participle from the first sample description sentence, or the first sample description sentence and a second sample image obtained by removing a region from the first sample image.
In one possible implementation, the neural network further includes a language attention network, and the apparatus further includes: a word segmentation weight determination module, configured to input the first sample description sentence and the first sample image of the positive sample pair into the language attention network, so as to obtain attention weights of a plurality of word segments of the first sample description sentence; the word segmentation replacement module is used for replacing the word segmentation with the highest attention weight in the first sample description sentence by adopting a preset mark to obtain a second sample description sentence; a first negative sample pair determination module for regarding the first sample image and the second sample description sentence as a negative sample pair.
In one possible implementation, the apparatus further includes: an image weight determining module, configured to input the first sample description statement and the first sample image of the positive sample pair into the image attention network, so as to obtain an attention weight of the first sample image; the region removing module is used for removing an image region with the highest attention weight in the first sample image to obtain a second sample image; a second negative sample pair determination module for regarding the second sample image and the first sample description sentence as a negative sample pair.
In one possible implementation, the training module includes: the overall loss determining sub-module is used for determining the overall loss of the neural network according to the first loss and the second loss of the neural network; and the training sub-module is used for training the neural network according to the total loss.
In a possible implementation manner, the training module further includes a first loss obtaining sub-module for obtaining the first loss, and the first loss obtaining sub-module is configured to:
inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score; inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score; a first loss is obtained based on the plurality of first training scores and the plurality of second training scores.
In a possible implementation manner, the training module further includes a second loss obtaining sub-module for obtaining the second loss, and the second loss obtaining sub-module is configured to:
inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score; inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score; inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score; inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score; and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
In one possible implementation, the overall loss determination sub-module is configured to: and performing weighted superposition on the first loss and the second loss to obtain the total loss of the neural network.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, the sentence attention weight of the descriptive sentence to be analyzed and the image attention weight of the image to be analyzed can be obtained; obtaining a plurality of first matching scores according to the sentence attention weight and the main body characteristic, the position characteristic and the relation characteristic of the image; obtaining a second matching score according to the first matching score and the image attention weight; and determining a positioning result according to the second matching score, thereby fully finding the corresponding relation between the text and the visual semantics and improving the positioning accuracy of the descriptive sentence in the image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a descriptive sentence localization method of an image according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a neural network according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of obtaining a second sample descriptive statement in accordance with an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of obtaining a second sample image according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of a descriptive sentence locating apparatus of an image according to an embodiment of the present disclosure.
Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In various embodiments of the present disclosure, the method for locating a descriptive sentence in an image may be performed by an electronic device such as a terminal device or a server. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server.
Fig. 1 shows a flowchart of a descriptive sentence localization method of an image according to an embodiment of the present disclosure. The method comprises the following steps:
in step S11, the descriptive sentence to be analyzed and the image to be analyzed are analyzed, and a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed are obtained.
In one possible implementation, a plurality of objects (persons, animals, objects, etc.) may be included in the image to be analyzed, such as a plurality of people riding a horse. The descriptive sentence to be analyzed may be a description for a certain object in the image to be analyzed, such as "a brown horse in the middle, which is ridden by girls". The image to be analyzed and the descriptive sentence to be analyzed may or may not correspond to each other. The association between the sentence and the image may be determined according to the method of the embodiments of the present disclosure.
In a possible implementation manner, the plurality of sentence attention weights of the descriptive sentence to be analyzed may include a sentence subject weight, a sentence position weight, and a sentence relation weight, which are respectively used for representing attention weights corresponding to different types of participles of the descriptive sentence to be analyzed.
In a possible implementation manner, the plurality of image attention weights of the image to be analyzed may include a subject object weight, an object position weight, and an object relationship weight, which are respectively used for representing attention weights corresponding to different types of image regions of the image to be analyzed.
In step S12, a plurality of first matching scores are obtained according to the plurality of sentence attention weights and a subject feature, a position feature and a relationship feature of an image to be analyzed, where the image to be analyzed includes a plurality of objects, a subject object is an object with the highest attention weight among the plurality of objects, the subject feature is a feature of the subject object, the position feature is a position feature of the plurality of objects, and the relationship feature is a relationship feature among the plurality of objects.
In one possible implementation, the image to be analyzed includes a plurality of objects (human, animal, object, etc.), and the subject object is an object with the highest attention weight among the plurality of objects. The subject feature is an image feature of the subject object itself, the position feature is a position feature representing a relative position between the plurality of objects, and the relationship feature is a relationship feature representing a relative relationship between the plurality of objects.
In one possible implementation, the plurality of first match scores may include a subject match score, a location match score, and a relationship match score. The subject matching score is used for evaluating the matching degree between a subject object in the image to be analyzed and the object description of the descriptive sentence to be analyzed; the position matching score evaluates the matching degree between the relative positions of a plurality of objects in the image to be analyzed and the position description of the descriptive sentence to be analyzed; the relationship matching score is used for evaluating the matching degree between the relevance of a plurality of objects in the image to be analyzed and the relevance description of the descriptive sentence to be analyzed.
In step S13, a second matching score between the descriptive sentence to be analyzed and the image to be analyzed is obtained according to the plurality of first matching scores and the plurality of image attention weights.
In a possible implementation manner, according to the subject matching score, the position matching score, and the relationship matching score, and the subject object weight, the object position weight, and the object relationship weight, a second matching score between the sentence to be described and the image to be analyzed may be obtained. The second matching score is used for evaluating the overall matching degree between the image to be analyzed and the descriptive sentence to be analyzed.
In step S14, according to the second matching score, a positioning result of the descriptive sentence to be analyzed in the image to be analyzed is determined.
In a possible implementation manner, after the second matching score is obtained, the positioning position of the descriptive sentence to be analyzed in the image to be analyzed may be further determined, so as to implement the positioning of the descriptive sentence in the image.
According to the embodiment of the disclosure, the sentence attention weight of the descriptive sentence to be analyzed and the image attention weight of the image to be analyzed can be obtained; obtaining a plurality of first matching scores according to the sentence attention weight and the main body characteristic, the position characteristic and the relation characteristic of the image; obtaining a second matching score according to the first matching score and the image attention weight; and determining a positioning result according to the second matching score, thereby fully finding the corresponding relation between the text and the visual semantics and improving the positioning accuracy of the descriptive sentence in the image.
In a possible implementation manner, in step S11, the descriptive sentence to be analyzed and the image to be analyzed may be analyzed, and a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed are obtained. Wherein, the step S11 may include:
extracting the features of the image to be analyzed to obtain an image feature vector of the image to be analyzed;
performing feature extraction on the descriptive statement to be analyzed to obtain participle embedded vectors of a plurality of participles of the descriptive statement to be analyzed;
and obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the plurality of participles.
For example, feature extraction may be performed on the image to be analyzed and the descriptive sentence to be analyzed respectively. For the image to be analyzed, feature extraction may be performed on all pixel points of the image to be analyzed to obtain an image feature vector e_0 of the image to be analyzed. The present disclosure does not limit the feature extraction manner of the image to be analyzed.
In a possible implementation manner, for the descriptive sentence to be analyzed, word segmentation processing may be performed on the sentence to determine a plurality of participles of the descriptive sentence to be analyzed, and feature extraction may be performed on each participle to obtain word segmentation embedding vectors (word embeddings) {e_t}, t = 1, …, T, of the plurality of participles, where T represents the number of participles (T is an integer greater than 1), e_t represents the embedding vector of the t-th participle, and 1 ≤ t ≤ T. The present disclosure does not limit the specific word segmentation manner of the descriptive sentence to be analyzed or the specific manner of extracting features from each participle.
In a possible implementation manner, according to the determined image feature vector and the participle embedding vector of the plurality of participles, a plurality of sentence attention weights of the to-be-analyzed description sentence and a plurality of image attention weights of the to-be-analyzed image can be determined.
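As a rough, non-authoritative sketch of the two feature-extraction steps above (the backbone network, the vocabulary size, and the embedding dimension are assumptions chosen only for the example):

```python
import torch.nn as nn
import torchvision.models as models

class FeatureExtractor(nn.Module):
    """Sketch: extract an image feature vector e_0 and word embedding
    vectors {e_t} for the T participles of the descriptive sentence."""
    def __init__(self, vocab_size=10000, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)            # assumed backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(2048, embed_dim)           # project to e_0
        self.word_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, image, token_ids):
        # image: (B, 3, H, W); token_ids: (B, T)
        e0 = self.img_proj(self.cnn(image).flatten(1))       # (B, embed_dim)
        e_t = self.word_embed(token_ids)                      # (B, T, embed_dim)
        return e0, e_t
```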
In a possible implementation manner, the method may further include the step of obtaining, through a neural network, the plurality of sentence attention weights of the descriptive sentence to be analyzed and the plurality of image attention weights of the image to be analyzed. The neural network may include a language attention network, and the language attention network may be implemented by a recurrent neural network (RNN), a long short-term memory (LSTM) network, or another network.
In a possible implementation manner, the image to be analyzed and the descriptive sentence to be analyzed may be input into the language attention network for processing to obtain the plurality of sentence attention weights and the plurality of image attention weights.
For example, feature extraction may be performed through a feature extraction sub-network of the language attention network to obtain the image feature vector e_0 and the word embedding vectors {e_t}, respectively. The feature extraction sub-network may be a convolutional neural network (CNN), for example a Faster R-CNN.
In one possible implementation, the language attention network may include an LSTM network based on the attention mechanism. The image feature vector e_0 may be used as the first-stage input of the LSTM network, and the word embedding vectors {e_t} may be used as the inputs of the subsequent stages of the LSTM network, so as to obtain the output states h_t of the multiple hidden layers of the LSTM network.
In one possible implementation, the plurality of states h_t may be used to calculate the image attention weights and the attention weight of each participle; the word embedding vectors {e_t} of the plurality of participles are then weighted and summed according to the attention weights of the plurality of participles, so that the sentence attention weights can be obtained.
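A minimal sketch of such an attention-based LSTM is shown below. It assumes three word-level attention heads (subject, position, relation) and a softmax over the final hidden state for the three module-level image attention weights; the dimensions and pooling choices are assumptions, not the patent's definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttentionNetwork(nn.Module):
    """Sketch of the language attention network: an LSTM reads e_0 followed
    by the word embeddings; its hidden states yield word-level attention
    (subject / position / relation) and three module-level image weights."""
    def __init__(self, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_attn = nn.Linear(hidden_dim, 3)    # one head per module
        self.module_attn = nn.Linear(hidden_dim, 3)  # module-level weights

    def forward(self, e0, e_t):
        # e0: (B, D), e_t: (B, T, D)
        seq = torch.cat([e0.unsqueeze(1), e_t], dim=1)        # (B, T+1, D)
        h, _ = self.lstm(seq)                                 # (B, T+1, H)
        h_words = h[:, 1:, :]                                 # states aligned with words
        attn = F.softmax(self.word_attn(h_words), dim=1)      # (B, T, 3)
        # sentence attention weights as attention-weighted sums of word embeddings
        q_subj = (attn[..., 0:1] * e_t).sum(dim=1)
        q_loc  = (attn[..., 1:2] * e_t).sum(dim=1)
        q_rel  = (attn[..., 2:3] * e_t).sum(dim=1)
        # image (module-level) attention weights from the final hidden state
        w = F.softmax(self.module_attn(h[:, -1, :]), dim=-1)  # (B, 3)
        w_subj, w_loc, w_rel = w[:, 0], w[:, 1], w[:, 2]
        return (q_subj, q_loc, q_rel), (w_subj, w_loc, w_rel), attn
```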
In one possible implementation, the plurality of sentence attention weights of the descriptive sentence to be analyzed are word-level attention weights, and may include a sentence subject weight q_subj, a sentence position weight q_loc, and a sentence relation weight q_rel, which are respectively used for representing the attention weights corresponding to different types of participles of the descriptive sentence to be analyzed.
The sentence subject weight is used to represent the attention weight when the subject participle in the sentence is attended to, for example, the attention weight of the subject participle "brown horse" or "horse" in the sentence "a brown horse in the middle, which is ridden by girls"; the sentence position weight is used to represent the attention weight when a participle representing position in the sentence is attended to, for example, the attention weight of the position participle "in the middle" in the above sentence; and the sentence relation weight is used to represent the attention weight when a participle representing a relationship between objects in the sentence is attended to, for example, the attention weight of the relation participle "ridden by girls" in the above sentence.
In one possible implementation, the plurality of image attention weights of the image to be analyzed are module-level attention weights, and may include a subject object weight ω_subj, an object position weight ω_loc, and an object relation weight ω_rel, which are respectively used for representing the attention weights corresponding to different types of image regions of the image to be analyzed.
Wherein the subject object weight may represent an attention weight when paying attention to the most important object (subject object) among a plurality of objects (human, animal, object, etc.) in the image, such as a person in the middle of the image; the object position weight may represent an attention weight when paying attention to relative positions of a plurality of objects in the image, such as the middle, left, and right positions of the image; the object relationship weight may represent an attention weight when attention is paid to the association between a plurality of objects in an image, for example, a person riding a horse in the middle, left, and right sides of the image.
In this way, different types of information in vision (images) and texts (sentences) can be captured through the language attention network, so that the corresponding relation between the images and the sentences in various aspects is found, and the processing precision is improved.
In one possible implementation, before step S12, the method further includes: and inputting the image to be analyzed into a feature extraction network for processing to obtain the main feature, the position feature and the relation feature of the image to be analyzed.
For example, the feature extraction network may be one or more predetermined convolutional neural networks (CNN), for example a Faster R-CNN, used for extracting the subject feature, the position feature, and the relationship feature of the image to be analyzed. All pixel points of the image to be analyzed may be input into the feature extraction network, and the feature map before ROI pooling is taken as the overall image feature of the image to be analyzed.
In one possible implementation, with respect to the subject feature, a plurality of objects in the image to be analyzed may be identified, the object with the highest attention weight among the plurality of objects may be extracted as the subject object, and the feature map of the region where the subject object is located may be determined as the subject feature; for example, a 7 × 7 feature map is extracted as the subject feature.
In one possible implementation, with respect to the position feature, the position feature may be obtained according to the relative position offsets and relative areas between the image regions where the plurality of objects are located in the image to be analyzed, as well as the position and relative area of each object itself.
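One common way to realise such a position feature, shown here only as an assumption-laden sketch, is to concatenate the object's normalised box and relative area with offsets to a few surrounding objects; the exact encoding used by the patent may differ.

```python
import torch

def location_feature(box, image_size, neighbor_boxes):
    """Sketch: position feature from the object's own normalized box and
    relative area plus relative offsets/areas to up to 5 surrounding objects."""
    x1, y1, x2, y2 = box
    W, H = image_size
    own = torch.tensor([x1 / W, y1 / H, x2 / W, y2 / H,
                        (x2 - x1) * (y2 - y1) / (W * H)])
    w, h = x2 - x1, y2 - y1
    rel = []
    for nx1, ny1, nx2, ny2 in neighbor_boxes[:5]:          # up to 5 neighbours
        rel.append(torch.tensor([(nx1 - x1) / w, (ny1 - y1) / h,
                                 (nx2 - x2) / w, (ny2 - y2) / h,
                                 (nx2 - nx1) * (ny2 - ny1) / (w * h)]))
    rel += [torch.zeros(5)] * (5 - len(rel))               # pad if fewer than 5
    return torch.cat([own] + rel)                          # 30-d feature

# Example: location_feature((50, 40, 120, 200), (640, 480), [(10, 30, 60, 180)])
```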
In one possible implementation, with respect to the relationship feature, the relationship features between the subject object and its context objects may be determined from the average-pooled visual features within the region proposals, together with the relative position offsets and relative areas between the regions.
It should be understood that the present disclosure is not limited to the specific manner of extracting the main body feature, the position feature and the relationship feature of the image to be analyzed.
In one possible implementation manner, a plurality of first matching scores may be obtained in step S12 according to the plurality of sentence attention weights and the subject feature, the position feature and the relationship feature of the image to be analyzed.
For example, the plurality of first matching scores may be obtained through a neural network. The neural network may include an image attention network, and the image attention network includes a subject network, a position network, and a relationship network. The subject network, the position network, and the relationship network may each be a pre-constructed convolutional neural network (CNN).
The subject network is used for evaluating the matching degree between the most important object (subject object) in a plurality of objects (people, animals, objects and the like) in the image to be analyzed and the object description of the descriptive statement to be analyzed; the position network is used for evaluating the matching degree between the relative positions of a plurality of objects in the image to be analyzed and the position description of the descriptive statement to be analyzed; the relation network is used for evaluating the matching degree between the relevance of a plurality of objects in the image to be analyzed and the relevance description of the descriptive statement to be analyzed.
In a possible implementation manner, the plurality of sentence attention weights and the main body feature, the position feature and the relationship feature of the image to be analyzed can be respectively input into a main body network, a position network and a relationship network for processing, so as to evaluate the matching degree of various aspects of the image and the sentence.
The subject object is the object with the highest attention weight among the plurality of objects in the image to be analyzed, the subject feature is the feature of the subject object, the position feature is the position feature of the plurality of objects, and the relationship feature is the relationship feature among the plurality of objects.
In one possible implementation, the plurality of first matching scores obtained in step S12 may include a subject matching score, a location matching score, and a relationship matching score.
In one possible implementation, step S12 may include: inputting the sentence subject weight and the subject characteristics into the subject network for processing to obtain a subject matching score; inputting the sentence position weight and the position characteristics into the position network for processing to obtain a position matching score; and inputting the statement relation weight and the relation characteristics into the relation network for processing to obtain a relation matching score.
In this embodiment, the sentence subject weight and the subject feature are input into the subject network, so that the matching degree between the subject of the descriptive sentence to be analyzed and the subject object of the image to be analyzed can be analyzed to obtain the subject matching score; the sentence position weight and the position feature are input into the position network, so that the matching degree between the position participles of the descriptive sentence to be analyzed and the relative positions of the plurality of objects in the image to be analyzed can be analyzed to obtain the position matching score; and the sentence relation weight and the relationship feature are input into the relationship network, so that the matching degree between the relation participles of the descriptive sentence to be analyzed and the relevance of the plurality of objects in the image to be analyzed can be analyzed to obtain the relationship matching score.
For example, the plurality of sentence attention weights (the sentence subject weight q_subj, the sentence position weight q_loc, and the sentence relation weight q_rel) and the plurality of object features (the subject feature, the position feature, and the relationship feature) may be respectively input into the subject network, the position network, and the relationship network for processing.
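An illustrative sketch of one such module is given below: a subject network that embeds the visual subject feature and the sentence subject weight (phrase embedding) into a joint space and scores their similarity; the position and relationship networks could follow the same pattern. The layer sizes and the cosine-style scoring are assumptions, not details taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    """Sketch of one image-attention module (e.g. the subject network):
    embed the visual feature and the phrase vector into a common space
    and return their similarity as the first matching score."""
    def __init__(self, visual_dim, phrase_dim, joint_dim=512):
        super().__init__()
        self.vis_fc = nn.Sequential(nn.Linear(visual_dim, joint_dim), nn.ReLU(),
                                    nn.Linear(joint_dim, joint_dim))
        self.txt_fc = nn.Linear(phrase_dim, joint_dim)

    def forward(self, visual_feat, phrase_vec):
        v = F.normalize(self.vis_fc(visual_feat), dim=-1)
        q = F.normalize(self.txt_fc(phrase_vec), dim=-1)
        return (v * q).sum(dim=-1)   # first matching score, shape (B,)

# Example (assumed dimensions):
# subject_net = MatchingModule(visual_dim=2048, phrase_dim=512)
# s_subj = subject_net(subject_feature, q_subj)
```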
By the method, the matching degree of each aspect of the image and the descriptive sentence can be determined, and the accuracy of matching judgment is improved.
In one possible implementation manner, a second matching score between the descriptive sentence to be analyzed and the image to be analyzed may be obtained in step S13 according to the plurality of first matching scores and the plurality of image attention weights. That is, the second matching score between the descriptive sentence to be analyzed and the image to be analyzed may be obtained according to the subject matching score, the position matching score, and the relationship matching score, together with the subject object weight ω_subj, the object position weight ω_loc, and the object relation weight ω_rel.
Wherein, the step S13 may include:
and carrying out weighted average on the subject matching score, the position matching score and the relation matching score according to the subject object weight, the object position weight and the object relation weight, and determining the second matching score.
For example, after the subject matching score, the position matching score, and the relationship matching score are obtained, the subject object weight ω_subj, the object position weight ω_loc, and the object relation weight ω_rel may be used to weight the subject matching score, the position matching score, and the relationship matching score respectively, and the weighted scores are summed and then averaged. The average value may be determined as the second matching score between the descriptive sentence to be analyzed and the image to be analyzed.
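Expressed as a minimal sketch, this combination step amounts to a weighted average of the three first matching scores:

```python
def second_matching_score(s_subj, s_loc, s_rel, w_subj, w_loc, w_rel):
    """Sketch: weighted average of the three first matching scores using
    the module-level image attention weights."""
    total = w_subj + w_loc + w_rel
    return (w_subj * s_subj + w_loc * s_loc + w_rel * s_rel) / total
```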
In this way, an accurate matching score between the descriptive sentence to be analyzed and the image to be analyzed can be obtained.
In one possible implementation manner, the positioning result of the descriptive sentence to be analyzed in the image to be analyzed may be determined according to the second matching score in step S14. That is, after the second matching score is obtained, the positioning result of the descriptive sentence to be analyzed in the image to be analyzed may be further determined. Wherein, the step S14 may include:
and determining the image area of the main object as the positioning position of the descriptive statement to be analyzed when the second matching score is greater than or equal to a preset threshold value.
For example, a threshold of the matching score may be preset (for example, the preset threshold is 70 points), if the second matching score is greater than or equal to the preset threshold, the descriptive sentence to be analyzed may be regarded as the description of the subject object in the image to be analyzed, and the image area where the subject object is located may be determined as the location position of the descriptive sentence to be analyzed. On the contrary, if the second matching score is smaller than the preset threshold, the descriptive statement to be analyzed may be considered not to be a description of the subject object in the image to be analyzed, and the positioning result may be determined as being unable to correspond to the subject object. It should be understood that the preset threshold value can be set by a person skilled in the art according to practical situations, and the specific value of the preset threshold value is not limited by the present disclosure.
In one possible implementation, a plurality of subject objects may be set in the image to be analyzed, subject features of each subject object may be input into the image attention network for processing, a second matching score of each subject object may be determined, and a highest score of the plurality of second matching scores may be determined. In this case, the descriptive sentence to be analyzed may be regarded as a description of the subject object corresponding to the highest score, and the image area where the subject object is located may be determined as the positioning position of the descriptive sentence to be analyzed.
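The selection logic of the two preceding paragraphs may be illustrated by the following sketch; the candidate-region format, the threshold value and all names are assumptions introduced only for illustration.

    # Sketch: pick the subject object with the highest second matching score and
    # accept it only if the score clears a preset threshold.
    from typing import List, Optional, Tuple

    Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) image region

    def locate(second_scores: List[float], regions: List[Box],
               threshold: float = 0.7) -> Optional[Box]:
        best_idx = max(range(len(second_scores)), key=lambda i: second_scores[i])
        if second_scores[best_idx] >= threshold:
            return regions[best_idx]
        return None  # no candidate matches the sentence well enough

    # Example with three candidate subject objects:
    print(locate([0.42, 0.81, 0.37],
                 [(0, 0, 50, 80), (60, 10, 120, 90), (130, 5, 180, 70)]))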
In this way, the precise positioning of the descriptive sentence to be analyzed in the image to be analyzed can be realized.
Fig. 2 shows a schematic diagram of a neural network according to an embodiment of the present disclosure. As shown in fig. 2, the neural network may include a language attention network 21 and an image attention network including a subject network 22, a location network 23, and a relationship network 24.
In this example, the descriptive sentence to be analyzed "brown horse in the middle which is ridden by girls" 201 and the image to be analyzed 202 are input into the language attention network 21 for processing, which may output three image attention weights (the subject object weight ω_subj, the object position weight ω_loc and the object relation weight ω_rel) and, at the same time, three sentence attention weights (the sentence subject weight q_subj, the sentence position weight q_loc and the sentence relation weight q_rel).
In this example, the subject feature 203, the position feature 204, and the relationship feature 205 of the image to be analyzed may be obtained through a feature extraction network (not shown).
In this example, the sentence subject weight q_subj and the subject feature 203 are input into the subject network 22 for processing, and the subject matching score can be obtained; the sentence position weight q_loc and the position feature 204 are input into the location network 23 for processing, and the position matching score can be obtained; the sentence relation weight q_rel and the relationship feature 205 are input into the relationship network 24 for processing, and the relationship matching score can be obtained.
In this example, the subject matching score, the position matching score and the relationship matching score are respectively weighted according to the subject object weight ω_subj, the object position weight ω_loc and the object relation weight ω_rel, the weighted scores are summed and averaged to obtain a second matching score 206, and the positioning result of the descriptive sentence to be analyzed in the image to be analyzed is determined according to the second matching score 206, thereby completing the whole implementation process of steps S11-S14.
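The overall flow of Fig. 2 may be summarized by the following schematic sketch, assuming the language attention network, the feature extraction network and the three sub-networks of the image attention network are available as callables; all names are illustrative and do not denote components defined by the present disclosure.

    # Schematic sketch of the Fig. 2 scoring pipeline (illustrative assumptions only).
    def second_matching_score(sentence, image,
                              language_attention, feature_extractor,
                              subject_net, location_net, relation_net):
        # Language attention network: three sentence weights and three image weights.
        (q_subj, q_loc, q_rel), (w_subj, w_loc, w_rel) = language_attention(sentence, image)

        # Feature extraction network: subject, position and relationship features.
        subj_feat, loc_feat, rel_feat = feature_extractor(image)

        # Image attention network: one first matching score per aspect.
        s_subj = subject_net(q_subj, subj_feat)
        s_loc = location_net(q_loc, loc_feat)
        s_rel = relation_net(q_rel, rel_feat)

        # Weighted combination into the second matching score.
        return (w_subj * s_subj + w_loc * s_loc + w_rel * s_rel) / (w_subj + w_loc + w_rel)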
It should be understood that the above is only one example of a neural network that implements the methods of the present disclosure, and that the present disclosure is not limited to a particular type of neural network.
In one possible implementation, before step S11, the method further includes: training the neural network with a sample set, the sample set comprising a plurality of positive sample pairs and a plurality of negative sample pairs.
Wherein each positive sample pair comprises a first sample image and a first sample description sentence thereof,
each negative sample pair comprises a first sample image and a second sample descriptive sentence with a word segmentation removed from the first sample descriptive sentence, or the first sample descriptive sentence and the second sample image with a region removed from the first sample image.
In one possible implementation manner, the visual or text information with high attention weight can be removed through a cross-modal removal manner based on attention guidance to obtain the removed training samples (the second sample description sentence and the second sample image), so that the training precision is improved.
For example, a sample set including a plurality of training samples may be preset in order to train a neural network. The sample set includes a plurality of positive sample pairs, each positive sample pair including a first sample image O and a first sample description sentence Q thereof. A sentence describing an object in the first sample image may be taken as the first sample description sentence in the same positive sample pair. The sample set can also comprise a plurality of negative sample pairs, and each negative sample pair comprises a first sample image and a second sample descriptive sentence with a word segmentation removed from the first sample descriptive sentence, or the first sample descriptive sentence and the second sample image with a region removed from the first sample image. The present disclosure does not limit the specific establishment manner of the sample set, and the present disclosure does not limit the sequence between the sample image and the sample description sentence in each sample pair.
In one possible implementation, the method may further include:
inputting a first sample description sentence and a first sample image of the positive sample pair into the language attention network to obtain attention weights of a plurality of word segments of the first sample description sentence;
replacing the participle with the highest attention weight in the first sample description sentence by a preset mark to obtain a second sample description sentence;
and taking the first sample image and the second sample description sentence as a negative sample pair.
In a possible implementation mode, attention guidance can be carried out through the language attention network to remove the most important text information and obtain a hard text training sample, so that the neural network is prevented from relying excessively on specific text information (particular participles), and the precision of the trained neural network is improved.
Fig. 3 shows a schematic diagram of obtaining a second sample descriptive statement in accordance with an embodiment of the present disclosure. For example, as shown in fig. 3, a first sample description sentence of a positive sample pair (e.g., "brown horse in the middle which is ridden by girls") and a first sample image (e.g., a picture that includes multiple people riding horses) may be input into the language attention network, resulting in the attention weights of the multiple participles of the first sample description sentence. From the attention weights of the individual participles, the participle with the highest attention weight (e.g., "middle") may be determined. Since directly removing the word "middle" may cause grammatical errors and make the sentence unrecognizable, the word "middle" may instead be replaced with an "unknown" mark, resulting in a second sample description sentence (e.g., "brown horse in the 'unknown' which is ridden by girls"), such that the first sample image and the second sample description sentence may be regarded as a negative sample pair.
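A minimal sketch of this attention-guided word replacement is given below; the whitespace tokenisation, the example attention weights and the "<unk>" placeholder are assumptions made only for illustration.

    # Sketch: replace the highest-attention participle with a placeholder token.
    from typing import List

    def erase_top_word(tokens: List[str], attention: List[float],
                       unk: str = "<unk>") -> List[str]:
        # Replace (rather than delete) the top-weighted token so the sentence
        # stays grammatical and can still be processed.
        top = max(range(len(tokens)), key=lambda i: attention[i])
        erased = list(tokens)
        erased[top] = unk
        return erased

    tokens = "brown horse in the middle which is ridden by girls".split()
    attention = [0.05, 0.15, 0.02, 0.03, 0.40, 0.05, 0.02, 0.13, 0.05, 0.10]
    print(erase_top_word(tokens, attention))
    # ['brown', 'horse', 'in', 'the', '<unk>', 'which', 'is', 'ridden', 'by', 'girls']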
In one possible implementation, the method may further include:
inputting a first sample description sentence and a first sample image of the positive sample pair into the image attention network to obtain the attention weight of the first sample image;
removing an image area with the highest attention weight in the first sample image to obtain a second sample image;
and taking the second sample image and the first sample description sentence as a negative sample pair.
In a possible implementation mode, the most important visual information can be identified and removed through the image attention network, and a difficult image training sample is obtained, so that the neural network is prevented from excessively depending on specific visual information, and the precision of the trained neural network is improved.
Fig. 4 shows a schematic diagram of obtaining a second sample image according to an embodiment of the present disclosure. For example, as shown in fig. 4, a first sample image (e.g., a picture including a plurality of people riding horses) and a first sample descriptive sentence (e.g., "brown horse in the middle which is ridden by girls") of a positive sample pair may be input into the image attention network for processing. The image attention network may be the subject network, the location network or the relationship network, which is not limited in this disclosure.
In one possible implementation, the attention weight of each region of the first sample image can be obtained by inputting the first sample image and the first sample description sentence into the subject network. From the attention weights of the respective regions, a target region with the highest attention weight (e.g., an image region where a girl is located in the middle) can be determined. Removing the target region from the first sample image may result in a second sample image O (as shown in fig. 4), such that the second sample image and the first sample description sentence may be regarded as a negative sample pair.
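The region erasing described above may be illustrated by the following sketch, which blanks the most attended region of an image array; the region format and the choice of filling the region with zeros are assumptions.

    # Sketch: remove the image region with the highest attention weight.
    import numpy as np

    def erase_top_region(image: np.ndarray, regions, attention):
        # image: HxWxC array; regions: list of (x1, y1, x2, y2); attention: one weight per region.
        top = int(np.argmax(attention))
        x1, y1, x2, y2 = regions[top]
        erased = image.copy()
        erased[y1:y2, x1:x2, :] = 0  # blank out the most attended visual evidence
        return erased

    img = np.ones((100, 160, 3), dtype=np.uint8) * 255
    second_sample = erase_top_region(img, [(10, 10, 60, 90), (70, 5, 150, 95)], [0.3, 0.7])
    print(second_sample[50, 100])  # pixel inside the erased region is now zero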
In one possible implementation, the step of training the neural network with a sample set may include: determining an overall loss of the neural network from the first loss and the second loss of the neural network.
In one possible implementation, the network loss of the positive sample pair (the first sample image and its first sample description sentence) may be obtained as the first loss. And acquiring the network loss of the removed negative sample pair (the second sample image and the first sample description sentence, or the first sample image and the second sample description sentence).
In one possible implementation, the step of training the neural network using the sample set may further include: training the neural network according to the overall loss.
In one possible implementation, the neural network may be trained based on the total network loss L after the total network loss L is obtained.
In one possible implementation, before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the first loss.
The step of obtaining the first loss comprises:
inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score; inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score; a first loss is obtained based on the plurality of first training scores and the plurality of second training scores.
For example, the network loss of a positive sample pair (a first sample image and its first sample description sentence) may be obtained. For any positive sample pair (O_i, Q_i) in the training set, the first sample image O_i and the first sample description sentence Q_i of the same positive sample pair may be input into the neural network shown in FIG. 2 and processed to obtain a first training score s(O_i, Q_i), where i is the sample index, 1 ≤ i ≤ N, and N is the number of positive sample pairs in the sample set.
In one possible implementation, a first sample image and a non-corresponding first sample description sentence from different positive sample pairs (O_i, Q_j) can be input into the neural network shown in FIG. 2 for processing to obtain a second training score s(O_i, Q_j), where j is the sample index, 1 ≤ j ≤ N, and j ≠ i. Likewise, a first sample image and a first sample description sentence of different positive sample pairs (O_j, Q_i) may be input into the neural network to obtain another second training score s(O_j, Q_i).
In one possible implementation, the positive sample pairs (the first sample images and the first sample description sentences) in the training set are processed respectively to obtain a plurality of first training scores and a plurality of second training scores, and further to obtain a first loss L_rank of the original samples:

L_rank = Σ_i ( [m + s(O_i, Q_j) − s(O_i, Q_i)]_+ + [m + s(O_j, Q_i) − s(O_i, Q_i)]_+ )    (1)

In equation (1), the operator [x]_+ represents taking the maximum value between x and 0, that is, taking the value of x when x is greater than 0 and taking 0 when x is less than or equal to 0; m may be a constant representing the margin of the network loss. It should be understood that a person skilled in the art can set the value of m (e.g. 0.1) according to practical situations, and the specific value of m is not limited by the present disclosure.
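Under the assumption that the scores s(O_i, Q_i), s(O_i, Q_j) and s(O_j, Q_i) have already been computed for a batch of samples, the hinge-style loss of equation (1) may be sketched as follows using PyTorch tensors; the batching and the variable names are illustrative assumptions.

    # Sketch of the ranking loss of equation (1).
    import torch

    def ranking_loss(s_pos, s_sent_neg, s_img_neg, margin: float = 0.1):
        # s_pos = s(O_i, Q_i); s_sent_neg = s(O_i, Q_j); s_img_neg = s(O_j, Q_i)
        zero = torch.zeros_like(s_pos)
        loss = (torch.max(zero, margin + s_sent_neg - s_pos)
                + torch.max(zero, margin + s_img_neg - s_pos))
        return loss.sum()

    # Example with a batch of three positive pairs:
    s_pos = torch.tensor([0.9, 0.8, 0.7])
    s_sent_neg = torch.tensor([0.1, 0.6, 0.3])
    s_img_neg = torch.tensor([0.2, 0.5, 0.9])
    print(ranking_loss(s_pos, s_sent_neg, s_img_neg))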
In one possible implementation, before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the second loss;
the step of obtaining the second loss comprises:
inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score; inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score; inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score; inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score; and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
For example, the network loss of the removed negative sample pairs may be obtained. For the same negative sample pair (O_i*, Q_i) in the training set, the second sample image O_i* and the first sample description sentence Q_i may be input into the neural network shown in FIG. 2 and processed to obtain a third training score s(O_i*, Q_i), where i is the sample index, 1 ≤ i ≤ N, and N is the number of sample pairs in the sample set.
In one possible implementation, for different negative sample pairs, a second sample image O_i* and a non-corresponding first sample description sentence Q_j can be input into the neural network shown in FIG. 2 for processing to obtain a fourth training score s(O_i*, Q_j), where j is the sample index, 1 ≤ j ≤ N, and j ≠ i.
Similarly, for the same negative sample pair (O_i, Q_i*), the first sample image O_i and the corresponding second sample description sentence Q_i* are input into the neural network, and a fifth training score s(O_i, Q_i*) can be obtained. For different negative sample pairs, the first sample image O_i and a non-corresponding second sample description sentence Q_j* are input into the neural network to obtain a sixth training score s(O_i, Q_j*).
In one possible implementation, the positive sample pairs (the first sample images and the first sample description sentences) and the removed negative sample pairs in the training set are respectively processed to obtain a plurality of third training scores, a plurality of fourth training scores, a plurality of fifth training scores and a plurality of sixth training scores, and further to obtain a second loss L_erase of the removed samples:

L_erase = Σ_i ( [m + s(O_i*, Q_j) − s(O_i*, Q_i)]_+ + [m + s(O_i, Q_j*) − s(O_i, Q_i*)]_+ )    (2)

In equation (2), the operator [x]_+ represents taking the maximum value between x and 0, that is, taking the value of x when x is greater than 0 and taking 0 when x is less than or equal to 0; m may be a constant representing the margin of the network loss. It should be understood that a person skilled in the art can set the value of m (e.g. 0.1) according to practical situations, and the specific value of m is not limited by the present disclosure.
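A corresponding sketch for equation (2) is given below, assuming the third to sixth training scores have already been computed for a batch of removed samples; the variable names and the pairing of the negatives are assumptions consistent with the description above.

    # Sketch of the second loss of equation (2), computed on the removed samples.
    import torch

    def erase_loss(s_img_erased_pos, s_img_erased_neg,
                   s_sent_erased_pos, s_sent_erased_neg, margin: float = 0.1):
        # s_img_erased_pos  = s(O_i*, Q_i)   (third training scores)
        # s_img_erased_neg  = s(O_i*, Q_j)   (fourth training scores)
        # s_sent_erased_pos = s(O_i, Q_i*)   (fifth training scores)
        # s_sent_erased_neg = s(O_i, Q_j*)   (sixth training scores)
        zero = torch.zeros_like(s_img_erased_pos)
        loss = (torch.max(zero, margin + s_img_erased_neg - s_img_erased_pos)
                + torch.max(zero, margin + s_sent_erased_neg - s_sent_erased_pos))
        return loss.sum()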
In one possible implementation, after determining the first loss and the second loss, an overall loss of the neural network may be determined according to the first loss and the second loss, and the neural network may be trained according to the overall loss.
Wherein determining the overall loss of the neural network from the first loss and the second loss of the neural network may comprise: and performing weighted superposition on the first loss and the second loss to obtain the total loss of the neural network.
For example, the overall network loss L for the neural network may be calculated by the following formula:
L = βL_erase + γL_rank    (3)
In equation (3), γ and β represent the weights of the first loss and the second loss, respectively. It should be understood that the values of β and γ can be set by one skilled in the art according to practical circumstances, and the specific values of β and γ are not limited by the present disclosure.
In one possible implementation, after the total network loss L is obtained, the neural network may be trained based on it. For example, based on the total network loss L, the network parameter values of the neural network may be adjusted by back-propagating the gradient of the loss, and the total network loss L may then be obtained again for the next training iteration.
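A minimal training-step sketch combining the two losses according to equation (3) and adjusting the network parameters by back-propagation is given below; the optimizer and the way the two losses are produced are assumptions, not the claimed implementation.

    # Sketch of one training step using the overall loss L = beta*L_erase + gamma*L_rank.
    import torch

    def train_step(optimizer, l_rank: torch.Tensor, l_erase: torch.Tensor,
                   beta: float = 1.0, gamma: float = 1.0) -> float:
        total = beta * l_erase + gamma * l_rank   # equation (3)
        optimizer.zero_grad()
        total.backward()                          # back-propagate the overall loss
        optimizer.step()                          # adjust the network parameter values
        return float(total.detach())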
According to the descriptive sentence positioning method for images described above, the most salient visual or textual information, i.e. that with a high attention weight, is eliminated in a cross-modal erasing manner to generate hard training samples, so that the neural network model is driven to look for supplementary evidence beyond the most salient evidence. According to the embodiments of the disclosure, an erased image is paired with the original query sentence, or an erased query sentence is paired with the original image, to form hard training samples, so that the neural network model makes better use of the training data to learn the latent text-image correspondence without increasing inference complexity.
According to the embodiments of the disclosure, the method and the device can be applied to terminals such as robots or mobile phones, and a position in an image can be located according to a person's guidance (text or voice), thereby realizing accurate correspondence between texts and images.
It will be understood by those skilled in the art that, in the above methods of the present disclosure, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; due to space limitations, the details are not repeated in this disclosure.
Fig. 5 is a block diagram illustrating an image descriptive sentence positioning apparatus according to an embodiment of the present disclosure, which includes, as shown in fig. 5:
a first weight obtaining module 51, configured to analyze a descriptive sentence to be analyzed and an image to be analyzed, and obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed;
a first score obtaining module 52, configured to obtain a plurality of first matching scores according to the plurality of sentence attention weights and a subject feature, a position feature, and a relationship feature of an image to be analyzed, where the image to be analyzed includes a plurality of objects, a subject object is an object with the highest attention weight among the plurality of objects, the subject feature is a feature of the subject object, the position feature is a position feature of the plurality of objects, and the relationship feature is a relationship feature between the plurality of objects;
a second score obtaining module 53, configured to obtain a second matching score between the to-be-analyzed descriptive sentence and the to-be-analyzed image according to the plurality of first matching scores and the plurality of image attention weights;
and a result determining module 54, configured to determine, according to the second matching score, a positioning result of the descriptive sentence to be analyzed in the image to be analyzed.
In one possible implementation manner, the first weight obtaining module includes:
the image feature extraction submodule is used for extracting features of the image to be analyzed to obtain an image feature vector of the image to be analyzed;
the word segmentation feature extraction submodule is used for extracting features of the descriptive sentence to be analyzed to obtain word segmentation embedded vectors of a plurality of word segmentations of the descriptive sentence to be analyzed;
and the first weight obtaining submodule is used for obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the participles.
In one possible implementation, the apparatus further includes: and the second weight obtaining module is used for obtaining a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed through a neural network.
In one possible implementation, the plurality of sentence attention weights includes a sentence subject weight, a sentence position weight, and a sentence relationship weight, the neural network includes an image attention network, the image attention network includes a subject network, a position network, and a relationship network, the plurality of first matching scores includes a subject matching score, a position matching score, and a relationship matching score, and the first score obtaining module includes:
the first score obtaining sub-module is used for inputting the subject weight and the subject characteristics of the statement into the subject network for processing to obtain the subject matching score;
the second score obtaining submodule is used for inputting the sentence position weight and the position characteristics into the position network for processing to obtain the position matching score;
and the third score obtaining submodule is used for inputting the statement relation weight and the relation characteristics into the relation network for processing to obtain the relation matching score.
In one possible implementation manner, the plurality of image attention weights include a subject object weight, an object position weight, and an object relationship weight, and the second score obtaining module includes:
and the fourth score obtaining sub-module is used for carrying out weighted average on the subject matching score, the position matching score and the relation matching score according to the subject object weight, the object position weight and the object relation weight so as to determine the second matching score.
In one possible implementation, the apparatus further includes:
and the third weight obtaining module is used for inputting the image to be analyzed into a feature extraction network for processing to obtain the main body feature, the position feature and the relation feature.
In one possible implementation, the result determination module includes:
and the position determining submodule is used for determining the image area of the main object as the positioning position of the descriptive statement to be analyzed under the condition that the second matching score is greater than or equal to a preset threshold value.
In a possible implementation manner, before the second weight obtaining module, the method further includes: a training module to train the neural network with a sample set, the sample set including a plurality of positive sample pairs and a plurality of negative sample pairs,
wherein each positive sample pair comprises a first sample image and a first sample description sentence thereof,
each negative sample pair comprises a first sample image and a second sample descriptive sentence with a word segmentation removed from the first sample descriptive sentence, or the first sample descriptive sentence and the second sample image with a region removed from the first sample image.
In one possible implementation, the neural network further includes a language attention network, and the apparatus further includes:
a word segmentation weight determination module, configured to input the first sample description sentence and the first sample image of the positive sample pair into the language attention network, so as to obtain attention weights of a plurality of word segments of the first sample description sentence;
the word segmentation replacement module is used for replacing the word segmentation with the highest attention weight in the first sample description sentence by adopting a preset mark to obtain a second sample description sentence;
a first negative sample pair determination module for regarding the first sample image and the second sample description sentence as a negative sample pair.
In one possible implementation, the apparatus further includes:
an image weight determining module, configured to input the first sample description statement and the first sample image of the positive sample pair into the image attention network, so as to obtain an attention weight of the first sample image;
the region removing module is used for removing an image region with the highest attention weight in the first sample image to obtain a second sample image;
a second negative sample pair determination module for regarding the second sample image and the first sample description sentence as a negative sample pair.
In one possible implementation, the training module includes:
the overall loss determining sub-module is used for determining the overall loss of the neural network according to the first loss and the second loss of the neural network;
and the training sub-module is used for training the neural network according to the total loss.
In a possible implementation manner, before the overall loss determining sub-module, the method further includes: a first loss obtaining sub-module for obtaining the first loss; the first loss acquisition submodule is configured to:
inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score;
inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score;
a first loss is obtained based on the plurality of first training scores and the plurality of second training scores.
In a possible implementation manner, before the overall loss determining sub-module, the method further includes: a second loss obtaining sub-module for obtaining the second loss; the second loss acquisition submodule is configured to:
inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score;
inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score;
inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score;
inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score;
and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
In one possible implementation, the overall loss determination sub-module is configured to:
and performing weighted superposition on the first loss and the second loss to obtain the total loss of the neural network.
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementations, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured as the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 6 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 6, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 7 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 7, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input-output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (30)

1. A descriptive sentence positioning method for an image, comprising:
analyzing the descriptive sentence to be analyzed and the image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed;
obtaining a plurality of first matching scores according to the plurality of sentence attention weights and a main feature, a position feature and a relation feature of an image to be analyzed, wherein the image to be analyzed comprises a plurality of objects, the main object is an object with the highest attention weight in the plurality of objects, the main feature is a feature of the main object, the position feature is a position feature of the plurality of objects, and the relation feature is a relation feature among the plurality of objects;
obtaining a second matching score between the descriptive statement to be analyzed and the image to be analyzed according to the first matching scores and the image attention weights;
determining a positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score,
the plurality of first matching scores comprise a subject matching score, a position matching score and a relationship matching score, the subject matching score is used for evaluating the matching degree between a subject object in the image to be analyzed and the object description of the statement to be analyzed, the position matching score is used for evaluating the matching degree between the relative positions of a plurality of objects in the image to be analyzed and the position description of the statement to be analyzed, and the relationship matching score is used for evaluating the matching degree between the relevance of the plurality of objects in the image to be analyzed and the relevance description of the statement to be analyzed.
2. The method of claim 1, wherein analyzing the descriptive sentence to be analyzed and the image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed respectively comprises:
extracting the features of the image to be analyzed to obtain an image feature vector of the image to be analyzed;
performing feature extraction on the descriptive statement to be analyzed to obtain participle embedded vectors of a plurality of participles of the descriptive statement to be analyzed;
and obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the plurality of participles.
3. The method of claim 1, further comprising: and acquiring a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed through a neural network.
4. The method of claim 3, wherein the plurality of sentence attention weights comprises a sentence subject weight, a sentence location weight, and a sentence relationship weight, wherein the neural network comprises an image attention network comprising a subject network, a location network, and a relationship network;
wherein obtaining the plurality of first matching scores according to the plurality of sentence attention weights and the main body characteristics, the position characteristics and the relation characteristics of the image to be analyzed comprises:
inputting the statement subject weight and subject characteristics into the subject network for processing to obtain the subject matching score;
inputting the sentence position weight and the position characteristics into the position network for processing to obtain the position matching score;
and inputting the statement relation weight and the relation characteristic into the relation network for processing to obtain the relation matching score.
5. The method of claim 4, wherein the plurality of image attention weights comprise a subject object weight, an object position weight and an object relationship weight, and obtaining a second matching score between the descriptive sentence to be analyzed and the image to be analyzed according to the plurality of first matching scores and the plurality of image attention weights comprises:
and carrying out weighted average on the subject matching score, the position matching score and the relation matching score according to the subject object weight, the object position weight and the object relation weight, and determining the second matching score.
6. The method of claim 1, further comprising:
and inputting the image to be analyzed into a feature extraction network for processing to obtain the main body feature, the position feature and the relation feature.
7. The method of claim 1, wherein determining the positioning result of the descriptive sentence to be analyzed in the image to be analyzed according to the second matching score comprises:
and determining the image area of the main object as the positioning position of the descriptive statement to be analyzed when the second matching score is greater than or equal to a preset threshold value.
8. The method of claim 3, further comprising, before obtaining the plurality of sentence attention weights for the descriptive sentence to be analyzed and the plurality of image attention weights for the image to be analyzed through a neural network: training the neural network with a sample set comprising a plurality of positive sample pairs and a plurality of negative sample pairs,
wherein each positive sample pair comprises a first sample image and a first sample description sentence thereof,
each negative sample pair comprises a first sample image and a second sample descriptive sentence with a word segmentation removed from the first sample descriptive sentence, or the first sample descriptive sentence and the second sample image with a region removed from the first sample image.
9. The method of claim 8, wherein the neural network further comprises a linguistic attention network, the method further comprising:
inputting a first sample description sentence and a first sample image of the positive sample pair into the language attention network to obtain attention weights of a plurality of word segments of the first sample description sentence;
replacing the participle with the highest attention weight in the first sample description sentence by a preset mark to obtain a second sample description sentence;
and taking the first sample image and the second sample description sentence as a negative sample pair.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
inputting a first sample description sentence and a first sample image of the positive sample pair into the image attention network to obtain the attention weight of the first sample image;
removing an image area with the highest attention weight in the first sample image to obtain a second sample image;
and taking the second sample image and the first sample description sentence as a negative sample pair.
11. The method of claim 8, wherein training the neural network with a sample set comprises:
determining an overall loss of the neural network from the first loss and the second loss of the neural network;
training the neural network according to the overall loss.
12. The method of claim 11, further comprising, prior to determining an overall loss of the neural network from the first loss and the second loss of the neural network: obtaining the first loss;
the step of obtaining the first loss comprises:
inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score;
inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score;
a first loss is obtained based on the plurality of first training scores and the plurality of second training scores.
13. The method of claim 11, further comprising, prior to determining an overall loss of the neural network from the first loss and the second loss of the neural network: obtaining the second loss;
the step of obtaining the second loss comprises:
inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score;
inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score;
inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score;
inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score;
and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
14. The method of any one of claims 11-13, wherein determining the overall loss of the neural network from the first loss and the second loss of the neural network comprises:
and performing weighted superposition on the first loss and the second loss to obtain the total loss of the neural network.
15. An apparatus for locating descriptive sentence in image, comprising:
the first weight obtaining module is used for analyzing and processing the descriptive sentence to be analyzed and the image to be analyzed to obtain a plurality of sentence attention weights of the descriptive sentence to be analyzed and a plurality of image attention weights of the image to be analyzed;
a first score obtaining module, configured to obtain a plurality of first matching scores according to the plurality of sentence attention weights and a subject feature, a position feature, and a relationship feature of an image to be analyzed, where the image to be analyzed includes a plurality of objects, a subject object is an object with a highest attention weight among the plurality of objects, the subject feature is a feature of the subject object, the position feature is a position feature of the plurality of objects, and the relationship feature is a relationship feature between the plurality of objects;
a second score obtaining module, configured to obtain a second matching score between the to-be-analyzed descriptive sentence and the to-be-analyzed image according to the plurality of first matching scores and the plurality of image attention weights;
a result determining module, configured to determine, according to the second matching score, a positioning result of the descriptive sentence to be analyzed in the image to be analyzed,
the plurality of first matching scores comprise a subject matching score, a position matching score and a relationship matching score, the subject matching score is used for evaluating the matching degree between a subject object in the image to be analyzed and the object description of the statement to be analyzed, the position matching score is used for evaluating the matching degree between the relative positions of a plurality of objects in the image to be analyzed and the position description of the statement to be analyzed, and the relationship matching score is used for evaluating the matching degree between the relevance of the plurality of objects in the image to be analyzed and the relevance description of the statement to be analyzed.
16. The apparatus of claim 15, wherein the first weight obtaining module comprises:
the image feature extraction submodule is used for extracting features of the image to be analyzed to obtain an image feature vector of the image to be analyzed;
the word segmentation feature extraction submodule is used for extracting features of the descriptive sentence to be analyzed to obtain word segmentation embedded vectors of a plurality of word segmentations of the descriptive sentence to be analyzed;
and the first weight obtaining submodule is used for obtaining a plurality of sentence attention weights of the to-be-analyzed descriptive sentence and a plurality of image attention weights of the to-be-analyzed image according to the image feature vector and the participle embedding vectors of the participles.
17. The apparatus of claim 15, further comprising: a second weight obtaining module, configured to obtain the plurality of sentence attention weights of the descriptive sentence to be analyzed and the plurality of image attention weights of the image to be analyzed through a neural network.
18. The apparatus of claim 17, wherein the plurality of sentence attention weights comprises a sentence subject weight, a sentence position weight, and a sentence relationship weight, and the neural network comprises an image attention network comprising a subject network, a position network, and a relationship network,
the first score obtaining module comprises:
a first score obtaining submodule, configured to input the sentence subject weight and the subject feature into the subject network for processing to obtain the subject matching score;
a second score obtaining submodule, configured to input the sentence position weight and the position feature into the position network for processing to obtain the position matching score;
and a third score obtaining submodule, configured to input the sentence relationship weight and the relationship feature into the relationship network for processing to obtain the relationship matching score.
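For illustration, a minimal sketch of the three branches follows, assuming each of the subject, position, and relationship networks is a small multilayer perceptron that scores a pooled sentence embedding against the corresponding visual feature, and that each sentence weight simply rescales that pooled embedding. The dimensions, the MLP form, and the scalar modulation are assumptions, not the claimed design.

```python
import torch
import torch.nn as nn

def make_branch(feat_dim, sent_dim=300, hidden=256):
    # One branch (subject, position, or relationship): concatenation of the
    # visual feature and a weighted sentence embedding -> scalar matching score.
    return nn.Sequential(nn.Linear(feat_dim + sent_dim, hidden),
                         nn.ReLU(),
                         nn.Linear(hidden, 1))

subject_net = make_branch(feat_dim=512)
position_net = make_branch(feat_dim=5)        # e.g. normalized box coordinates
relationship_net = make_branch(feat_dim=512)

sent_emb = torch.randn(300)                   # pooled sentence embedding
subj = subject_net(torch.cat([torch.randn(512), 0.5 * sent_emb]))      # sentence subject weight 0.5
pos = position_net(torch.cat([torch.randn(5), 0.2 * sent_emb]))        # sentence position weight 0.2
rel = relationship_net(torch.cat([torch.randn(512), 0.3 * sent_emb]))  # sentence relationship weight 0.3
print(subj.item(), pos.item(), rel.item())
```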
19. The apparatus of claim 18, wherein the plurality of image attention weights comprises a subject object weight, an object position weight, and an object relationship weight, and the second score obtaining module comprises:
a fourth score obtaining submodule, configured to perform a weighted average on the subject matching score, the position matching score and the relationship matching score according to the subject object weight, the object position weight and the object relationship weight to determine the second matching score.
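A toy numeric example of this weighted average is given below; the score and weight values are invented purely to show the arithmetic.

```python
def second_matching_score(subj_score, pos_score, rel_score,
                          subject_object_weight, object_position_weight, object_relationship_weight):
    # Weighted average of the three first matching scores; the image attention
    # weights are assumed to be normalized so that they sum to 1.
    return (subject_object_weight * subj_score
            + object_position_weight * pos_score
            + object_relationship_weight * rel_score)

# e.g. scores (0.8, 0.3, 0.6) and weights (0.5, 0.2, 0.3) give 0.4 + 0.06 + 0.18 = 0.64
print(second_matching_score(0.8, 0.3, 0.6, 0.5, 0.2, 0.3))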
20. The apparatus of claim 15, further comprising:
and a third weight obtaining module, configured to input the image to be analyzed into a feature extraction network for processing to obtain the subject feature, the position feature, and the relationship feature.
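As one possible reading of the position feature, the sketch below encodes a candidate object's bounding box as normalized coordinates plus relative area; this 5-dimensional encoding is an illustrative assumption, not the claimed feature extraction network.

```python
import numpy as np

def position_feature(box, img_w, img_h):
    # Normalized location of one candidate object:
    # [x1/W, y1/H, x2/W, y2/H, box area / image area].
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (x2 - x1) * (y2 - y1) / (img_w * img_h)])

print(position_feature((120, 40, 260, 310), img_w=320, img_h=400))
```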
21. The apparatus of claim 15, wherein the result determining module comprises:
a position determining submodule, configured to determine, in a case that the second matching score is greater than or equal to a preset threshold, the image region of the subject object as the positioning position of the descriptive sentence to be analyzed.
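A minimal sketch of this thresholding step follows; the threshold value and the box coordinates are illustrative assumptions.

```python
def locate(second_matching_score, subject_box, threshold=0.5):
    # Return the subject object's image region when the sentence-image match
    # is strong enough; otherwise report that no region is located.
    return subject_box if second_matching_score >= threshold else None

print(locate(0.64, (120, 40, 260, 310)))   # (120, 40, 260, 310)
```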
22. The apparatus of claim 17, further comprising: a training module, configured to train the neural network with a sample set before the second weight obtaining module obtains the attention weights, the sample set comprising a plurality of positive sample pairs and a plurality of negative sample pairs,
wherein each positive sample pair comprises a first sample image and a first sample description sentence thereof,
each negative sample pair comprises the first sample image and a second sample description sentence obtained by removing a word segmentation from the first sample description sentence, or the first sample description sentence and a second sample image obtained by removing a region from the first sample image.
23. The apparatus of claim 22, wherein the neural network further comprises a language attention network, and the apparatus further comprises:
a word segmentation weight determination module, configured to input the first sample description sentence and the first sample image of the positive sample pair into the language attention network, so as to obtain attention weights of a plurality of word segmentations of the first sample description sentence;
a word segmentation replacement module, configured to replace the word segmentation with the highest attention weight in the first sample description sentence with a preset mark to obtain the second sample description sentence;
and a first negative sample pair determination module, configured to take the first sample image and the second sample description sentence as a negative sample pair.
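For illustration, a minimal sketch of the replacement step, assuming the preset mark is an out-of-vocabulary token such as "<unk>" and the word attention weights are already available; both choices are assumptions rather than the claimed operation.

```python
def replace_key_word(word_segmentations, attention_weights, mark="<unk>"):
    # Replace the word segmentation with the highest attention weight by the
    # preset mark, producing the second sample description sentence.
    top = max(range(len(word_segmentations)), key=lambda i: attention_weights[i])
    return [mark if i == top else w for i, w in enumerate(word_segmentations)]

words = ["the", "man", "in", "the", "red", "shirt"]
weights = [0.02, 0.10, 0.03, 0.02, 0.55, 0.28]
print(replace_key_word(words, weights))   # ['the', 'man', 'in', 'the', '<unk>', 'shirt']
```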
24. The apparatus of claim 22 or 23, further comprising:
an image weight determining module, configured to input the first sample description sentence and the first sample image of the positive sample pair into the image attention network, so as to obtain an attention weight of the first sample image;
a region removing module, configured to remove the image region with the highest attention weight from the first sample image to obtain the second sample image;
and a second negative sample pair determination module, configured to take the second sample image and the first sample description sentence as a negative sample pair.
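A minimal sketch of the region removal follows, assuming candidate regions are given as boxes and that "removing" is realized by zero-filling the highest-weight region; both assumptions are illustrative rather than the claimed operation.

```python
import numpy as np

def erase_top_region(image, region_weights, boxes):
    # Remove (here: zero-fill) the candidate region with the highest image
    # attention weight, producing the second sample image.
    x1, y1, x2, y2 = boxes[int(np.argmax(region_weights))]
    erased = image.copy()
    erased[y1:y2, x1:x2, :] = 0
    return erased

image = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
boxes = [(10, 10, 100, 120), (150, 30, 300, 200)]
print(erase_top_region(image, [0.3, 0.7], boxes).shape)   # (240, 320, 3)
```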
25. The apparatus of claim 22, wherein the training module comprises:
an overall loss determining submodule, configured to determine the overall loss of the neural network according to the first loss and the second loss of the neural network;
and a training submodule, configured to train the neural network according to the overall loss.
26. The apparatus of claim 25, further comprising a first loss obtaining submodule, configured to obtain the first loss before the overall loss determining submodule determines the overall loss,
wherein the first loss obtaining submodule is configured to:
inputting a first sample image and a first sample description sentence of the same positive sample pair into the neural network for processing to obtain a first training score;
inputting first sample images and first sample description sentences of different positive sample pairs into the neural network for processing to obtain a second training score;
and obtaining the first loss according to the plurality of first training scores and the plurality of second training scores.
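One plausible instantiation of the first loss is a margin ranking objective over the two sets of training scores, sketched below; the hinge form, the margin value, and the example scores are assumptions, not the claimed loss.

```python
import torch

def first_loss(first_scores, second_scores, margin=0.1):
    # Hinge form: a matched positive pair should outscore a mismatched
    # image/sentence combination by at least `margin`.
    return torch.clamp(margin - first_scores + second_scores, min=0).mean()

first_scores = torch.tensor([0.90, 0.80, 0.85])   # same positive pair
second_scores = torch.tensor([0.40, 0.55, 0.30])  # different positive pairs
print(first_loss(first_scores, second_scores))
```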
27. The apparatus of claim 25, further comprising a second loss obtaining submodule, configured to obtain the second loss before the overall loss determining submodule determines the overall loss,
wherein the second loss obtaining submodule is configured to:
inputting a second sample image and a first sample description sentence of the same negative sample pair into the neural network for processing to obtain a third training score;
inputting second sample images and first sample description sentences of different negative sample pairs into the neural network for processing to obtain a fourth training score;
inputting a first sample image and a second sample description sentence of the same negative sample pair into the neural network for processing to obtain a fifth training score;
inputting the first sample image and the second sample description sentence of different negative sample pairs into the neural network for processing to obtain a sixth training score;
and obtaining a second loss according to the plurality of third training scores, the plurality of fourth training scores, the plurality of fifth training scores and the plurality of sixth training scores.
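Analogously, the second loss can be sketched as a symmetric margin ranking objective over the four sets of negative-pair training scores; again the hinge form, the margin, and the example values are assumptions.

```python
import torch

def second_loss(third, fourth, fifth, sixth, margin=0.1):
    # Two hinge terms over the erased samples: a second sample image (or second
    # sample description sentence) should still match its own pair's sentence
    # (or image) better than one taken from a different pair.
    image_term = torch.clamp(margin - third + fourth, min=0).mean()
    sentence_term = torch.clamp(margin - fifth + sixth, min=0).mean()
    return image_term + sentence_term

third, fourth = torch.tensor([0.70, 0.60]), torch.tensor([0.30, 0.55])
fifth, sixth = torch.tensor([0.65, 0.70]), torch.tensor([0.20, 0.40])
print(second_loss(third, fourth, fifth, sixth))
```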
28. The apparatus of any one of claims 25-27, wherein the overall loss determining submodule is configured to:
perform weighted superposition on the first loss and the second loss to obtain the overall loss of the neural network.
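The weighted superposition itself reduces to a linear combination, for example as below; the weights and the loss values are assumed hyperparameters and placeholders, not values given by the claims.

```python
def overall_loss(first_loss, second_loss, alpha=1.0, beta=1.0):
    # Weighted superposition; alpha and beta are training hyperparameters.
    return alpha * first_loss + beta * second_loss

print(overall_loss(0.42, 0.17))   # 0.59
```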
29. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: perform the method of any one of claims 1 to 14.
30. A computer readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the method of any one of claims 1 to 14.
CN201811459428.7A 2018-11-30 2018-11-30 Image description statement positioning method and device, electronic equipment and storage medium Active CN109614613B (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
CN201811459428.7A CN109614613B (en) 2018-11-30 2018-11-30 Image description statement positioning method and device, electronic equipment and storage medium
KR1020207008623A KR102454930B1 (en) 2018-11-30 2019-05-09 Image description statement positioning method and apparatus, electronic device and storage medium
PCT/CN2019/086274 WO2020107813A1 (en) 2018-11-30 2019-05-09 Method and apparatus for positioning descriptive statement in image, electronic device and storage medium
JP2020517564A JP6968270B2 (en) 2018-11-30 2019-05-09 Image description Position determination method and equipment, electronic devices and storage media
SG11202003836YA SG11202003836YA (en) 2018-11-30 2019-05-09 Method and apparatus for positioning description statement in image, electronic device, and storage medium
TW108142397A TWI728564B (en) 2018-11-30 2019-11-21 Method, device and electronic equipment for image description statement positioning and storage medium thereof
US16/828,226 US11455788B2 (en) 2018-11-30 2020-03-24 Method and apparatus for positioning description statement in image, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811459428.7A CN109614613B (en) 2018-11-30 2018-11-30 Image description statement positioning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109614613A CN109614613A (en) 2019-04-12
CN109614613B true CN109614613B (en) 2020-07-31

Family

ID=66006570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811459428.7A Active CN109614613B (en) 2018-11-30 2018-11-30 Image description statement positioning method and device, electronic equipment and storage medium

Country Status (7)

Country Link
US (1) US11455788B2 (en)
JP (1) JP6968270B2 (en)
KR (1) KR102454930B1 (en)
CN (1) CN109614613B (en)
SG (1) SG11202003836YA (en)
TW (1) TWI728564B (en)
WO (1) WO2020107813A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium
CN110096707B (en) * 2019-04-29 2020-09-29 北京三快在线科技有限公司 Method, device and equipment for generating natural language and readable storage medium
CN110263755B (en) 2019-06-28 2021-04-27 上海鹰瞳医疗科技有限公司 Eye ground image recognition model training method, eye ground image recognition method and eye ground image recognition device
US11113689B2 (en) * 2019-07-03 2021-09-07 Sap Se Transaction policy audit
CN110413819B (en) * 2019-07-12 2022-03-29 深兰科技(上海)有限公司 Method and device for acquiring picture description information
CN110516677A (en) * 2019-08-23 2019-11-29 上海云绅智能科技有限公司 A kind of neural network recognization model, target identification method and system
US11461613B2 (en) * 2019-12-05 2022-10-04 Naver Corporation Method and apparatus for multi-document question answering
CN111277759B (en) * 2020-02-27 2021-08-31 Oppo广东移动通信有限公司 Composition prompting method and device, storage medium and electronic equipment
CN111738186B (en) * 2020-06-28 2024-02-02 香港中文大学(深圳) Target positioning method, target positioning device, electronic equipment and readable storage medium
CN111859005B (en) * 2020-07-01 2022-03-29 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
KR102451299B1 (en) * 2020-09-03 2022-10-06 고려대학교 세종산학협력단 Caption Generation System through Animal Context-Awareness
CN112084319B (en) * 2020-09-29 2021-03-16 四川省人工智能研究院(宜宾) Relational network video question-answering system and method based on actions
WO2022130509A1 (en) * 2020-12-15 2022-06-23 日本電信電話株式会社 Object detection device, object detection method, and object detection program
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
WO2024105752A1 (en) * 2022-11-14 2024-05-23 日本電信電話株式会社 Action-recognition learning device, action-recognition estimation device, action-recognition learning method, and action-recognition learning program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477908B2 (en) * 2014-04-10 2016-10-25 Disney Enterprises, Inc. Multi-level framework for object detection
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108694398A (en) * 2017-04-06 2018-10-23 杭州海康威视数字技术股份有限公司 A kind of image analysis method and device
CN108764083A (en) * 2018-05-17 2018-11-06 淘然视界(杭州)科技有限公司 Object detection method, electronic equipment, storage medium based on natural language expressing

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10007897C1 (en) 2000-02-21 2001-06-28 Siemens Ag Procedure to distribute re-directed postal items
US7181054B2 (en) * 2001-08-31 2007-02-20 Siemens Medical Solutions Health Services Corporation System for processing image representative data
DE602006021408D1 (en) 2005-04-27 2011-06-01 Univ Leiden Medical Ct TREATMENT OF HPV-INDUCED INTRAITITHELIAL ANOGENITAL NEOPLASIA
US7835820B2 (en) * 2005-10-11 2010-11-16 Vanderbilt University System and method for image mapping and visual attention
WO2008017430A1 (en) * 2006-08-07 2008-02-14 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Method for producing scaleable image matrices
TWI464604B (en) * 2010-11-29 2014-12-11 Ind Tech Res Inst Data clustering method and device, data processing apparatus and image processing apparatus
US8428363B2 (en) * 2011-04-29 2013-04-23 Mitsubishi Electric Research Laboratories, Inc. Method for segmenting images using superpixels and entropy rate clustering
CN103106239A (en) * 2012-12-10 2013-05-15 江苏乐买到网络科技有限公司 Identification method and identification device of target in image
TWI528197B (en) * 2013-09-26 2016-04-01 財團法人資訊工業策進會 Photo grouping system, photo grouping method, and computer-readable storage medium
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
GB2545661A (en) * 2015-12-21 2017-06-28 Nokia Technologies Oy A method for analysing media content
CN106777999A (en) * 2016-12-26 2017-05-31 上海联影医疗科技有限公司 Image processing method, system and device
CN108229518B (en) * 2017-02-15 2020-07-10 北京市商汤科技开发有限公司 Statement-based image detection method, device and system
CN109658455B (en) * 2017-10-11 2023-04-18 阿里巴巴集团控股有限公司 Image processing method and processing apparatus
CN108171254A (en) * 2017-11-22 2018-06-15 北京达佳互联信息技术有限公司 Image tag determines method, apparatus and terminal
CN108549850B (en) * 2018-03-27 2021-07-16 联想(北京)有限公司 Image identification method and electronic equipment
US10643112B1 (en) * 2018-03-27 2020-05-05 Facebook, Inc. Detecting content items violating policies of an online system using machine learning based model
CN108874360B (en) * 2018-06-27 2023-04-07 百度在线网络技术(北京)有限公司 Panoramic content positioning method and device
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477908B2 (en) * 2014-04-10 2016-10-25 Disney Enterprises, Inc. Multi-level framework for object detection
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108694398A (en) * 2017-04-06 2018-10-23 杭州海康威视数字技术股份有限公司 A kind of image analysis method and device
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN108764083A (en) * 2018-05-17 2018-11-06 淘然视界(杭州)科技有限公司 Object detection method, electronic equipment, storage medium based on natural language expressing

Also Published As

Publication number Publication date
TW202022561A (en) 2020-06-16
WO2020107813A1 (en) 2020-06-04
SG11202003836YA (en) 2020-07-29
TWI728564B (en) 2021-05-21
KR20200066617A (en) 2020-06-10
KR102454930B1 (en) 2022-10-14
CN109614613A (en) 2019-04-12
US20200226410A1 (en) 2020-07-16
JP2021509979A (en) 2021-04-08
JP6968270B2 (en) 2021-11-17
US11455788B2 (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN109614613B (en) Image description statement positioning method and device, electronic equipment and storage medium
CN109614876B (en) Key point detection method and device, electronic equipment and storage medium
CN110210535B (en) Neural network training method and device and image processing method and device
CN111310616B (en) Image processing method and device, electronic equipment and storage medium
CN107491541B (en) Text classification method and device
CN109089133B (en) Video processing method and device, electronic equipment and storage medium
CN110837761B (en) Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN110009090B (en) Neural network training and image processing method and device
CN110598504B (en) Image recognition method and device, electronic equipment and storage medium
CN109615006B (en) Character recognition method and device, electronic equipment and storage medium
CN110781813B (en) Image recognition method and device, electronic equipment and storage medium
CN113326768B (en) Training method, image feature extraction method, image recognition method and device
CN110781323A (en) Method and device for determining label of multimedia resource, electronic equipment and storage medium
CN113792207A (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN111931844A (en) Image processing method and device, electronic equipment and storage medium
CN111435432A (en) Network optimization method and device, image processing method and device, and storage medium
CN109685041B (en) Image analysis method and device, electronic equipment and storage medium
CN112559673A (en) Language processing model training method and device, electronic equipment and storage medium
CN114880480A (en) Question-answering method and device based on knowledge graph
CN113065361B (en) Method and device for determining user intimacy, electronic equipment and storage medium
CN113553946A (en) Information prompting method and device, electronic equipment and storage medium
CN113052874A (en) Target tracking method and device, electronic equipment and storage medium
CN111178115B (en) Training method and system for object recognition network
CN111984765A (en) Knowledge base question-answering process relation detection method and device
CN111488964A (en) Image processing method and device and neural network training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40003709

Country of ref document: HK

GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 1101-1117, floor 11, No. 58, Beisihuan West Road, Haidian District, Beijing 100080

Patentee after: BEIJING SENSETIME TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084, room 7, floor 3, building 1, No. 710-712, Zhongguancun East Road, Beijing, Haidian District

Patentee before: BEIJING SENSETIME TECHNOLOGY DEVELOPMENT Co.,Ltd.
