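WO2020107813A1 - Image description sentence positioning method and apparatus, electronic device, and storage medium - Google Patents
Image description sentence positioning method and apparatus, electronic device, and storage medium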
- Publication number
- WO2020107813A1 (application PCT/CN2019/086274, CN2019086274W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- sample
- analyzed
- sentence
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5854—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/30—Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1916—Validation; Performance evaluation
Definitions
- The present application relates to, but is not limited to, the field of computer vision technology, and in particular to a method and apparatus for positioning a description sentence of an image, an electronic device, and a storage medium.
- Referring phrase localization is an important problem at the intersection of computer vision and natural language processing.
- For example, a machine may be required to locate, in an image, the object (a person, an item, etc.) described by a given sentence.
- A compositional modular network composed of localization modules and relationship modules has been proposed to identify objects and their relationships.
- However, these models may depend too heavily on specific words or visual concepts and tend to be biased toward frequently observed evidence, so the correspondence between the sentence and the image is poor.
- This application proposes a technical solution for positioning description sentences of images.
- An image description sentence positioning method, which includes: analyzing a description sentence to be analyzed and an image to be analyzed to obtain multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed; obtaining multiple first matching scores according to the multiple sentence attention weights and the subject feature, position feature, and relationship feature of the image to be analyzed, wherein the image to be analyzed includes multiple objects, the subject object is the object with the highest attention weight among the multiple objects, the subject feature is the feature of the subject object, the position feature is the position feature of the multiple objects, and the relationship feature is the feature of the relationships between the multiple objects; obtaining a second matching score between the description sentence to be analyzed and the image to be analyzed according to the multiple first matching scores and the multiple image attention weights; and determining, according to the second matching score, the positioning result of the description sentence to be analyzed in the image to be analyzed.
- An image description sentence positioning apparatus, including: a first weight obtaining module configured to analyze a description sentence to be analyzed and an image to be analyzed to obtain multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed; a first score obtaining module configured to obtain multiple first matching scores according to the multiple sentence attention weights and the subject feature, position feature, and relationship feature of the image to be analyzed, wherein the image to be analyzed includes multiple objects, the subject object is the object with the highest attention weight among the multiple objects, the subject feature is the feature of the subject object, the position feature is the position feature of the multiple objects, and the relationship feature is the feature of the relationships between the multiple objects; a second score obtaining module configured to obtain a second matching score between the description sentence to be analyzed and the image to be analyzed according to the multiple first matching scores and the multiple image attention weights; and a result determination module configured to determine, according to the second matching score, the positioning result of the description sentence to be analyzed in the image to be analyzed.
- an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
- a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above method when executed by a processor.
- The sentence attention weights of the description sentence to be analyzed and the image attention weights of the image to be analyzed can be obtained; multiple first matching scores are obtained according to the sentence attention weights and the subject feature, position feature, and relationship feature of the image; the second matching score is obtained according to the first matching scores and the image attention weights; and the positioning result is determined according to the second matching score, so as to fully discover the correspondence between textual and visual semantics and improve the accuracy of positioning the description sentence in the image.
- FIG. 1 shows a flowchart of an image description sentence positioning method according to an embodiment of the present application.
- FIG. 2 shows a schematic diagram of a neural network according to an embodiment of the present application.
- FIG. 3 shows a schematic diagram of obtaining a second sample description sentence according to an embodiment of the present application.
- FIG. 4 shows a schematic diagram of obtaining a second sample image according to an embodiment of the present application.
- FIG. 5 shows a block diagram of an image description sentence positioning apparatus according to an embodiment of the present application.
- FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present application.
- FIG. 7 shows a block diagram of an electronic device according to an embodiment of the present application.
- The image description sentence positioning method may be executed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc.
- The method can be implemented by a processor calling computer-readable instructions stored in a memory.
- the method can be performed by a server.
- FIG. 1 shows a flowchart of an image description sentence positioning method according to an embodiment of the present application. The method includes:
- In step S11, the description sentence to be analyzed and the image to be analyzed are analyzed to obtain multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed.
- the image to be analyzed may include multiple objects (humans, animals, objects, etc.), for example, multiple people riding horses.
- The description sentence to be analyzed may be a description of an object in the image to be analyzed, for example, "a brown horse ridden by a girl in the middle". There may or may not be a correspondence between the image to be analyzed and the description sentence to be analyzed. The association between the sentence and the image may be determined according to the method of the embodiment of the present application.
- The multiple sentence attention weights of the description sentence to be analyzed may include a sentence subject weight, a sentence position weight, and a sentence relationship weight, which respectively represent the attention weights corresponding to different types of word segments of the description sentence to be analyzed.
- The multiple image attention weights of the image to be analyzed may include a subject object weight, an object position weight, and an object relationship weight, which respectively represent the attention weights corresponding to different types of image regions of the image to be analyzed.
- In step S12, multiple first matching scores are obtained according to the multiple sentence attention weights and the subject feature, position feature, and relationship feature of the image to be analyzed, wherein the image to be analyzed includes multiple objects, the subject object is the object with the highest attention weight among the multiple objects, the subject feature is the feature of the subject object, the position feature is the position feature of the multiple objects, and the relationship feature is the feature of the relationships between the multiple objects.
- The image to be analyzed includes multiple objects (people, animals, items, etc.), and the subject object is the object with the highest attention weight among them. The subject feature is an image feature of the subject object itself, the position feature reflects the relative positions of the multiple objects, and the relationship feature reflects the relationships between the multiple objects.
- the plurality of first matching scores may include subject matching scores, position matching scores, and relationship matching scores.
- the subject matching score is used to evaluate the degree of matching between the subject object in the image to be analyzed and the object description of the description sentence to be analyzed;
- the position matching score is used to evaluate the degree of matching between the relative positions of multiple objects in the image to be analyzed and the position description of the description sentence to be analyzed;
- the relationship matching score is used to evaluate the degree of matching between the association of multiple objects in the image to be analyzed and the associated description of the description sentence to be analyzed.
- In step S13, a second matching score between the description sentence to be analyzed and the image to be analyzed is obtained according to the multiple first matching scores and the multiple image attention weights.
- In this way, the second matching score of the description sentence to be analyzed and the image to be analyzed can be obtained.
- the second matching score is used to evaluate the overall matching degree between the image to be analyzed and the description sentence to be analyzed.
- In step S14, the positioning result of the description sentence to be analyzed in the image to be analyzed is determined according to the second matching score.
- The position of the description sentence to be analyzed in the image to be analyzed may be further determined, so as to realize the positioning of the description sentence in the image.
- The sentence attention weights of the description sentence to be analyzed and the image attention weights of the image to be analyzed can be obtained; multiple first matching scores can be obtained according to the sentence attention weights and the subject feature, position feature, and relationship feature of the image; the second matching score is obtained according to the first matching scores and the image attention weights; and the positioning result is determined according to the second matching score, so as to fully discover the correspondence between textual and visual semantics and improve the accuracy of positioning the description sentence in the image.
- In step S11, the description sentence to be analyzed and the image to be analyzed may be analyzed to obtain multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed.
- step S11 may include:
- feature extraction can be performed on the image to be analyzed and the description sentence to be analyzed separately.
- Feature extraction can be performed on all pixels of the image to be analyzed to obtain an image feature vector e_0 of the image to be analyzed. This application does not limit the feature extraction method for the image to be analyzed.
- Word segmentation can be performed on the description sentence to be analyzed to determine its multiple word segments, and feature extraction is performed on each word segment to obtain the word embedding vectors e_1, ..., e_T of the multiple word segments, where T is the number of word segments (T is an integer greater than 1) and e_t is the embedding vector of the t-th word segment, 1 ≤ t ≤ T.
- The present application does not limit the specific word segmentation method for the description sentence to be analyzed or the specific feature extraction method for each word segment.
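The word segmentation and embedding step can be illustrated with a toy sketch. The source deliberately leaves both methods open, so everything here is an assumption: `segment` uses a plain whitespace split, and `embed` is a hypothetical hash-based stand-in for a learned embedding table.

```python
import hashlib

def segment(sentence):
    """Split a sentence into word segments (whitespace split; an assumption)."""
    return sentence.lower().split()

def embed(token, dim=4):
    """Map a token to a deterministic toy vector in [0, 1)^dim.

    Hypothetical stand-in for a learned word embedding.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]

tokens = segment("brown horse ridden by a girl in the middle")
embeddings = [embed(t) for t in tokens]  # e_1 ... e_T
T = len(tokens)                          # number of word segments
```

A real system would replace both functions with a trained tokenizer and embedding layer; the sketch only fixes the shapes: T word segments, one vector e_t per segment.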
- According to the extracted features, multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed may be determined.
- the method may further include: acquiring a plurality of sentence attention weights of the description sentence to be analyzed and a plurality of image attention weights of the image to be analyzed through a neural network.
- The neural network may include a language attention network, which may be implemented by a network such as a recurrent neural network (RNN) or a long short-term memory (LSTM) network.
- The image to be analyzed and the description sentence to be analyzed may be input into the language attention network for processing to obtain the multiple sentence attention weights and the multiple image attention weights.
- Feature extraction can be performed through the feature extraction sub-network of the language attention network to obtain the image feature vector e_0 and the word embedding vectors e_t, respectively.
- The feature extraction sub-network may be a convolutional neural network (CNN), for example, Faster R-CNN.
- The language attention network may include an LSTM network based on the attention mechanism.
- The image feature vector e_0 can be used as the first-level input of the LSTM network, and the word embedding vectors can be used as the inputs of the respective levels of the LSTM network, to obtain the output states h_t of the multiple hidden layers of the LSTM network.
- The image attention weights and the attention weight of each word segment can then be calculated; by performing a weighted summation of the embedding vectors of the multiple word segments according to their attention weights, the sentence attention weights can be obtained.
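The weighted summation described above can be sketched as follows. The source states only that word-level attention weights are used to sum the word embedding vectors; scoring each hidden state with a dot product against a vector `f` followed by a softmax is an assumption, and `f`, `softmax`, `dot`, and `sentence_attention` are hypothetical names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sentence_attention(hidden_states, embeddings, f):
    """Return (word attention weights a_t, attended embedding q).

    f is a hypothetical scoring vector; one such vector per attention
    type (subject, position, relationship) would give q_subj, q_loc, q_rel.
    """
    a = softmax([dot(f, h) for h in hidden_states])
    dim = len(embeddings[0])
    q = [sum(a[t] * embeddings[t][d] for t in range(len(a))) for d in range(dim)]
    return a, q

# Toy states h_t and embeddings e_t for a three-word sentence.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a_subj, q_subj = sentence_attention(hidden_states, embeddings, [2.0, 0.0])
```

With three different scoring vectors, the same function would yield three different attended embeddings from one sentence, which is what lets each attention type focus on different word segments.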
- The multiple sentence attention weights of the description sentence to be analyzed are word-level attention weights, which may include the sentence subject weight q_subj, the sentence position weight q_loc, and the sentence relationship weight q_rel, respectively used to represent the attention weights corresponding to different types of word segments of the description sentence to be analyzed.
- The sentence subject weight indicates the attention weight when attending to the subject word segments of the sentence, for example, the attention weight of the subject word segment "brown horse" or "horse" in the sentence "brown horse ridden by a girl in the middle"; the sentence position weight indicates the attention weight when attending to the word segments that indicate position, for example, the attention weight of the position word segment "in the middle" in the above sentence; the sentence relationship weight indicates the attention weight when attending to the word segments that express the relationship between objects, for example, the attention weight of the relationship word segment "ridden by a girl" in the above sentence.
- The multiple image attention weights of the image to be analyzed are module-level attention weights, which may include the subject object weight ⁇_subj, the object position weight ⁇_loc, and the object relationship weight ⁇_rel, respectively used to represent the attention weights corresponding to different types of image regions of the image to be analyzed.
- The subject object weight can represent the attention weight when attending to the most important object (the subject object) among the multiple objects (people, animals, items, etc.) in the image, such as the person in the middle of the image; the object position weight can represent the attention weight when attending to the relative positions of multiple objects in the image, such as the middle, left, and right positions of the image; the object relationship weight can represent the attention weight when attending to the relationships between multiple objects in the image, such as the people riding horses in the middle, on the left, and on the right of the image.
- The image attention weight may be determined according to various image parameters of the object in the image, including but not limited to: the distribution position of the object in the image, the area occupied by the object in the image, and the main color of the object in the image. For example, according to the distribution position, an object in the middle of the image can obtain a higher image attention weight than an object at the edge of the image. As another example, an object that occupies a larger area in the image has a higher image attention weight than an object that occupies a smaller area.
- As a further example, if the main color of an object is the color of the tracked target, the object may have a higher image attention weight than objects of other colors.
- The image attention weight may also be determined according to the various presentation states of the object. For example, in image frame analysis of road surveillance video, if the tracked objects are vehicles, a vehicle with a violation has a higher image attention weight; for instance, if a vehicle contained in the image exhibits the behavior of crossing a solid line, it can be configured with a higher image attention weight.
- The above are only examples of image attention weights; specific image attention weights can be configured according to image processing requirements and are not limited to the above examples.
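The position and area examples above can be turned into a small heuristic sketch. The source gives no formula, so the scoring rule here (centrality multiplied by relative area, then normalized) and the function name `object_attention` are assumptions chosen only to reproduce the stated ordering: central and large objects receive more weight.

```python
def object_attention(boxes, width, height):
    """Assign each box (x1, y1, x2, y2) a weight; weights sum to 1.

    Hypothetical heuristic: weight is proportional to
    centrality * relative area.
    """
    cx, cy = width / 2.0, height / 2.0
    max_dist = (cx ** 2 + cy ** 2) ** 0.5  # distance from center to a corner
    raw = []
    for x1, y1, x2, y2 in boxes:
        bx, by = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        dist = ((bx - cx) ** 2 + (by - cy) ** 2) ** 0.5
        centrality = 1.0 - dist / max_dist
        area = (x2 - x1) * (y2 - y1) / float(width * height)
        raw.append(centrality * area)
    total = sum(raw)
    if total == 0.0:
        # Degenerate input: fall back to uniform weights.
        return [1.0 / len(boxes)] * len(boxes)
    return [r / total for r in raw]

# Two equally sized boxes in a 100 x 100 image: one central, one in a corner.
weights = object_attention([(40, 40, 60, 60), (0, 0, 20, 20)], 100, 100)
```

Other cues mentioned in the text, such as main color or a detected violation, could be folded in as further multiplicative factors under the same normalization.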
- the method further includes: inputting the image to be analyzed into a feature extraction network for processing to obtain subject characteristics, position characteristics, and relationship characteristics of the image to be analyzed.
- The feature extraction network may be one or more preset convolutional neural networks (CNNs), for example, Faster R-CNN, which are used to extract the subject feature, position feature, and relationship feature of the image to be analyzed. All pixels of the image to be analyzed can be input into the feature extraction network, and the feature map before ROI pooling can be used as the overall image feature of the image to be analyzed.
- Multiple objects in the image to be analyzed can be identified, the object with the highest attention weight among the multiple regions is extracted as the subject object, and the feature map of the region of the subject object is determined as the subject feature. For example, a 7 × 7 feature map is extracted as the subject feature.
- The position feature can be obtained according to the relative position offsets and relative areas between the image regions where the multiple objects in the image to be analyzed are located, as well as the position and relative area of the object itself.
- The relationship feature can be determined according to the context objects.
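One possible encoding of the position feature is sketched below. The source names the ingredients (the object's own position and relative area, plus relative offsets and relative areas with respect to other objects) but not the layout, so the five-value encodings and the function names are assumptions.

```python
def own_location(box, width, height):
    """Normalized corners and relative area of a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1) / float(width * height)
    return [x1 / width, y1 / height, x2 / width, y2 / height, area]

def relative_location(box, other):
    """Corner offsets and area of `other`, measured in units of `box`."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ox1, oy1, ox2, oy2 = other
    rel_area = (ox2 - ox1) * (oy2 - oy1) / float(w * h)
    return [(ox1 - x1) / w, (oy1 - y1) / h,
            (ox2 - x2) / w, (oy2 - y2) / h, rel_area]

def location_feature(box, others, width, height):
    """Concatenate the box's own location with its offsets to other boxes."""
    feat = own_location(box, width, height)
    for other in others:
        feat.extend(relative_location(box, other))
    return feat

# A 40 x 60 box and an equally sized neighbor to its right, in a 200 x 100 image.
feat = location_feature((20, 20, 60, 80), [(60, 20, 100, 80)], 200, 100)
```

Normalizing by the image size and by the subject box size keeps the feature independent of image resolution, which is the usual reason for preferring relative over absolute coordinates.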
- multiple first matching scores may be obtained according to the multiple sentence attention weights and the subject characteristics, position characteristics, and relationship characteristics of the image to be analyzed.
- the neural network may include an image attention network, and the image attention network includes a subject network, a location network, and a relationship network.
- The subject network, the location network, and the relationship network may each be a pre-built convolutional neural network (CNN).
- The subject network is used to evaluate the degree of matching between the most important object (the subject object) among the multiple objects (people, animals, items, etc.) in the image to be analyzed and the object description of the description sentence to be analyzed; the location network is used to evaluate the degree of matching between the relative positions of multiple objects in the image to be analyzed and the position description of the description sentence to be analyzed; the relationship network is used to evaluate the degree of matching between the relationships of multiple objects in the image to be analyzed and the relationship description of the description sentence to be analyzed.
- The multiple sentence attention weights and the subject feature, position feature, and relationship feature of the image to be analyzed can be input into the subject network, the location network, and the relationship network, respectively, for processing, to evaluate the degree of matching between the image and the description sentence in each aspect.
- The subject object is the object with the highest attention weight among the multiple objects of the image to be analyzed, the subject feature is the feature of the subject object, the position feature is the position feature of the multiple objects, and the relationship feature is the feature of the relationships between the multiple objects.
- the multiple first matching scores obtained in step S12 may include subject matching scores, position matching scores, and relationship matching scores.
- Step S12 may include: inputting the sentence subject weight and the subject feature into the subject network for processing to obtain the subject matching score; inputting the sentence position weight and the position feature into the location network for processing to obtain the position matching score; and inputting the sentence relationship weight and the relationship feature into the relationship network for processing to obtain the relationship matching score.
- By inputting the sentence subject weight and the subject feature into the subject network, the degree of matching between the subject of the description sentence to be analyzed and the subject object of the image to be analyzed can be analyzed to obtain the subject matching score; by inputting the sentence position weight and the position feature into the location network, the degree of matching between the position word segments of the description sentence to be analyzed and the relative positions of the multiple objects of the image to be analyzed can be analyzed to obtain the position matching score; by inputting the sentence relationship weight and the relationship feature into the relationship network, the degree of matching between the relationship word segments of the description sentence to be analyzed and the relationships of the multiple objects of the image to be analyzed can be analyzed to obtain the relationship matching score.
- The sentence attention weights (the sentence subject weight q_subj, the sentence position weight q_loc, and the sentence relationship weight q_rel) and the multiple object features (the subject feature, position feature, and relationship feature) can be input into the subject network, the location network, and the relationship network, respectively.
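The three first matching scores can be illustrated with a stand-in for the three networks. The source uses pre-built CNNs; the sketch below replaces each with a plain cosine similarity between the sentence-side vector and the image-side feature, assuming both have already been projected to the same dimension. The names `cosine` and `first_matching_scores` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def first_matching_scores(q, feats):
    """Score each module (subject, position, relationship) independently.

    q[m] is the sentence-side vector and feats[m] the image-side feature
    for module m; cosine similarity is an assumed stand-in for the networks.
    """
    return {m: cosine(q[m], feats[m]) for m in ("subj", "loc", "rel")}

q = {"subj": [1.0, 0.0], "loc": [1.0, 1.0], "rel": [0.0, 1.0]}
feats = {"subj": [2.0, 0.0], "loc": [1.0, 0.0], "rel": [1.0, 0.0]}
scores = first_matching_scores(q, feats)
```

Keeping the three modules separate means a sentence that matches the subject well but the position poorly is scored accordingly, instead of being collapsed into one similarity.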
- The second matching score may be obtained according to the plurality of first matching scores and the plurality of image attention weights. That is, according to the subject matching score, position matching score, and relationship matching score, together with the subject object weight ⁇_subj, object position weight ⁇_loc, and object relationship weight ⁇_rel, the second matching score between the description sentence to be analyzed and the image to be analyzed is obtained.
- step S13 may include:
- the subject matching score, the position matching score, and the relationship matching score are weighted and averaged to determine the second matching score.
- The subject matching score, position matching score, and relationship matching score can be weighted by the subject object weight ⁇_subj, object position weight ⁇_loc, and object relationship weight ⁇_rel, respectively, and the weighted scores are summed and then averaged.
- The average value may be determined as the second matching score between the description sentence to be analyzed and the image to be analyzed.
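The weighted averaging of step S13 amounts to a few lines of arithmetic. The sketch follows the description literally (weight each score, sum, then divide by the number of scores); the dictionary keys and the function name are illustrative assumptions.

```python
def second_matching_score(scores, weights):
    """Weight each first matching score, sum, and average over the modules."""
    modules = ("subj", "loc", "rel")
    weighted = [weights[m] * scores[m] for m in modules]
    return sum(weighted) / len(modules)

scores = {"subj": 0.9, "loc": 0.6, "rel": 0.3}   # first matching scores
weights = {"subj": 0.5, "loc": 0.3, "rel": 0.2}  # image attention weights
overall = second_matching_score(scores, weights)
```

Here the weighted scores are 0.45, 0.18, and 0.06, which sum to 0.69 and average to about 0.23.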
- In step S14, the positioning result of the description sentence to be analyzed in the image to be analyzed may be determined according to the second matching score. That is, after the second matching score is obtained, the positioning result of the description sentence to be analyzed in the image to be analyzed can be further determined.
- step S14 may include:
- When the second matching score is greater than or equal to a preset threshold, the image area of the subject object is determined as the positioning position of the description sentence to be analyzed.
- A threshold for the matching score (for example, a preset threshold of 70 points) may be set in advance. If the second matching score is greater than or equal to the preset threshold, the description sentence to be analyzed may be considered a description of the subject object in the image to be analyzed, and the image area where the subject object is located can be determined as the positioning position of the description sentence to be analyzed. Conversely, if the second matching score is less than the preset threshold, the description sentence to be analyzed may be considered not to describe the subject object in the image to be analyzed, and the positioning result may be determined as no correspondence. It should be understood that a person skilled in the art may set the preset threshold according to actual conditions, and the specific value of the preset threshold is not limited in this application.
- Multiple subject objects can be set in the image to be analyzed, and the subject feature of each subject object can be input into the image attention network for processing to determine the second matching score of each subject object; the highest score among the multiple second matching scores can then be determined.
- the description sentence to be analyzed may be regarded as a description of the subject object corresponding to the highest score, and the image area where the subject object is located may be determined as the positioning position of the description sentence to be analyzed.
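- The decision in step S14 can be sketched as follows. Combining the highest-score selection with the threshold check in one function is an assumption of this sketch, as are the function name and the (x, y, width, height) box representation of an image area.

```python
def locate(second_scores, regions, threshold=70.0):
    """Return the image area of the best-matching subject object, or None.

    second_scores[i] is the second matching score of candidate subject
    object i, and regions[i] is its image area, represented here as a
    hypothetical (x, y, width, height) box. None stands for the
    "no correspondence" positioning result.
    """
    if not second_scores:
        return None
    # Pick the candidate subject object with the highest second matching score.
    best = max(range(len(second_scores)), key=lambda i: second_scores[i])
    # Accept it only if the score reaches the preset threshold.
    if second_scores[best] >= threshold:
        return regions[best]
    return None


boxes = [(10, 40, 80, 120), (110, 35, 85, 125), (220, 42, 78, 118)]
position = locate([55.0, 82.0, 64.0], boxes)
```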
- FIG. 2 shows a schematic diagram of a neural network according to an embodiment of the present application.
- the neural network may include a language attention network 21 and an image attention network.
- the image attention network includes a subject network 22, a location network 23 and a relationship network 24.
- The description sentence to be analyzed 201 ("the brown horse ridden by a girl in the middle") and the image to be analyzed 202 are input into the language attention network 21 for processing, which outputs three image attention weights (subject object weight ⁇ subj , object position weight ⁇ loc and object relationship weight ⁇ rel ) and, at the same time, three sentence attention weights (sentence subject weight q subj , sentence position weight q loc and sentence relationship weight q rel ).
- the subject feature 203, position feature 204, and relationship feature 205 of the image to be analyzed can be obtained through a feature extraction network (not shown).
- The subject matching score, the position matching score and the relationship matching score are weighted, summed and then averaged to obtain a second matching score 206; the positioning result of the description sentence to be analyzed in the image to be analyzed is then determined according to the second matching score 206, thereby completing the entire implementation process of steps S11-S14.
- Before step S11, the method further includes: training the neural network with a sample set, where the sample set includes a plurality of positive sample pairs and a plurality of negative sample pairs.
- Each positive sample pair includes a first sample image and its first sample description sentence.
- Each negative sample pair includes either a first sample image and a second sample description sentence obtained by removing a word segment from the first sample description sentence, or a first sample description sentence and a second sample image obtained by removing an area from the first sample image.
- An attention-based cross-modal removal method can be used to remove the visual or text information with high attention weight, so as to obtain removed training samples (the second sample description sentence and the second sample image) and improve training accuracy.
- a sample set including multiple training samples may be preset to train the neural network.
- the sample set includes a plurality of positive sample pairs, and each positive sample pair includes a first sample image O and its first sample description sentence Q.
- the sentence describing the object in the first sample image may be used as the first sample description sentence in the same positive sample pair.
- The sample set may further include a plurality of negative sample pairs. Each negative sample pair includes either a first sample image and a second sample description sentence obtained by removing a word segment from the first sample description sentence, or the first sample description sentence and a second sample image obtained by removing an area from the first sample image.
- the method may further include:
- The language attention network can be used to guide the removal of the most important text information, yielding more difficult text training samples; this prevents the neural network from relying too heavily on specific text information (word segments) and improves the accuracy of the trained neural network.
- FIG. 3 shows a schematic diagram of obtaining a second sample description sentence according to an embodiment of the present application.
- The first sample description sentence of the positive sample pair (for example, "the brown horse ridden by a girl in the middle") and the first sample image (for example, a picture including multiple people riding horses) are input into the language attention network to obtain the attention weights of the multiple word segments in the first sample description sentence.
- The word segment with the highest attention weight (for example, "middle") can be determined.
- An unknown identifier can be used to replace the word segment "middle" to obtain a second sample description sentence Q* ("the brown horse ridden by a girl in the 'unknown'"), so that the first sample image and the second sample description sentence can be used as a negative sample pair.
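- The text-side removal described above can be sketched as follows. The function name, the "<unk>" token and the attention values are hypothetical; in practice the attention weights would come from the language attention network.

```python
def erase_top_word(words, attention, unknown_token="<unk>"):
    """Replace the word segment with the highest attention weight.

    words and attention are parallel lists; the returned list is a copy of
    words in which the most-attended segment is replaced by unknown_token.
    """
    if len(words) != len(attention):
        raise ValueError("words and attention must have the same length")
    top = max(range(len(words)), key=lambda i: attention[i])
    erased = list(words)  # copy so the first sample description sentence is kept
    erased[top] = unknown_token
    return erased


words = ["the", "brown", "horse", "ridden", "by", "a", "girl", "in", "the", "middle"]
# Hypothetical attention weights; "middle" receives the highest weight.
attention = [0.02, 0.15, 0.18, 0.05, 0.02, 0.02, 0.16, 0.03, 0.02, 0.35]
second_sentence = erase_top_word(words, attention)
```

- The pair formed by the first sample image and second_sentence then plays the role of the negative sample pair described above.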
- the method may further include:
- The image attention network can be used to identify and remove the most important visual information, yielding more difficult image training samples; this prevents the neural network from relying too heavily on specific visual information and improves the accuracy of the trained neural network.
- FIG. 4 shows a schematic diagram of obtaining a second sample image according to an embodiment of the present application.
- The first sample image of the positive sample pair (for example, a picture including multiple people riding horses) and the first sample description sentence (for example, "the brown horse ridden by a girl in the middle") are input into the image attention network for processing.
- The subject network of the image attention network may be used, or the location network or the relationship network may be used instead, which is not limited in this application.
- Inputting the first sample image and the first sample description sentence into the subject network yields the attention weight of each region of the first sample image.
- The target area with the highest attention weight (for example, the image area where the girl in the middle is located) can be determined and removed from the first sample image to obtain a second sample image O* (as shown in FIG. 4), so that the second sample image and the first sample description sentence can be used as a negative sample pair.
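- The image-side removal can be sketched as follows. The image is modelled as a nested list of pixel values and each region as an (x0, y0, x1, y1) box; filling the removed area with a constant value is an assumption of this sketch, and the names and attention values are hypothetical.

```python
def erase_top_region(image, boxes, attention, fill=0):
    """Blank out the region with the highest attention weight.

    image is a list of rows of pixel values, boxes[i] is an (x0, y0, x1, y1)
    box with exclusive upper bounds, and attention[i] is the attention weight
    of region i. Returns the erased copy and the box that was removed.
    """
    if len(boxes) != len(attention):
        raise ValueError("boxes and attention must have the same length")
    top = max(range(len(boxes)), key=lambda i: attention[i])
    x0, y0, x1, y1 = boxes[top]
    erased = [row[:] for row in image]  # keep the first sample image intact
    for y in range(y0, y1):
        for x in range(x0, x1):
            erased[y][x] = fill
    return erased, boxes[top]


image = [[1] * 6 for _ in range(4)]               # 4 rows x 6 columns
boxes = [(0, 0, 2, 4), (2, 0, 4, 4), (4, 0, 6, 4)]
attention = [0.2, 0.7, 0.1]                       # the middle region dominates
second_image, removed = erase_top_region(image, boxes, attention)
```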
- the step of training the neural network using the sample set may include: determining the overall loss of the neural network according to the first loss and the second loss of the neural network.
- the network loss of the positive sample pair (the first sample image and its first sample description sentence) can be obtained as the first loss.
- the step of training the neural network by using a sample set may further include: training the neural network according to the overall loss.
- The above neural network can be trained according to the overall network loss L, so as to determine the trained neural network. This application does not limit the specific training method of the neural network.
- Before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the first loss.
- the step of obtaining the first loss includes:
- the network loss of the positive sample pair (the first sample image and its first sample description sentence) can be obtained.
- The first sample image O i and the first sample description sentence Q i of the same positive sample pair (O i , Q i ) can be input into the neural network shown in FIG. 2 for processing to obtain the first training score s(O i , Q i ).
- i is the sample number, 1 ≤ i ≤ N, and N is the number of positive sample pairs in the sample set.
- For the first sample image of one positive sample pair and a first sample description sentence that does not correspond to it (O i , Q j ), the two can be input into the neural network shown in FIG. 2 for processing to obtain the second training score s(O i , Q j ).
- j is the sample number, 1 ≤ j ≤ N, and j is not equal to i.
- inputting the first sample image and the first sample description sentence (O j ,Q i ) of different positive sample pairs into the neural network can obtain another second training score s(O j ,Q i ).
- Processing the positive sample pairs (the first sample image and the first sample description sentence) in the training set separately yields multiple first training scores and multiple second training scores, from which the first loss L rank of the original samples can be obtained:
- The operator [x] + represents the maximum of x and 0, that is, x is taken when x is greater than 0, and 0 is taken when x is less than or equal to 0; m can be a constant that represents the margin of the network loss. It should be understood that those skilled in the art can set the value of m (for example, 0.1) according to the actual situation, and the specific value of m is not limited in this application.
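- A margin-based ranking loss consistent with the definitions above can be written as follows. This form is an assumption reconstructed from the first training score s(O i , Q i ), the two second training scores s(O i , Q j ) and s(O j , Q i ), the operator [x] + and the margin m; the use of a single mismatched index j for each i is likewise assumed.

```latex
L_{rank} = \sum_{i=1}^{N} \Big( \big[\, m - s(O_i, Q_i) + s(O_i, Q_j) \,\big]_{+}
         + \big[\, m - s(O_i, Q_i) + s(O_j, Q_i) \,\big]_{+} \Big),
\qquad [x]_{+} = \max(x, 0), \quad j \neq i
```

- Under this form, a term contributes to the loss only when a mismatched pair scores within m of the matched pair, so the loss pushes matched pairs to score at least m higher than mismatched ones.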
- Before determining the overall loss of the neural network according to the first loss and the second loss of the neural network, the method further includes: obtaining the second loss.
- the step of obtaining the second loss includes:
- the network loss of the removed negative samples (the second sample image and the second sample description sentence) can be obtained.
- The second sample image and the first sample description sentence Q i can be input into the neural network shown in FIG. 2 and processed to obtain the third training score.
- i is the sample number, 1 ≤ i ≤ N, and N is the number of sample pairs in the sample set.
- The first sample image and the corresponding second sample description sentence are input into the neural network to obtain the fifth training score.
- For different negative sample pairs, the first sample image and the second sample description sentence are input into the neural network to obtain the sixth training score.
- The multiple positive sample pairs (the first sample image and the first sample description sentence) and the removed negative sample pairs in the training set are processed separately to obtain multiple third training scores, multiple fourth training scores, multiple fifth training scores and multiple sixth training scores, from which the second loss L erase of the removed samples can be obtained:
- The operator [x] + represents the maximum of x and 0, that is, x is taken when x is greater than 0, and 0 is taken when x is less than or equal to 0; m can be a constant that represents the margin of the network loss. It should be understood that those skilled in the art can set the value of m (for example, 0.1) according to the actual situation, and the specific value of m is not limited in this application.
- the overall loss of the neural network may be determined according to the first loss and the second loss, and then the neural network may be trained according to the overall loss.
- The step of determining the overall loss of the neural network according to the first loss and the second loss of the neural network may include: weighting and superimposing the first loss and the second loss to obtain the overall loss of the neural network.
- the overall network loss L of the neural network can be calculated by the following formula:
- ⁇ and ⁇ represent the weights of the first loss and the second loss, respectively. It should be understood that those skilled in the art may set the values of ⁇ and ⁇ according to the actual situation, and the specific values of ⁇ and ⁇ are not limited in this application.
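- The weighted superposition of the two losses can be sketched as follows. The function names, the parameter names w_rank and w_erase, the score-matrix layout and the summation over all mismatched indices are assumptions of this sketch; the value used for the second loss is a placeholder rather than a computed L erase.

```python
def hinge(x):
    """The operator [x]+ : x when x is greater than 0, otherwise 0."""
    return max(x, 0.0)


def ranking_loss(score, m=0.1):
    """Margin ranking loss over a square score matrix.

    score[i][j] holds s(O_i, Q_j), so the diagonal holds the matched
    (positive) pairs. Every mismatched j != i is summed here, which is
    an assumption; a sampled j per i would work the same way.
    """
    n = len(score)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += hinge(m - score[i][i] + score[i][j])  # wrong sentence
            total += hinge(m - score[i][i] + score[j][i])  # wrong image
    return total


def overall_loss(l_rank, l_erase, w_rank=1.0, w_erase=1.0):
    """Weighted superposition of the first loss and the second loss."""
    return w_rank * l_rank + w_erase * l_erase


score = [[0.9, 0.2],
         [0.85, 0.8]]
l_rank = ranking_loss(score, m=0.1)
total = overall_loss(l_rank, l_erase=0.4, w_rank=1.0, w_erase=0.5)
```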
- the above neural network can be trained according to the overall network loss L.
- A back-propagation (reverse gradient) method may be used to adjust the network parameter values of the neural network, after which the overall network loss L is obtained again.
- the trained neural network can be determined. This application does not limit the specific training methods of neural networks.
- The visual or text information with the highest attention weight is eliminated by cross-modal erasing to generate difficult training samples, thereby driving the neural network model to look for complementary evidence beyond the most important evidence.
- An erased image paired with the original query sentence, or an erased query sentence paired with the original image, forms a more difficult training sample, so that the neural network model can make better use of the training data to learn latent text-image correspondences without increasing the complexity of inference.
- the present application can be applied to a terminal such as a robot or a mobile phone, and the position of a person in an image is located according to human guidance (text or voice), thereby achieving accurate correspondence between text and image.
- FIG. 5 shows a block diagram of an image description sentence positioning apparatus according to an embodiment of the present application.
- the image description sentence positioning apparatus includes:
- the first weight obtaining module 51 is configured to perform analysis processing on the description sentence to be analyzed and the image to be analyzed, to obtain multiple sentence attention weights of the description sentence to be analyzed and multiple image attention weights of the image to be analyzed;
- the first score obtaining module 52 is configured to obtain multiple first matching scores according to the multiple sentence attention weights and the subject feature, position feature and relationship feature of the image to be analyzed, wherein the image to be analyzed includes multiple objects, the subject object is the object with the highest attention weight among the multiple objects, the subject feature is the feature of the subject object, the position feature is the position feature of the multiple objects, and the relationship feature is the feature of the relationships between the multiple objects;
- the second score obtaining module 53 is configured to obtain a second match score between the description sentence to be analyzed and the image to be analyzed according to the multiple first matching scores and the multiple image attention weights;
- the result determination module 54 is configured to determine the positioning result of the description sentence to be analyzed in the image to be analyzed according to the second matching score.
- the first weight obtaining module includes:
- an image feature extraction sub-module configured to perform feature extraction on the image to be analyzed to obtain an image feature vector of the image to be analyzed;
- a word segmentation feature extraction sub-module configured to perform feature extraction on the description sentence to be analyzed to obtain word segment embedding vectors of multiple word segments of the description sentence to be analyzed;
- a first weight obtaining sub-module configured to obtain the multiple sentence attention weights of the description sentence to be analyzed and the multiple image attention weights of the image to be analyzed according to the image feature vector and the word segment embedding vectors of the multiple word segments.
- The device further includes: a second weight obtaining module configured to obtain, through a neural network, the multiple sentence attention weights of the description sentence to be analyzed and the multiple image attention weights of the image to be analyzed.
- The multiple sentence attention weights include a sentence subject weight, a sentence position weight and a sentence relationship weight.
- The neural network includes an image attention network.
- The image attention network includes a subject network, a location network and a relationship network.
- The multiple first matching scores include a subject matching score, a position matching score and a relationship matching score.
- the first score obtaining module includes:
- a first score obtaining sub-module configured to input the sentence subject weight and the subject feature into the subject network for processing to obtain the subject matching score;
- a second score obtaining sub-module configured to input the sentence position weight and the position feature into the location network for processing to obtain the position matching score;
- a third score obtaining sub-module configured to input the sentence relationship weight and the relationship feature into the relationship network for processing to obtain the relationship matching score.
- the multiple image attention weights include subject object weights, object position weights, and object relationship weights
- the second score obtaining module includes:
- a fourth score obtaining sub-module configured to perform weighted averaging on the subject matching score, the position matching score and the relationship matching score according to the subject object weight, the object position weight and the object relationship weight, to determine the second matching score.
- the device further includes:
- the third weight obtaining module is configured to input the image to be analyzed into a feature extraction network for processing to obtain the subject feature, the position feature and the relationship feature.
- the result determination module includes:
- the position determination submodule is configured to determine the image area of the subject object as the positioning position of the description sentence to be analyzed when the second matching score is greater than or equal to a preset threshold.
- Before the second weight obtaining module, the device further includes: a training module for training the neural network with a sample set, where the sample set includes multiple positive sample pairs and multiple negative sample pairs.
- Each positive sample pair includes a first sample image and its first sample description sentence.
- Each negative sample pair includes either a first sample image and a second sample description sentence obtained by removing a word segment from the first sample description sentence, or a first sample description sentence and a second sample image obtained by removing an area from the first sample image.
- the neural network further includes a language attention network
- the device further includes:
- the word segmentation weight determination module is used to input the first sample description sentence and the first sample image of the positive sample pair into the language attention network to obtain the attention of multiple word segments of the first sample description sentence Weights;
- the word segmentation module is used to replace the word segmentation with the highest attention weight in the first sample description sentence with a predetermined identifier to obtain the second sample description sentence;
- the first negative sample pair determining module is configured to use the first sample image and the second sample description sentence as negative sample pairs.
- the device further includes:
- an image weight determination module configured to input the first sample description sentence of the positive sample pair and the first sample image into the image attention network to obtain the attention weights of the regions of the first sample image;
- an area removal module configured to remove the image area with the highest attention weight in the first sample image to obtain a second sample image;
- the second negative sample pair determination module is configured to use the second sample image and the first sample description sentence as negative sample pairs.
- the training module includes:
- the overall loss determination submodule is configured to determine the overall loss of the neural network according to the first loss and the second loss of the neural network;
- the training sub-module is configured to train the neural network according to the overall loss.
- The device further includes: a first loss obtaining sub-module configured to obtain the first loss before the overall loss determination sub-module; the first loss obtaining sub-module is configured to:
- The device further includes: a second loss obtaining sub-module configured to obtain the second loss before the overall loss determination sub-module; the second loss obtaining sub-module is configured to:
- The overall loss determination sub-module is configured to:
- weight and superimpose the first loss and the second loss to obtain the overall loss of the neural network.
- the functions provided by the apparatus provided in the embodiments of the present application or the modules contained therein may be used to perform the methods described in the foregoing method embodiments.
- An embodiment of the present application also proposes a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- An embodiment of the present application further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
- the electronic device may be provided as a terminal, server, or other form of device.
- FIG. 6 shows a block diagram of an electronic device 800 according to an embodiment of the present application.
- the electronic device 800 may be a terminal such as a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness device, and personal digital assistant.
- The electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
- the processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
- the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps in the above method.
- the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components.
- the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
- the memory 804 is configured to store various types of data to support operation at the electronic device 800. Examples of these data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and so on.
- The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
- the power supply component 806 provides power to various components of the electronic device 800.
- the power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
- the multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
- the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
- the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
- the audio component 810 is configured to output and/or input audio signals.
- the audio component 810 includes a microphone (MIC).
- the microphone is configured to receive an external audio signal.
- the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
- the audio component 810 further includes a speaker for outputting audio signals.
- the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
- the peripheral interface module may be a keyboard, a click wheel, or a button. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
- the sensor component 814 includes one or more sensors for providing the electronic device 800 with status evaluation in various aspects.
- For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a temperature change of the electronic device 800.
- the sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
- the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor assembly 814 may also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
- the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
- the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
- The electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, to implement the above method.
- a non-volatile computer-readable storage medium is also provided, such as a memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.
- the electronic device 1900 may be provided as a server.
- the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by the memory 1932, for storing instructions executable by the processing component 1922, such as application programs.
- the application programs stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
- the processing component 1922 is configured to execute instructions to perform the above method.
- The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
- The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
- a non-volatile computer-readable storage medium is also provided, for example, a memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
- This application may be a system, method, and/or computer program product.
- the computer program product may include a computer-readable storage medium, which is loaded with computer-readable program instructions for causing the processor to implement various aspects of the present application.
- the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
- the computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanical encoding devices, such as punch cards or raised structures in a groove on which instructions are stored, and any suitable combination of the above.
- the computer-readable storage medium used herein is not to be interpreted as a transient signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, optical pulses through fiber optic cables), or electrical signals transmitted through wires.
- the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device through a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.
- the computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages.
- the programming languages include object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages.
- Computer readable program instructions can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
- electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs) or programmable logic arrays (PLAs), can be personalized by utilizing the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
- These computer-readable program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, create a device that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
- the computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause the computer, programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions to implement various aspects of the functions/acts specified in one or more blocks in the flowchart and/or block diagram.
- the computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operating steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing device, or other equipment implement the functions/acts specified in one or more blocks in the flowchart and/or block diagram.
- each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more executable instructions.
- the functions marked in the blocks may also occur in an order different from that marked in the drawings. For example, two consecutive blocks can actually be executed substantially in parallel, and sometimes they can also be executed in reverse order, depending on the functions involved.
- each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with dedicated hardware-based systems that perform the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.
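The remark that two consecutive flowchart blocks may run substantially in parallel, or in reverse order, can be illustrated with a minimal Python sketch. It assumes the two blocks do not depend on each other's output. The functions `encode_sentence` and `measure_image` are hypothetical stand-ins chosen for illustration and are not steps taken from the application.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_sentence(sentence):
    """Hypothetical block A: split a description sentence into lowercase tokens."""
    return sentence.lower().split()


def measure_image(size):
    """Hypothetical block B: compute the pixel count of a (width, height) image."""
    width, height = size
    return width * height


def run_sequential(sentence, size, reverse=False):
    """Run the two blocks one after the other, optionally in reverse order."""
    if reverse:
        pixels = measure_image(size)
        tokens = encode_sentence(sentence)
    else:
        tokens = encode_sentence(sentence)
        pixels = measure_image(size)
    return tokens, pixels


def run_parallel(sentence, size):
    """Submit both blocks to a thread pool so that they may overlap in time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        tokens_future = pool.submit(encode_sentence, sentence)
        pixels_future = pool.submit(measure_image, size)
        # result() blocks until the corresponding block has finished.
        return tokens_future.result(), pixels_future.result()


if __name__ == "__main__":
    forward = run_sequential("A brown horse", (4, 3))
    backward = run_sequential("A brown horse", (4, 3), reverse=True)
    concurrent = run_parallel("A brown horse", (4, 3))
    # All three schedules agree because neither block reads the other's output.
    print(forward == backward == concurrent)
```

Because neither block consumes the other's result, the execution order does not change the outcome. If one block needed the other's output, the order shown in the drawings would have to be preserved.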
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Library & Information Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biodiversity & Conservation Biology (AREA)
- Image Analysis (AREA)
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020517564A JP6968270B2 (ja) | 2018-11-30 | 2019-05-09 | 画像の記述文位置決定方法及び装置、電子機器並びに記憶媒体 |
| KR1020207008623A KR102454930B1 (ko) | 2018-11-30 | 2019-05-09 | 이미지의 디스크립션 스테이트먼트 포지셔닝 방법 및 장치, 전자 기기 및 저장 매체 |
| SG11202003836YA SG11202003836YA (en) | 2018-11-30 | 2019-05-09 | Method and apparatus for positioning description statement in image, electronic device, and storage medium |
| US16/828,226 US11455788B2 (en) | 2018-11-30 | 2020-03-24 | Method and apparatus for positioning description statement in image, electronic device, and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811459428.7A CN109614613B (zh) | 2018-11-30 | 2018-11-30 | 图像的描述语句定位方法及装置、电子设备和存储介质 |
| CN201811459428.7 | 2018-11-30 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/828,226 Continuation US11455788B2 (en) | 2018-11-30 | 2020-03-24 | Method and apparatus for positioning description statement in image, electronic device, and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020107813A1 (zh) | 2020-06-04 |
Family
ID=66006570
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2019/086274 Ceased WO2020107813A1 (zh) | 2018-11-30 | 2019-05-09 | 图像的描述语句定位方法及装置、电子设备和存储介质 |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US11455788B2 |
| JP (1) | JP6968270B2 |
| KR (1) | KR102454930B1 |
| CN (1) | CN109614613B |
| SG (1) | SG11202003836YA |
| TW (1) | TWI728564B |
| WO (1) | WO2020107813A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111738186A (zh) * | 2020-06-28 | 2020-10-02 | 香港中文大学(深圳) | 目标定位方法、装置、电子设备及可读存储介质 |
| CN116012835A (zh) * | 2023-02-20 | 2023-04-25 | 张国栋 | 一种基于文本分割的两阶段场景文本擦除方法 |
Families Citing this family (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109614613B (zh) * | 2018-11-30 | 2020-07-31 | 北京市商汤科技开发有限公司 | 图像的描述语句定位方法及装置、电子设备和存储介质 |
| CN110096707B (zh) * | 2019-04-29 | 2020-09-29 | 北京三快在线科技有限公司 | 生成自然语言的方法、装置、设备及可读存储介质 |
| CN110263755B (zh) * | 2019-06-28 | 2021-04-27 | 上海鹰瞳医疗科技有限公司 | 眼底图像识别模型训练方法、眼底图像识别方法和设备 |
| US11308492B2 (en) | 2019-07-03 | 2022-04-19 | Sap Se | Anomaly and fraud detection with fake event detection using pixel intensity testing |
| US20210004795A1 (en) * | 2019-07-03 | 2021-01-07 | Sap Se | Anomaly and fraud detection using duplicate event detector |
| US12039615B2 (en) | 2019-07-03 | 2024-07-16 | Sap Se | Anomaly and fraud detection with fake event detection using machine learning |
| CN110413819B (zh) * | 2019-07-12 | 2022-03-29 | 深兰科技(上海)有限公司 | 一种图片描述信息的获取方法及装置 |
| CN110516677A (zh) * | 2019-08-23 | 2019-11-29 | 上海云绅智能科技有限公司 | 一种神经网络识别模型、目标识别方法及系统 |
| US11461613B2 (en) * | 2019-12-05 | 2022-10-04 | Naver Corporation | Method and apparatus for multi-document question answering |
| CN111277759B (zh) * | 2020-02-27 | 2021-08-31 | Oppo广东移动通信有限公司 | 构图提示方法、装置、存储介质及电子设备 |
| CN111859005B (zh) * | 2020-07-01 | 2022-03-29 | 江西理工大学 | 一种跨层多模型特征融合与基于卷积解码的图像描述方法 |
| KR102451299B1 (ko) * | 2020-09-03 | 2022-10-06 | 고려대학교 세종산학협력단 | 동물의 상황인지를 통한 캡션 생성 시스템 |
| CN112084319B (zh) * | 2020-09-29 | 2021-03-16 | 四川省人工智能研究院(宜宾) | 一种基于动作的关系网络视频问答系统及方法 |
| WO2022130509A1 (ja) * | 2020-12-15 | 2022-06-23 | 日本電信電話株式会社 | 物体検出装置、物体検出方法、及び物体検出プログラム |
| CN113298083B (zh) * | 2021-02-25 | 2025-03-07 | 阿里巴巴集团控股有限公司 | 一种数据处理方法及装置 |
| US12147497B2 (en) * | 2021-05-19 | 2024-11-19 | Baidu Usa Llc | Systems and methods for cross-lingual cross-modal training for multimodal retrieval |
| CN113761153B (zh) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | 基于图片的问答处理方法、装置、可读介质及电子设备 |
| US12482464B2 (en) | 2021-12-07 | 2025-11-25 | Deepmind Technologies Limited | Controlling interactive agents using multi-modal inputs |
| CN117911727A (zh) * | 2022-07-08 | 2024-04-19 | 鸿海精密工业股份有限公司 | 辨识方法及其电子装置 |
| JP7835304B2 (ja) * | 2022-11-14 | 2026-03-25 | Ntt株式会社 | 行動認識学習装置、行動認識推定装置、行動認識学習方法、及び行動認識学習プログラム |
| CN118037888B (zh) * | 2024-02-01 | 2024-10-01 | 嘉达鼎新信息技术(苏州)有限公司 | 基于图像分析和语言描述的ai图像生成方法及系统 |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103106239A (zh) * | 2012-12-10 | 2013-05-15 | 江苏乐买到网络科技有限公司 | 一种图像中对象的识别方法和装置 |
| US20170177972A1 (en) * | 2015-12-21 | 2017-06-22 | Nokia Technologies Oy | Method for analysing media content |
| CN108171254A (zh) * | 2017-11-22 | 2018-06-15 | 北京达佳互联信息技术有限公司 | 图像标签确定方法、装置及终端 |
| CN108549850A (zh) * | 2018-03-27 | 2018-09-18 | 联想(北京)有限公司 | 一种图像识别方法及电子设备 |
| CN108874360A (zh) * | 2018-06-27 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | 全景内容定位方法和装置 |
| CN109614613A (zh) * | 2018-11-30 | 2019-04-12 | 北京市商汤科技开发有限公司 | 图像的描述语句定位方法及装置、电子设备和存储介质 |
Family Cites Families (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE10007897C1 (de) | 2000-02-21 | 2001-06-28 | Siemens Ag | Verfahren zum Verteilen von Sendungen |
| US7181054B2 (en) * | 2001-08-31 | 2007-02-20 | Siemens Medical Solutions Health Services Corporation | System for processing image representative data |
| DE602006021408D1 (de) | 2005-04-27 | 2011-06-01 | Univ Leiden Medical Ct | Behandlung von hpv-induzierter intraepithelialer anogenitaler neoplasien |
| US7835820B2 (en) * | 2005-10-11 | 2010-11-16 | Vanderbilt University | System and method for image mapping and visual attention |
| WO2008017430A1 (de) * | 2006-08-07 | 2008-02-14 | MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. | Verfahren zur herstellung skalierbarer bildmatrizen |
| TWI464604B (zh) * | 2010-11-29 | 2014-12-11 | Ind Tech Res Inst | 資料分群方法與裝置、資料處理裝置及影像處理裝置 |
| US8428363B2 (en) * | 2011-04-29 | 2013-04-23 | Mitsubishi Electric Research Laboratories, Inc. | Method for segmenting images using superpixels and entropy rate clustering |
| TWI528197B (zh) * | 2013-09-26 | 2016-04-01 | 財團法人資訊工業策進會 | 相片分群系統及相片分群方法與電腦可讀取記錄媒體 |
| US9477908B2 (en) * | 2014-04-10 | 2016-10-25 | Disney Enterprises, Inc. | Multi-level framework for object detection |
| US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
| CN106777999A (zh) * | 2016-12-26 | 2017-05-31 | 上海联影医疗科技有限公司 | 图像处理方法、系统和装置 |
| CN108229518B (zh) * | 2017-02-15 | 2020-07-10 | 北京市商汤科技开发有限公司 | 基于语句的图像检测方法、装置和系统 |
| CN108229272B (zh) * | 2017-02-23 | 2020-11-27 | 北京市商汤科技开发有限公司 | 视觉关系检测方法和装置及视觉关系检测训练方法和装置 |
| CN108694398B (zh) * | 2017-04-06 | 2020-10-30 | 杭州海康威视数字技术股份有限公司 | 一种图像分析方法及装置 |
| CN108228686B (zh) * | 2017-06-15 | 2021-03-23 | 北京市商汤科技开发有限公司 | 用于实现图文匹配的方法、装置和电子设备 |
| CN109658455B (zh) * | 2017-10-11 | 2023-04-18 | 阿里巴巴集团控股有限公司 | 图像处理方法和处理设备 |
| CN108108771A (zh) * | 2018-01-03 | 2018-06-01 | 华南理工大学 | 基于多尺度深度学习的图像问答方法 |
| US10643112B1 (en) * | 2018-03-27 | 2020-05-05 | Facebook, Inc. | Detecting content items violating policies of an online system using machine learning based model |
| CN108764083A (zh) * | 2018-05-17 | 2018-11-06 | 淘然视界(杭州)科技有限公司 | 基于自然语言表达的目标检测方法、电子设备、存储介质 |
2018
- 2018-11-30: CN application CN201811459428.7A, patent CN109614613B (zh), status: Active
2019
- 2019-05-09: WO application PCT/CN2019/086274, patent WO2020107813A1 (zh), status: Ceased
- 2019-05-09: KR application KR1020207008623A, patent KR102454930B1 (ko), status: Active
- 2019-05-09: SG application SG11202003836YA, patent SG11202003836YA (en), status: unknown
- 2019-05-09: JP application JP2020517564A, patent JP6968270B2 (ja), status: Active
- 2019-11-21: TW application TW108142397A, patent TWI728564B (zh), status: Active
2020
- 2020-03-24: US application US16/828,226, patent US11455788B2 (en), status: Active
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103106239A (zh) * | 2012-12-10 | 2013-05-15 | 江苏乐买到网络科技有限公司 | 一种图像中对象的识别方法和装置 |
| US20170177972A1 (en) * | 2015-12-21 | 2017-06-22 | Nokia Technologies Oy | Method for analysing media content |
| CN108171254A (zh) * | 2017-11-22 | 2018-06-15 | 北京达佳互联信息技术有限公司 | 图像标签确定方法、装置及终端 |
| CN108549850A (zh) * | 2018-03-27 | 2018-09-18 | 联想(北京)有限公司 | 一种图像识别方法及电子设备 |
| CN108874360A (zh) * | 2018-06-27 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | 全景内容定位方法和装置 |
| CN109614613A (zh) * | 2018-11-30 | 2019-04-12 | 北京市商汤科技开发有限公司 | 图像的描述语句定位方法及装置、电子设备和存储介质 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111738186A (zh) * | 2020-06-28 | 2020-10-02 | 香港中文大学(深圳) | 目标定位方法、装置、电子设备及可读存储介质 |
| CN111738186B (zh) * | 2020-06-28 | 2024-02-02 | 香港中文大学(深圳) | 目标定位方法、装置、电子设备及可读存储介质 |
| CN116012835A (zh) * | 2023-02-20 | 2023-04-25 | 张国栋 | 一种基于文本分割的两阶段场景文本擦除方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| KR102454930B1 (ko) | 2022-10-14 |
| JP2021509979A (ja) | 2021-04-08 |
| TW202022561A (zh) | 2020-06-16 |
| KR20200066617A (ko) | 2020-06-10 |
| SG11202003836YA (en) | 2020-07-29 |
| CN109614613A (zh) | 2019-04-12 |
| CN109614613B (zh) | 2020-07-31 |
| TWI728564B (zh) | 2021-05-21 |
| US20200226410A1 (en) | 2020-07-16 |
| US11455788B2 (en) | 2022-09-27 |
| JP6968270B2 (ja) | 2021-11-17 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| TWI728564B (zh) | 圖像的描述語句定位方法及電子設備和儲存介質 | |
| US11120078B2 (en) | Method and device for video processing, electronic device, and storage medium | |
| TWI766286B (zh) | 圖像處理方法及圖像處理裝置、電子設備和電腦可讀儲存媒介 | |
| TWI759722B (zh) | 神經網路訓練方法及裝置、圖像處理方法及裝置、電子設備和計算機可讀存儲介質 | |
| CN115100472B (zh) | 展示对象识别模型的训练方法、装置和电子设备 | |
| CN114266840B (zh) | 图像处理方法、装置、电子设备及存储介质 | |
| CN107491541B (zh) | 文本分类方法及装置 | |
| CN109145213B (zh) | 基于历史信息的查询推荐方法及装置 | |
| CN110009090B (zh) | 神经网络训练与图像处理方法及装置 | |
| CN116166843B (zh) | 基于细粒度感知的文本视频跨模态检索方法和装置 | |
| JP2022522551A (ja) | 画像処理方法及び装置、電子機器並びに記憶媒体 | |
| CN110909815A (zh) | 神经网络训练、图像处理方法、装置及电子设备 | |
| CN111523599B (zh) | 目标检测方法及装置、电子设备和存储介质 | |
| CN110781813B (zh) | 图像识别方法及装置、电子设备和存储介质 | |
| CN110659690B (zh) | 神经网络的构建方法及装置、电子设备和存储介质 | |
| EP3734472A1 (en) | Method and device for text processing | |
| CN116543211B (zh) | 图像属性编辑方法、装置、电子设备和存储介质 | |
| CN112328809A (zh) | 实体分类方法、装置及计算机可读存储介质 | |
| CN110019960A (zh) | 数据处理方法及装置、电子设备和存储介质 | |
| CN110633715B (zh) | 图像处理方法、网络训练方法及装置、和电子设备 | |
| CN111178115B (zh) | 对象识别网络的训练方法及系统 | |
| CN115146633A (zh) | 一种关键词识别方法、装置、电子设备及存储介质 | |
| CN111241844A (zh) | 一种信息推荐方法及装置 | |
| CN110781975A (zh) | 图像处理方法及装置、电子设备和存储介质 | |
| CN116484828A (zh) | 相似案情确定方法、装置、设备、介质和程序产品 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2020517564; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22/09/2021) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19890519; Country of ref document: EP; Kind code of ref document: A1 |