WO2021056750A1 - Retrieval method and apparatus, and storage medium - Google Patents

Retrieval method and apparatus, and storage medium

Info

Publication number
WO2021056750A1
WO2021056750A1 · PCT/CN2019/118196 · CN2019118196W
Authority
WO
WIPO (PCT)
Prior art keywords
character
similarity
video
text
retrieval
Prior art date
Application number
PCT/CN2019/118196
Other languages
English (en)
French (fr)
Inventor
熊宇
黄青虬
郭凌峰
周航
周博磊
林达华
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司 filed Critical 北京市商汤科技开发有限公司
Priority to SG11202107151TA priority Critical patent/SG11202107151TA/en
Priority to KR1020217011348A priority patent/KR20210060563A/ko
Priority to JP2021521293A priority patent/JP7181999B2/ja
Publication of WO2021056750A1 publication Critical patent/WO2021056750A1/zh
Priority to US17/362,803 priority patent/US20210326383A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • G06F16/7343Query language or query format
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/786Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274Syntactic or semantic context, e.g. balancing

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to a retrieval method, device, and storage medium.
  • the present disclosure provides a technical solution of a retrieval method.
  • a retrieval method includes: determining a first similarity between a text and at least one video, where the text is used to characterize a retrieval condition; determining a first character interaction graph of the text and a second character interaction graph of the at least one video; determining a second similarity between the first character interaction graph and the second character interaction graph; and determining, from the at least one video according to the first similarity and the second similarity, a video that matches the retrieval condition.
  • the present disclosure determines the first similarity between the text and at least one video, as well as the second similarity between the first character interaction graph of the text and the second character interaction graph of the at least one video; it can therefore use information such as the grammatical structure of the text itself and the event structure of the video itself to perform video retrieval, thereby improving the accuracy of retrieving videos such as movies based on text descriptions.
  • the determining the first similarity between the text and the at least one video includes: determining a paragraph feature of the text; determining a video feature of the at least one video; and determining the first similarity between the text and the at least one video according to the paragraph feature of the text and the video feature of the at least one video.
  • in this way, the similarity of the direct match between the video and the text can be obtained, which provides a reference for subsequently determining the video that matches the retrieval condition.
  • the paragraph features include sentence features and the number of sentences; the video features include shot features and the number of shots.
  • in this way, the text and the video are quantified, so that the paragraph features of the text and the video features of the video can be analyzed, providing a basis for the subsequent analysis.
  • the determining the first character interaction graph of the text includes: detecting the name of the person contained in the text; searching a database for the portrait of the person corresponding to the name, and extracting the image feature of the portrait to obtain the character node of the person; parsing and determining the semantic tree of the text, and obtaining the movement feature of the person based on the semantic tree to obtain the action node of the person; and connecting the character node and the action node corresponding to each person; wherein the character nodes are characterized by the image features of the portraits, and the action nodes are characterized by the motion features in the semantic tree.
  • each paragraph of the text describes an event in the video.
  • the narrative structure of the video is captured by constructing the character interaction graph of the text, which provides a reference for subsequently determining the video that matches the retrieval condition.
  • the method further includes: connecting the character nodes that are connected to the same action node to each other.
  • the detecting the name of the person included in the text includes: replacing the pronoun in the text with the name of the person represented by the pronoun.
  • the determining the second character interaction graph of the at least one video includes: detecting persons in each shot of the at least one video; extracting the human body features and movement features of each person; attaching the human body features of the person to the character node of the person, and attaching the movement features of the person to the action node of the person; and connecting the character node and the action node corresponding to each person.
  • in this way, the present disclosure proposes a graph-based character interaction representation, and the similarity between the character interaction graph of the video and the character interaction graph of the text provides a reference for subsequently determining videos that match the retrieval conditions.
  • the determining the second character interaction graph of the at least one video further includes: taking characters that appear in the same shot at the same time as the same group of characters, and connecting the character nodes of the characters in the same group in pairs.
  • the determining the second character interaction map of the at least one video further includes: connecting a character in one shot with the character node of each character in the adjacent shot.
  • the determining, from the at least one video, a video that matches the retrieval condition according to the first similarity and the second similarity includes: weighting and summing the first similarity and the second similarity of each video to obtain a similarity value of each video; and determining the video with the highest similarity value as the video that matches the retrieval condition.
  • the retrieval method is implemented by a retrieval network, and the method further includes: determining a first similarity prediction value between a text and a video in a training sample set, where the text is used to characterize the retrieval condition; determining a second similarity between the first character interaction graph of the text and the second character interaction graph of the video in the training sample set; determining the loss of the first similarity according to the first similarity prediction value and the first similarity true value; determining the loss of the second similarity according to the second similarity prediction value and the second similarity true value; determining a total loss value according to the loss of the first similarity and the loss of the second similarity in combination with a loss function; and adjusting the weight parameters of the retrieval network according to the total loss value.
  • the retrieval network includes a first sub-network and a second sub-network; the first sub-network is used to determine the first similarity between the text and the video, and the second sub-network is used to determine the similarity between the first character interaction graph of the text and the second character interaction graph of the video; the adjusting the weight parameters of the retrieval network according to the total loss value includes: adjusting the weight parameters of the first sub-network and the second sub-network based on the total loss value.
  • determining different similarities through different sub-networks is helpful to quickly obtain the first similarity and the second similarity related to the retrieval conditions, and thus can quickly retrieve videos that are compatible with the retrieval conditions.
  • a retrieval device comprising: a first determining module configured to determine a first similarity between a text and at least one video, the text being used to characterize retrieval conditions
  • the second determining module is configured to determine the first character interaction graph of the text and the second character interaction graph of the at least one video, and to determine a second similarity between the first character interaction graph and the second character interaction graph;
  • a processing module configured to determine a video that matches the retrieval condition from the at least one video based on the first degree of similarity and the second degree of similarity.
  • the first determining module is configured to: determine a paragraph feature of the text; determine a video feature of the at least one video; according to the paragraph feature of the text and the at least one The video feature of the video determines the first similarity between the text and the at least one video.
  • the paragraph features include sentence features and the number of sentences; the video features include shot features and the number of shots.
  • the second determining module is configured to: detect the name of the person contained in the text; search a database for the portrait of the person corresponding to the name, and extract the image feature of the portrait to obtain the character node of the person; parse and determine the semantic tree of the text, and obtain the movement feature of the person based on the semantic tree to obtain the action node of the person; and connect the character node and the action node corresponding to each person; wherein the character node of the person is characterized by the image feature of the portrait, and the action node of the person is characterized by the motion feature in the semantic tree.
  • the second determining module is further configured to connect the role nodes connected to the same action node to each other.
  • the second determining module is configured to replace the pronoun in the text with the name of the person represented by the pronoun.
  • the second determining module is configured to: detect persons in each shot of the at least one video; extract the human body features and movement features of each person; attach the human body features of the person to the character node of the person, and attach the movement features of the person to the action node of the person; and connect the character node and the action node corresponding to each person.
  • the second determining module is further configured to: regard characters appearing in the same shot at the same time as the same group of characters, and connect the character nodes of the characters in the same group in pairs.
  • the second determining module is further configured to connect a character in one shot with the character node of each character in the adjacent shot.
  • the processing module is configured to: weighted and sum the first similarity and the second similarity of each video to obtain the similarity value of each video; The video with the highest similarity value is determined as the video that matches the retrieval condition.
  • the retrieval device is implemented through a retrieval network, and the device further includes: a training module configured to: determine a first similarity prediction value between the text and the video in the training sample set, where the text is used to characterize the retrieval conditions; determine a second similarity between the first character interaction graph of the text and the second character interaction graph of the video in the training sample set; determine the loss of the first similarity according to the first similarity prediction value and the first similarity true value; determine the loss of the second similarity according to the second similarity prediction value and the second similarity true value; determine a total loss value according to the loss of the first similarity and the loss of the second similarity in combination with a loss function; and adjust the weight parameters of the retrieval network according to the total loss value.
  • the retrieval network includes a first sub-network and a second sub-network; the first sub-network is used to determine the first similarity between the text and the video, and the second sub-network is used to determine the similarity between the first character interaction graph of the text and the second character interaction graph of the video; the training module is configured to adjust the weight parameters of the first sub-network and the second sub-network based on the total loss value.
  • a retrieval device comprising: a memory, a processor, and a computer program stored in the memory and capable of being run on the processor.
  • when the processor executes the program, the steps of the retrieval method described in the embodiments of the present disclosure are implemented.
  • a storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the steps of the retrieval method according to the embodiments of the present disclosure.
  • a computer program includes computer-readable code, and when the computer-readable code is executed in an electronic device, a processor in the electronic device executes the retrieval method according to the embodiments of the present disclosure.
  • the technical solution provided by the present disclosure determines a first similarity between a text and at least one video, where the text is used to characterize retrieval conditions; determines a first character interaction graph of the text and a second character interaction graph of the at least one video; determines a second similarity between the first character interaction graph and the second character interaction graph; and determines, from the at least one video according to the first similarity and the second similarity, the video that matches the retrieval condition.
  • the present disclosure determines the first similarity between the text and at least one video, as well as the second similarity between the first character interaction graph of the text and the second character interaction graph of the at least one video; it can therefore use information such as the grammatical structure of the text itself and the event structure of the video itself to perform video retrieval, thereby improving the accuracy of retrieving videos such as movies based on text descriptions.
  • Fig. 1 is a schematic diagram showing an overview framework of a retrieval method according to an exemplary embodiment
  • Fig. 2 is a schematic diagram showing the implementation process of a retrieval method according to an exemplary embodiment
  • Fig. 3 is a schematic diagram showing the composition structure of a retrieval device according to an exemplary embodiment.
  • although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
  • the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • Fig. 1 is a schematic diagram showing an overview framework of a retrieval method according to an exemplary embodiment.
  • the framework is used for matching video and text, such as matching movie segments and plot segments.
  • the framework includes two types of modules: an Event Flow Module (EFM) and a Character Interaction Module (CIM);
  • the event flow module is configured to explore the event structure, taking paragraph features and video features as input and outputting the direct similarity between the video and the paragraph;
  • the character interaction module is configured to use character interactions to construct the character interaction graph of the paragraph and the character interaction graph of the video respectively, and then measure the similarity between the two graphs through a graph matching algorithm.
  • the total matching score may also be a calculation result such as a weighted sum of the scores of the above two modules.
  • the embodiments of the present disclosure provide a retrieval method, which can be applied to terminal devices, servers, or other electronic devices.
  • the terminal equipment can be user equipment (UE, User Equipment), mobile equipment, cellular phones, cordless phones, personal digital assistants (PDAs, Personal Digital Assistant), handheld devices, computing devices, vehicle-mounted devices, wearable devices, and so on.
  • the processing method may be implemented by a processor invoking a computer-readable instruction stored in the memory. As shown in Figure 2, the method mainly includes:
  • Step S101 Determine a first degree of similarity between a text and at least one video, where the text is used to characterize retrieval conditions.
  • the text is a text description used to characterize the retrieval conditions.
  • the embodiment of the present disclosure does not limit the way of obtaining the text.
  • the electronic device may receive the text description input by the user in the input area, or receive the voice input by the user, and then convert the voice data into the text description.
  • the search condition includes a person's name and at least one verb that characterizes an action. For example, Jack punched himself.
  • the at least one video is located in a local or third-party video database available for retrieval.
  • the first similarity is the similarity that characterizes the direct match between the video and the text.
  • the electronic device inputs the paragraph feature of the text and the video feature of the video into the event flow module, and the event flow module outputs the similarity between the video and the text, that is, the first similarity.
  • the determining the first similarity between the text and the at least one video includes: determining paragraph features of the text, where the paragraph features include sentence features and the number of sentences; determining video features of the at least one video; and determining the first similarity between the text and the at least one video according to the paragraph features and the video features.
  • determining the paragraph feature of the text includes: the first neural network can be used to process the text to obtain the paragraph feature of the text, and the paragraph feature includes the sentence feature and the number of sentences.
  • each word corresponds to a 300-dimensional vector, and the sum of the features of each word in the sentence is the feature of the sentence.
  • the number of sentences refers to the number of periods in the text. The input text is divided into sentences with periods to obtain the number of sentences.
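  • As an illustrative sketch of the two preceding points (not the patent's exact implementation): assuming a pretrained 300-dimensional word-embedding lookup such as Word2Vec, the paragraph feature can be assembled by splitting the text at periods and summing the word vectors of each sentence; the function and embedding names below are hypothetical.

```python
import numpy as np

def paragraph_features(paragraph, embed, dim=300):
    """Split a paragraph into sentences at periods and represent each
    sentence as the sum of its word vectors, as described above.

    `embed` is an assumed dict-like lookup from word to a 300-d vector
    (e.g. pretrained Word2Vec); unknown words fall back to zeros.
    Returns the stacked sentence features and the number of sentences.
    """
    sentences = [s.strip() for s in paragraph.split('.') if s.strip()]
    feats = []
    for sent in sentences:
        vecs = [embed.get(w.lower(), np.zeros(dim)) for w in sent.split()]
        feats.append(np.sum(vecs, axis=0) if vecs else np.zeros(dim))
    return np.stack(feats), len(sentences)
```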
  • determining the video feature of the video includes: processing the video using a second neural network; specifically, first decoding the video into a picture stream and then obtaining the video feature based on the picture stream, where the video feature includes shot features and the number of shots.
  • the shot feature is obtained by passing 3 key-frame pictures of the shot through the neural network to obtain 3 2348-dimensional vectors and then taking their average.
  • a shot refers to a continuous picture taken by the same camera at the same camera position in the video; if the picture is switched, it is another shot. The number of shots is obtained by an existing shot segmentation algorithm.
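  • A minimal sketch of the shot-feature computation described above, assuming the shot boundaries are already given; `cnn_feature` is a hypothetical image-feature extractor, since the text does not name a specific network.

```python
import numpy as np

def video_features(shots, cnn_feature):
    """Represent each shot by the average of the features of its 3 key
    frames, as described above.

    `shots` is a list of shots, each a list of 3 key-frame images;
    `cnn_feature` is an assumed callable mapping an image to a fixed-length
    vector. Returns the stacked shot features and the number of shots.
    """
    feats = []
    for keyframes in shots:
        vecs = [cnn_feature(img) for img in keyframes]
        feats.append(np.mean(vecs, axis=0))
    return np.stack(feats), len(shots)
```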
  • in this way, the first similarity is determined by analyzing the paragraph features of the text and the video features of the video, which provides a basis for subsequently determining videos that match the retrieval conditions; using information such as the grammatical structure of the text and the event structure of the video itself for video retrieval can improve the accuracy of retrieving videos based on text descriptions.
  • the calculation formula of the first similarity is:
  • the constraint condition of the first similarity calculation formula includes:
  • Each shot can be assigned to 1 sentence at most;
  • the sentence assigned to a shot with an earlier sequence number must not come after the sentence assigned to a shot with a later sequence number; that is, the assignment is monotonic.
  • formula (3) is the optimization objective; "s.t." is the abbreviation of "such that", introducing formulas (4) and (5), which express the constraints of formula (3); y_i denotes the i-th row vector of Y, and the associated operator returns the index of the first non-zero entry of a Boolean vector.
  • Y is a matrix, 1 is the all-ones vector, and Y1 is the product of the matrix Y and the vector 1.
  • the solution of the optimization problem can be obtained through a traditional dynamic programming algorithm; specifically, the optimal Y can be solved by dynamic programming, thereby obtaining the value of the first similarity.
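  • Formulas (3)-(5) are not reproduced in this text, so the following is only a hedged sketch of the kind of monotonic shot-to-sentence assignment the constraints above describe, solved by dynamic programming; the shot-sentence similarity matrix `sim` (e.g. dot products of shot and sentence features) and the unnormalized score it returns are illustrative assumptions, not the patent's exact formula.

```python
import numpy as np

def event_flow_score(sim):
    """Best total similarity of a monotonic assignment of shots to sentences.

    `sim[i, j]` is the similarity between shot i and sentence j.  Each shot
    is assigned to at most one sentence, and earlier shots may not be
    assigned to later sentences than subsequent shots (the constraints
    described above).  Solved by dynamic programming.
    """
    n_shots, n_sents = sim.shape
    dp = np.zeros((n_shots + 1, n_sents + 1))
    for i in range(1, n_shots + 1):
        for j in range(1, n_sents + 1):
            dp[i, j] = max(
                dp[i, j - 1],                      # shot i-1 uses only sentences before j-1
                dp[i - 1, j],                      # shot i-1 left unassigned
                dp[i - 1, j] + sim[i - 1, j - 1],  # shot i-1 assigned to sentence j-1
            )
    return dp[n_shots, n_sents]
```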
  • paragraph features and video features may be weighted or proportionally calculated to obtain the first similarity.
  • Step S102 Determine the first character interaction diagram of the text and the second character interaction diagram of the at least one video.
  • the character interaction graph is a graph used to characterize the character relationship and action relationship between characters, including character nodes and action nodes.
  • one text corresponds to one first character interaction diagram
  • one video corresponds to one second character interaction diagram
  • the determining the first character interaction graph of the text includes: detecting the name of the person contained in the text; searching a database for the portrait of the person corresponding to the name, and extracting the image feature of the portrait to obtain the character node of the person; parsing and determining the semantic tree of the text, and obtaining the movement feature of the person based on the semantic tree to obtain the action node of the person; and connecting the character node corresponding to each person to the corresponding action node.
  • the database is a library pre-stored with a large number of correspondences between names and portraits, and the portraits are portraits of people corresponding to the names.
  • Portrait data can be crawled from the Internet, for example, portrait data can be crawled from the imdb website and tmdb website.
  • the character node of the character is represented by the image feature of the portrait; the action node of the character is represented by the motion feature in the semantic tree.
  • parsing and determining the semantic tree of the text includes: parsing and determining the semantic tree of the text through a dependency parsing algorithm; for example, a dependency parsing algorithm is used to divide each sentence into words, and then, according to certain linguistic rules, the words are used as nodes to build the semantic tree.
  • the features of the two nodes connected by an edge are concatenated as the feature of the edge.
  • the features of the two nodes connected by the edge can be represented as two vectors, and the two vectors are concatenated (their dimensions are added) to obtain the feature of the edge.
  • for example, a 3-dimensional vector and a 4-dimensional vector are directly concatenated into a 7-dimensional vector: concatenating [1,3,4] and [2,5,3,6] gives [1,3,4,2,5,3,6].
  • the feature of the Word2Vec word vector processed by the neural network can be used as the characterization of the action node, that is, as the movement feature of the character.
  • the pronouns in the text are replaced with the names of the persons represented by the pronouns; specifically, all names (such as "Jack") are detected by a name detection tool (such as the Stanford name detection toolkit), and then each pronoun is replaced with the name of the person it refers to through a co-reference resolution tool (for example, the pronoun in "Jack hits himself" is resolved to "Jack").
  • a portrait of a person corresponding to the person's name is searched in a database based on the person's name, and image features of the portrait are extracted through a neural network; wherein the image features include face and body features.
  • each node on the semantic tree is a word in the sentence, such as a noun, pronoun, or verb.
  • the verbs in the sentence serve as the movement features of the characters, that is, the action nodes, and the names corresponding to the nouns or pronouns serve as the character nodes.
  • the image feature of a character's portrait is attached to that character node; according to the semantic tree and the names, the character node corresponding to each name is connected to the action node of that name, and if multiple names point to the same action node, those names are connected to each other by edges.
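  • Putting the above together, a minimal sketch of building the first (text-side) character interaction graph; the pre-parsed input format, the `portrait_feature` and `verb_feature` lookups, and the use of networkx are illustrative assumptions rather than the patent's exact implementation.

```python
import networkx as nx

def build_text_graph(parsed_sentences, portrait_feature, verb_feature):
    """Construct the first character interaction graph from parsed text.

    `parsed_sentences` is assumed to be a list of (names, verb) pairs, where
    `names` are the person names attached to that verb in the semantic tree
    (pronouns already resolved).  `portrait_feature(name)` returns the image
    feature of the portrait found in the database, and `verb_feature(verb)`
    returns the Word2Vec-based movement feature.
    """
    g = nx.Graph()
    for idx, (names, verb) in enumerate(parsed_sentences):
        action = ('action', idx, verb)
        g.add_node(action, feat=verb_feature(verb))                   # action node
        for name in names:
            g.add_node(('char', name), feat=portrait_feature(name))  # character node
            g.add_edge(('char', name), action)                       # character -> its action
        # character nodes sharing the same action node are connected to each other
        for a in names:
            for b in names:
                if a < b:
                    g.add_edge(('char', a), ('char', b))
    # edge features (concatenation of the two endpoint features) are omitted for brevity
    return g
```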
  • the determining the second character interaction diagram of the at least one video includes:
  • a shot refers to a continuous picture taken by the same camera at the same camera position in the video; if the picture is switched, it is another shot. The number of shots is obtained by an existing shot segmentation algorithm.
  • the human body features are the face and body features of a person; the human body features of the persons in the images can be obtained by passing the images corresponding to the shot through a trained model.
  • the movement feature is the motion feature of a person in the images, obtained by inputting the images corresponding to the shot into a trained model, for example the recognized action (such as drinking water) of the person in the current image.
  • the determining the second character interaction graph of the at least one video further includes: if a group of characters appear in a shot at the same time, connecting the character nodes of the characters in that group in pairs; and connecting the character node of a character in one shot with the character node of each character in the adjacent shots.
  • an adjacent shot refers to the shot immediately before or immediately after the current shot.
  • the features of the two nodes connected by an edge are concatenated as the feature of the edge.
  • for the determination process of this edge feature, refer to the method for determining the edge feature in the first character interaction graph, which is not repeated here.
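  • A companion sketch for the second (video-side) character interaction graph under the same illustrative assumptions; `detected_shots`, `body_feature`, and `motion_feature` are hypothetical names for the per-shot person detections and the trained feature extractors mentioned above.

```python
import networkx as nx

def build_video_graph(detected_shots, body_feature, motion_feature):
    """Construct the second character interaction graph from a video.

    `detected_shots` is assumed to be a list (one entry per shot) of the
    person identities detected in that shot; `body_feature(t, p)` and
    `motion_feature(t, p)` return the face/body and movement features of
    person p in shot t.
    """
    g = nx.Graph()
    for t, people in enumerate(detected_shots):
        for p in people:
            char, act = ('char', t, p), ('action', t, p)
            g.add_node(char, feat=body_feature(t, p))   # human body features on the character node
            g.add_node(act, feat=motion_feature(t, p))  # movement features on the action node
            g.add_edge(char, act)
        # characters appearing in the same shot are connected in pairs
        for a in people:
            for b in people:
                if a < b:
                    g.add_edge(('char', t, a), ('char', t, b))
        # characters are also connected to every character in the adjacent (previous) shot
        if t > 0:
            for a in detected_shots[t - 1]:
                for b in people:
                    g.add_edge(('char', t - 1, a), ('char', t, b))
    return g
```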
  • Step S103 Determine a second degree of similarity between the first character interaction image and the second character interaction image.
  • the second similarity represents the similarity obtained by matching the first character interaction graph against the second character interaction graph.
  • the electronic device inputs the text and the video into the character interaction module; the character interaction module constructs the first character interaction graph from the text and the second character interaction graph from the video, then uses a graph matching algorithm to measure the similarity between the two graphs and outputs that similarity, that is, the second similarity.
  • the calculation formula of the second degree of similarity is:
  • u is a binary (Boolean) vector; u_{ia} = 1 indicates that the i-th node in V_p and the a-th node in V_q can be matched, and u_{ia} = 0 indicates that they cannot be matched.
  • V_p is the set of nodes and E_p is the set of edges of the first character interaction graph; V_p consists of two types of nodes: the action nodes and the character nodes of the first character interaction graph.
  • V_q is the set of nodes and E_q is the set of edges of the second character interaction graph; V_q likewise consists of two types of nodes: the action nodes and the character nodes of the second character interaction graph.
  • the similarity can be obtained through the dot product processing based on the features corresponding to the nodes or edges.
  • the constraint condition of the second similarity calculation formula includes:
  • a node can only be matched to at most one node in another set
  • the match must be one-to-one, that is, each node is matched to at most one node in the other set.
  • nodes of different types cannot be matched; for example, a character node cannot be matched to an action node in the other set.
  • the second degree of similarity can also be obtained through other calculation methods, for example, performing a weighted average calculation on the matched node features and action features.
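  • Since formula (7) is not reproduced here, the following is only an approximate sketch of the one-to-one, type-constrained node matching described above, using the Hungarian algorithm from SciPy on dot-product node similarities; the edge-matching term of the full formulation is omitted, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def graph_match_score(nodes_p, nodes_q):
    """Approximate second similarity from node-level graph matching.

    `nodes_p` / `nodes_q` are assumed lists of (node_type, feature_vector)
    taken from the two character interaction graphs.  Node similarity is a
    dot product; pairs of different types (character vs. action) are not
    allowed to match, and each node matches at most one node in the other set.
    """
    sim = np.full((len(nodes_p), len(nodes_q)), -np.inf)
    for i, (tp, fp) in enumerate(nodes_p):
        for a, (tq, fq) in enumerate(nodes_q):
            if tp == tq:                      # only same-type nodes may match
                sim[i, a] = float(np.dot(fp, fq))
    # Hungarian algorithm: one-to-one assignment maximizing total similarity
    rows, cols = linear_sum_assignment(np.where(np.isfinite(sim), sim, -1e9),
                                       maximize=True)
    # forbidden or unhelpful pairs contribute nothing (a node may stay unmatched)
    return sum(max(sim[r, c], 0.0) for r, c in zip(rows, cols) if np.isfinite(sim[r, c]))
```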
  • Step S104 According to the first degree of similarity and the second degree of similarity, a video that matches the retrieval condition is determined from the at least one video.
  • the determining, from the at least one video, a video that matches the retrieval condition according to the first similarity and the second similarity includes: weighting and summing the first similarity and the second similarity of each video to obtain a similarity value of each video; and determining the video with the highest similarity value as the video that matches the retrieval condition.
  • the weights are determined on a validation set in the database: based on feedback from the final retrieval results, the weights can be adjusted on the validation set to obtain a set of optimal weights, which can then be used directly on the test set or in actual retrieval.
  • determining the video with the highest similarity value as the video that matches the retrieval conditions can improve the accuracy of retrieving videos based on text descriptions.
  • the first similarity and the second similarity can also be directly added to obtain the similarity corresponding to each video.
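  • A small sketch of this final ranking step; the equal default weights are purely illustrative and, as noted above, would in practice be tuned on a validation set.

```python
def pick_best_video(first_sims, second_sims, w1=0.5, w2=0.5):
    """Weighted sum of the two similarities per video, then pick the best.

    `first_sims[k]` / `second_sims[k]` are the first and second similarities
    of the k-th candidate video; `w1` and `w2` are the (assumed) weights.
    Returns the index of the matching video and its similarity value.
    """
    scores = [w1 * s1 + w2 * s2 for s1, s2 in zip(first_sims, second_sims)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```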
  • the retrieval method is implemented by a retrieval network
  • the training method of the retrieval network includes: determining a first similarity prediction value between a text and a video in a training sample set, where the text is used to characterize the retrieval condition; determining a second similarity between the first character interaction graph of the text and the second character interaction graph of the video in the training sample set; determining the loss of the first similarity according to the first similarity prediction value and the first similarity true value; determining the loss of the second similarity according to the second similarity prediction value and the second similarity true value; determining a total loss value according to the loss of the first similarity and the loss of the second similarity in combination with a loss function; and adjusting the weight parameters of the retrieval network according to the total loss value.
  • the retrieval framework corresponding to the retrieval network has different constituent modules, and different types of neural networks can be used in each module.
  • the retrieval framework is a framework composed of an event flow module and a character relationship module.
  • the retrieval network includes a first sub-network and a second sub-network; the first sub-network is used to determine the first similarity between the text and the video, and the second sub-network is used to determine The similarity between the first character interaction image of the text and the second character interaction image of the video.
  • the text and the video are input into the first sub-network, which outputs the first similarity prediction value between the text and the video; the text and the video are input into the second sub-network, which outputs the prediction value of the similarity between the first character interaction graph of the text and the second character interaction graph of the video, that is, the second similarity prediction value.
  • the loss of the first similarity can be obtained from the difference between the first similarity prediction value and the first similarity true value; the loss of the second similarity can be obtained from the difference between the second similarity prediction value and the second similarity true value; according to the loss of the first similarity and the loss of the second similarity, combined with the loss function, the network parameters of the first sub-network and the second sub-network are adjusted.
  • a data set is constructed, which contains the summary of 328 movies, and the annotation associations between summary paragraphs and movie fragments.
  • the data set not only provides a high-quality detailed summary for each movie, but also associates each paragraph of the summary with movie segments through manual annotation; here, each movie segment may last several minutes and captures a complete event.
  • These movie fragments, together with related summary paragraphs, can allow people to analyze on a larger scope and a higher semantic level.
  • the present disclosure uses a framework including an event flow module and a character interaction module to perform matching between movie fragments and summary paragraphs. Compared with traditional feature-based matching methods, this framework can significantly improve the matching accuracy, while also revealing the importance of narrative structure and character interaction in film understanding.
  • the adjusting the weight parameter of the retrieval network according to the total loss value includes:
  • the loss function is expressed as:
  • ⁇ efm represents the model parameters embedded in the network in the event flow module
  • ⁇ cim represents the model parameters embedded in the network in the character interaction module.
  • Y is the binary matrix defined by the event flow module
  • u is the binary vector of the character interaction module
  • formula (12) expresses the training objective through a minimization over the model parameters.
  • Y * is the Y that maximizes the value of formula (3), which is also called the optimal solution.
  • u * is the u that maximizes the formula (7).
  • S(Q_i, P_j) denotes the similarity between the i-th video Q_i and the j-th paragraph P_j
  • S(Q_i, P_i) denotes the similarity between the i-th video Q_i and the i-th paragraph P_i
  • S(Q_j, P_i) denotes the similarity between the j-th video Q_j and the i-th paragraph P_i
  • α is a parameter of the loss function, representing the minimum required similarity margin.
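  • The exact formula (12) is not reproduced in this text; the sketch below shows a standard bidirectional margin ranking loss that is consistent with the quantities described above (matched similarities S(Q_i, P_i) pushed above mismatched S(Q_i, P_j) and S(Q_j, P_i) by a margin α). It is an assumption for illustration, not the patent's verbatim loss.

```python
import torch

def ranking_loss(S, alpha=0.2):
    """Bidirectional margin ranking loss over a batch of similarities.

    `S` is an (N, N) tensor where S[i, j] is the similarity between video
    Q_i and paragraph P_j (e.g. the combined score of the two modules).
    Each matched pair S[i, i] is pushed above every mismatched S[i, j]
    and S[j, i] by at least the margin `alpha`.
    """
    n = S.size(0)
    diag = S.diag().view(n, 1)
    cost_p = torch.clamp(alpha + S - diag, min=0)      # compare S(Q_i, P_j) with S(Q_i, P_i)
    cost_q = torch.clamp(alpha + S - diag.t(), min=0)  # compare S(Q_j, P_i) with S(Q_i, P_i)
    mask = 1.0 - torch.eye(n, device=S.device)         # ignore the matched diagonal terms
    return ((cost_p + cost_q) * mask).sum() / n
```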
  • the applicable retrieval scenarios include movie segment retrieval, TV drama segment retrieval, short video retrieval, and the like.
  • the retrieval method proposed in the embodiments of the present disclosure determines a first similarity between a text and at least one video, where the text is used to characterize retrieval conditions; determines the first character interaction graph of the text and the second character interaction graph of the at least one video; determines a second similarity between the first character interaction graph and the second character interaction graph; and, according to the first similarity and the second similarity, determines a video that matches the retrieval condition from the at least one video.
  • by determining the first similarity between the text and at least one video and the second similarity between the first character interaction graph of the text and the second character interaction graph of the at least one video, the present disclosure solves the problem that traditional feature-based retrieval algorithms do not use the grammatical structure of the text itself or the event structure of the video itself; by adopting event-flow matching together with matching based on character interaction graphs, the accuracy of retrieving videos based on text descriptions can be improved.
  • an embodiment of the present disclosure provides a retrieval device.
  • the device includes: a first determining module 10 configured to determine a first similarity between a text and at least one video, where the text is used to characterize the retrieval conditions; a second determining module 20 configured to determine the first character interaction graph of the text and the second character interaction graph of the at least one video, and to determine a second similarity between the first character interaction graph and the second character interaction graph; and a processing module 30 configured to determine, from the at least one video according to the first similarity and the second similarity, a video that matches the retrieval conditions.
  • the first determining module 10 is configured to: determine the paragraph feature of the text; determine the video feature of the at least one video; according to the paragraph feature of the text and the feature of the at least one video The video feature determines the first similarity between the text and the at least one video.
  • the paragraph features include sentence features and the number of sentences; the video features include shot features and the number of shots.
  • the second determining module 20 is configured to: detect the name of the person contained in the text; search a database for the portrait of the person corresponding to the name, and extract the image feature of the portrait to obtain the character node of the person; parse and determine the semantic tree of the text, and obtain the movement feature of the person based on the semantic tree to obtain the action node of the person; and connect the character node and the action node corresponding to each person; wherein the character node of the person is characterized by the image feature of the portrait, and the action node of the person is characterized by the motion feature in the semantic tree.
  • the second determining module 20 is further configured to connect the character nodes connected to the same action node to each other.
  • the second determining module 20 is configured to replace the pronoun in the text with the name of the person represented by the pronoun.
  • the second determining module 20 is configured to: detect persons in each shot of the at least one video; extract the human body features and motion features of each person; attach the human body features of the person to the character node of the person, and attach the movement features of the person to the action node of the person; and connect the character node and the action node corresponding to each person.
  • the second determining module 20 is further configured to: regard characters appearing in the same shot at the same time as the same group of characters, and connect the character nodes of the characters in the same group in pairs.
  • the second determining module 20 is further configured to connect a character in a shot with the character node of each character in the adjacent shot.
  • the processing module 30 is configured to: weight and sum the first similarity and the second similarity of each video to obtain the similarity value of each video; and determine the video with the highest similarity value as the video that matches the retrieval condition.
  • the retrieval device is implemented by a retrieval network, and the device further includes: a training module 40 configured to: determine the first similarity prediction value between the text and the video in the training sample set, where the text is used to characterize retrieval conditions; determine the second similarity between the first character interaction graph of the text and the second character interaction graph of the video in the training sample set; determine the loss of the first similarity according to the first similarity prediction value and the first similarity true value; determine the loss of the second similarity according to the second similarity prediction value and the second similarity true value; determine a total loss value according to the loss of the first similarity and the loss of the second similarity in combination with a loss function; and adjust the weight parameters of the retrieval network according to the total loss value.
  • the retrieval network includes a first sub-network and a second sub-network; the first sub-network is used to determine the first similarity between the text and the video, and the second sub-network is used to determine the similarity between the first character interaction graph of the text and the second character interaction graph of the video; the training module 40 is configured to adjust the weight parameters of the first sub-network and the second sub-network based on the total loss value.
  • each processing module in the retrieval device shown in FIG. 3 can be understood with reference to the relevant description of the aforementioned retrieval method.
  • the function of each processing unit in the retrieval device shown in FIG. 3 can be implemented by a program running on a processor, or can be implemented by a specific logic circuit.
  • the specific structures of the above-mentioned first determination module 10, second determination module 20, processing module 30, and training module 40 can all correspond to processors.
  • the specific structure of the processor may be a central processing unit (CPU, Central Processing Unit), a microprocessor (MCU, Micro Controller Unit), a digital signal processor (DSP, Digital Signal Processing), or a programmable logic device (PLC, Programmable Logic Controller) and other electronic components or collections of electronic components with processing functions.
  • the processor includes executable code
  • the executable code is stored in a storage medium
  • the processor may be connected to the storage medium through a communication interface such as a bus.
  • the retrieval device provided by the embodiments of the present disclosure can improve the accuracy of retrieving videos based on text.
  • the embodiment of the present disclosure also records a retrieval device.
  • the device includes a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when the processor executes the program, it implements the retrieval method provided by any one of the foregoing technical solutions.
  • the processor executes the program, it realizes: determining the first similarity between the text and at least one video, the text is used to characterize the retrieval condition; determining the first character interaction image of the text And the second character interaction picture of the at least one video; determine the second similarity between the first character interaction picture and the second character interaction picture; according to the first similarity and the second similarity Degree, the video that matches the retrieval condition is determined from the at least one video.
  • the processor executes the program, it is realized that: the determining the first similarity between the text and the at least one video includes: determining the paragraph feature of the text; determining the value of the at least one video Video feature; determining the first similarity between the text and the at least one video according to the paragraph feature of the text and the video feature of the at least one video.
  • when the processor executes the program, it realizes: detecting the name of the person contained in the text; searching a database for the portrait of the person corresponding to the name, and extracting the image feature of the portrait to obtain the character node of the person; parsing and determining the semantic tree of the text, and obtaining the movement feature of the person based on the semantic tree to obtain the action node of the person; and connecting the character node and the action node corresponding to each person; wherein the character node is characterized by the image feature of the portrait, and the action node of the person is characterized by the motion feature in the semantic tree.
  • the processor executes the program, it realizes that: the role nodes connected to the same action node are connected to each other.
  • the processor executes the program, it implements: replacing pronouns in the text with the name of the person represented by the pronoun.
  • when the processor executes the program, it realizes: detecting persons in each shot of the at least one video; extracting the human body features and motion features of each person; attaching the human body features of the person to the character node of the person, and attaching the movement features of the person to the action node of the person; and connecting the character node corresponding to each person to the action node.
  • a group of characters appearing in a shot at the same time are regarded as the same group of characters, and the character nodes of the characters in the same group of characters are connected in pairs.
  • the processor executes the program, it realizes that: a character in one shot is connected with the character node of each character in the adjacent shot.
  • the processor executes the program, it implements: weighted summation of the first similarity and the second similarity of each video to obtain the similarity value of each video; The video with the highest degree value is determined as the video that matches the retrieval condition.
  • the processor executes the program, it realizes: determining the first similarity prediction value between the text and the video in the training sample set, the text is used to characterize the retrieval condition; determining the first similarity of the text The second similarity between a character interaction image and the second character interaction image of the video in the training sample set; the first similarity is determined according to the first similarity prediction value and the first similarity true value The loss of degree; the loss of the second degree of similarity is determined according to the predicted value of the second similarity and the true value of the second degree of similarity; the loss of the second degree of similarity is determined according to the loss of the first degree of similarity and the loss of the second degree of similarity , Determine the total loss value in combination with the loss function; adjust the weight parameter of the retrieval network according to the total loss value.
  • the processor executes the program, it implements: adjusting the weight parameters of the first sub-network and the second sub-network based on the total loss value.
  • the retrieval device provided by the embodiments of the present disclosure can improve the accuracy of retrieving videos based on text descriptions.
  • the embodiments of the present disclosure also record a computer storage medium in which computer-executable instructions are stored, and the computer-executable instructions are used to execute the retrieval methods described in each of the foregoing embodiments.
  • the computer storage medium may be a volatile computer-readable storage medium or a non-volatile computer-readable storage medium.
  • the embodiments of the present disclosure also provide a computer program product, which includes computer-readable code, and when the computer-readable code runs on the device, the processor in the device executes the retrieval method provided in any of the above embodiments.
  • the above-mentioned computer program product can be specifically implemented by hardware, software, or a combination thereof.
  • the computer program product is specifically embodied as a computer storage medium.
  • the computer program product is specifically embodied as a software product, such as a software development kit (SDK), and so on.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional units in the embodiments of the present disclosure can all be integrated into one processing unit, or each unit can be used individually as a unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • all or part of the steps of the foregoing method embodiments can be implemented by hardware related to program instructions; the foregoing program can be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments; the foregoing storage medium includes media that can store program code, such as removable storage devices, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks, or optical disks.
  • if the aforementioned integrated unit of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.
  • the technical solution provided by the embodiments of the present disclosure determines a first similarity between a text and at least one video, where the text is used to characterize a retrieval condition; determines a first character interaction graph of the text and a second character interaction graph of the at least one video; determines a second similarity between the first character interaction graph and the second character interaction graph; and, according to the first similarity and the second similarity, identifies a video that matches the retrieval condition from the at least one video.
  • in this way, compared with a traditional feature-based retrieval algorithm, the present disclosure can use information such as the grammatical structure of the text itself and the event structure of the video itself to perform video retrieval, thereby improving the accuracy of retrieving videos such as movies based on text descriptions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

一种检索方法、检索装置、存储介质、以及计算机程序,其中,所述检索方法包括:确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件(S101);确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图(S102);确定所述第一人物互动图和所述第二人物互动图之间的第二相似度(S103);根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频(S104)。

Description

检索方法及装置、存储介质
相关申请的交叉引用
本公开基于申请号为201910934892.5、申请日为2019年09月29日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。
技术领域
本公开涉及计算机视觉技术领域,具体涉及一种检索方法及装置、存储介质。
背景技术
在现实生活中,根据一段文字描述,在视频数据库中检索符合文字描述的视频这项功能有着广泛的需求。传统的检索方法通常将文字编码为词向量,同时将视频编码成视频特征向量。
发明内容
本公开提供一种检索方法的技术方案。
根据本公开的第一方面,提供了一种检索方法,所述方法包括:确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
如此,相对于传统的基于特征的检索算法,本公开通过确定文本和至少一个视频之间的第一相似度,所述文本的第一人物互动图和所述至少一个视频的第二人物互动图之间的第二相似度,可以利用文字本身的语法结构以及视频本身的事件结构等信息,进行视频检索,从而能提高根据文本描述检索视频如电影的准确率。
在一种可能的实现方式中,所述确定文本和至少一个视频之间的第一相似度,包括:确定所述文本的段落特征;确定所述至少一个视频的视频特征;根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
如此,通过分析文本的段落特征和视频的视频特征来确定第一相似度,可得到视频和文本直接匹配的相似度,为后续确定与检索条件相匹配的视频提供参考依据。
在一种可能的实现方式中,所述段落特征包括句子特征和句子的数量;所述视频特征包括镜头特征和镜头的数量。
如此,通过将句子特征和句子的数量作为文本的段落特征,将镜头特征和镜头的数量作为视频的视频特征,对文本和视频进行了量化,进而能够为分析文本的段落特征和视频的视频特征提供分析依据。
在一种可能的实现方式中,所述确定所述文本的第一人物互动图,包括:检测所述文本中包含的人名;在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;将每个所述人物对应的角色节点和动作节点连接;其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
如此,文本中的句子通常遵循与事件中的情景相似的顺序,每一段文本都描述了视频中的一个事件,通过构建文本的人物交互图来捕捉视频的叙事结构,为后续确定与检索条件相匹配的视频提供参考依据。
在一种可能的实现方式中,所述方法还包括:将连接同一动作节点的角色节点相互连接。
如此,有助于更好地构建文本的人物交互图,进而更好地捕捉视频的叙事结构。
在一种可能的实现方式中,所述检测所述文本中包含的人名,包括:将所述文本中的代词替换为所述代词所代表的所述人名。
如此,防止漏掉文本中用非人名表示的人物,能够对文本中描述的所有人物进行分析,进而提高确定文本的人物互动图的准确率。
在一种可能的实现方式中,所述确定所述至少一个视频的第二人物互动图,包括:检测出所述至少一个视频的每个镜头中的人物;提取所述人物的人体特征与运动特征;将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的动作节点上;将每个人物对应的角色节点和动作节点相连。
如此,由于人物之间的相互作用经常在文本中描述,角色之间的互动在视频故事中扮演着重要的角色,为了结合这一点,本公开提出了一个基于图表表示的人物交互图,通过确定视频的人物交互图和文本的人物交互图之间的相似度,为后续确定与检索条件相匹配的视频提供参考依据。
在一种可能的实现方式中,所述确定所述至少一个视频的第二人物互动图,还包括:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
如此,有助于更好地构建视频的人物交互图,进而更好地捕捉视频的叙事结构。
在一种可能的实现方式中,所述确定所述至少一个视频的第二人物互动图,还包括:将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
如此,有助于更好地构建视频的人物交互图,进而更好地捕捉视频的叙事结构。
在一种可能的实现方式中,所述根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频,包括:对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
如此,结合第一相似度和第二相似度来确定与检索条件相匹配的视频,能提高根据文本描述检索视频的准确率。
在一种可能的实现方式中,所述检索方法通过检索网络实现,所述方法还包括:确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;确定所述文本的第一人 物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;根据所述总损失值调整所述检索网络的权重参数。
如此,通过检索网络实现检索,有助于快速检索出与文本描述相匹配的视频。
在一种可能的实现方式中,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定所述文本的第一人物互动图和所述视频的第二人物互动图之间的相似度;所述根据所述总损失值调整所述检索网络的权重参数,包括:基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
如此,通过不同的子网络分别确定不同的相似度,有助于快速得到与检索条件相关的第一相似度和第二相似度,进而能够快速检索出与检索条件相适应的视频。
根据本公开的第二方面,提供了一种检索装置,所述装置包括:第一确定模块,被配置为确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;第二确定模块,被配置为确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;处理模块,被配置为根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
在一种可能的实现方式中,所述第一确定模块,被配置为:确定所述文本的段落特征;确定所述至少一个视频的视频特征;根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
在一种可能的实现方式中,所述段落特征包括句子特征和句子的数量;所述视频特征包括镜头特征和镜头的数量。
在一种可能的实现方式中,所述第二确定模块,被配置为:检测所述文本中包含的人名;在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;将每个所述人物对应的角色节点和动作节点连接;其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
在一种可能的实现方式中,所述第二确定模块,还被配置为:将连接同一动作节点的角色节点相互连接。
在一种可能的实现方式中,所述第二确定模块,被配置为:将所述文本中的代词替换为所述代词所代表的所述人名。
在一种可能的实现方式中,所述第二确定模块,被配置为:检测出所述至少一个视频的每个镜头中的人物;提取所述人物的人体特征与运动特征;将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的动作节点上;将每个人物对应的角色节点和动作节点相连。
在一种可能的实现方式中,所述第二确定模块,还被配置为:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
在一种可能的实现方式中,所述第二确定模块,还被配置为:将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
在一种可能的实现方式中,所述处理模块,被配置为:对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
在一种可能的实现方式中,所述检索装置通过检索网络实现,所述装置还包括:训练模块,被配置为:确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;根据所述总损失值调整所述检索网络的权重参数。
在一种可能的实现方式中,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定文本的第一人物互动图和所述视频的第二人物互动图之间的相似度;所述训练模块,被配置为:基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
根据本公开的第三方面,提供了一种检索装置,所述装置包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现本公开实施例所述的检索方法的步骤。
根据本公开的第四方面,提供了一种存储介质,所述存储介质存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行本公开实施例所述的检索方法的步骤。
根据本公开的第五方面,提供了一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行用于实现本公开实施例所述的检索方法。
本公开提供的技术方案,确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。如此,相对于传统的基于特征的检索算法,本公开通过确定文本和至少一个视频之间的第一相似度,所述文本的第一人物互动图和所述至少一个视频的第二人物互动图之间的第二相似度,可以利用文字本身的语法结构以及视频本身的事件结构等信息,进行视频检索,从而能提高根据文本描述检索视频如电影的准确率。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例, 并与说明书一起用于说明本公开的技术方案。
图1是根据一示例性实施例示出的检索方法概述框架示意图;
图2是根据一示例性实施例示出的一种检索方法的实现流程示意图;
图3是根据一示例性实施例示出的一种检索装置的组成结构示意图。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开实施例的一些方面相一致的装置和方法的例子。
在本公开实施例使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开实施例。在本公开实施例和所附权利要求书中所使用的单数形式的“一种”、“一个”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本公开实施例可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开实施例范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”及“若”可以被解释成为“在……时”或“当……时”或“响应于确定”。
下面结合附图和具体实施例对本公开的检索方法进行详细阐述。
图1是根据一示例性实施例示出的检索方法概述框架示意图,该框架用于匹配视频和文本,如匹配电影节段和剧情片段。该框架包括两类模块:事件流模块(EFM,Event Flow Module)和人物交互模块(CIM,Character Interaction Module);事件流模块被配置为探索事件流的事件结构,以段落特征和视频特征为输入,输出视频和段落直接的相似度;人物交互模块被配置为利用人物交互,分别构建段落中的人物互动图和视频中的人物互动图,再通过图匹配算法衡量二图之间的相似度。
给定一个查询文本P和一个候选视频Q，上述两个模块分别产生P和Q之间的相似度得分，分别表示为 $S_{\mathrm{efm}}(P,Q)$ 和 $S_{\mathrm{cim}}(P,Q)$，然后将总匹配分数 $S(P,Q)$ 定义为它们的和：
$S(P,Q)=S_{\mathrm{efm}}(P,Q)+S_{\mathrm{cim}}(P,Q)$　　　　式(1)
具体如何求解 $S_{\mathrm{efm}}(P,Q)$ 和 $S_{\mathrm{cim}}(P,Q)$ 将在下文中详细描述。
当然,在其他实施例中,总匹配分数也可以是上述两个模块得分的加权和等运算结果。
本公开实施例提供一种检索方法,此检索方法可应用于终端设备、服务器或其他电子设备。其中,终端设备可以为用户设备(UE,User Equipment)、移动设备、蜂窝电话、无绳电话、个人数字处理(PDA,Personal Digital Assistant)、手持设备、计算设备、车载设备、可穿戴设备等。在一些可能的实现方式中,该处理方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。 如图2所示,所述方法主要包括:
步骤S101、确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件。
这里,所述文本是用于表征检索条件的一段文字描述。本公开实施例对获取文本的方式不作限定。例如,电子设备可以接收用户在输入区输入的文字描述,或者,接收用户在语音输入,然后将语音数据转换成文字描述。
这里,所述检索条件包括人名和至少一个表征动作的动词。例如,杰克打了他自己一拳。
这里,所述至少一个视频位于可供检索的本地或第三方视频数据库中。
这里,所述第一相似度是表征视频和文本直接匹配的相似度。
在一个例子中,电子设备将文本的段落特征和视频的视频特征输入到事件流模块,由事件流模块输出视频和文本的相似度,即第一相似度。
在一些可选实现方式中,所述确定文本和至少一个视频之间的第一相似度,包括:
确定所述文本的段落特征,所述段落特征包括句子特征和句子的数量;
确定所述至少一个视频的视频特征,所述视频特征包括镜头特征和镜头的数量;
根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
在一些例子中,确定文本的段落特征,包括:可以利用第一神经网络对文本进行处理,得到文本的段落特征,所述段落特征包括句子特征和句子的数量。例如,每个单词对应一个300维的向量,将句子中每个单词的特征加起来就是句子的特征。句子数量是指文本中的句号的数量,将输入的文本用句号将句子分割开,得到句子的数量。
在一些例子中,确定视频的视频特征,包括:可以利用第二神经网络对视频进行处理,具体地,先将视频解码成图片流,然后基于图片流得到视频特征;所述视频特征包括镜头特征和镜头的数量。例如,镜头特征是将镜头的3张关键帧的图片通过神经网络得到3个2348维的向量,再取平均。一个镜头是指视频中同一摄像机在同一机位拍摄的连续画面,如果画面切换则是另一个镜头,按照现有的镜头切割算法来得到镜头的数量。
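A minimal sketch of how the paragraph and video features described above could be assembled is given below. The 300-dimensional word vectors and 2348-dimensional keyframe vectors follow the example in the text; the `word_vector` and `keyframe_feature` helpers are hypothetical stand-ins for whatever embedding networks are actually used, and splitting on periods is only the simple sentence segmentation mentioned above.

```python
import numpy as np

def sentence_features(paragraph, word_vector):
    """Split a paragraph on periods and sum the 300-d word vectors of each sentence."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    feats = []
    for sent in sentences:
        vecs = [word_vector(w) for w in sent.split()]     # each vec: (300,)
        feats.append(np.sum(vecs, axis=0))
    return np.stack(feats)                                 # Phi: (M, 300)

def shot_features(shots, keyframe_feature):
    """Average the features of the 3 keyframes of each shot, as in the example above."""
    feats = []
    for shot in shots:                                     # shot: list of 3 keyframe images
        vecs = [keyframe_feature(img) for img in shot]     # each vec: (2348,)
        feats.append(np.mean(vecs, axis=0))
    return np.stack(feats)                                 # Psi: (N, 2348)
```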
如此,通过分析文本的段落特征和视频的视频特征来确定第一相似度,为后续确定出与检索条件相匹配的视频提供依据;利用文字本身的语法结构以及视频本身的事件结构等信息,进行视频检索,从而能提高根据文本描述检索视频的准确率。
上述方案中，可选地，所述第一相似度的计算公式为：
$S_{\mathrm{efm}}(P,Q)=\max_{Y}\ \mathrm{tr}(\Phi\Psi^{T}Y)$　　　　式(2)
其中，一个段落特征由M个句子特征组成，设句子特征为 $\varphi_{j}$（$j=1,\cdots,M$），则段落特征表示为 $\Phi=[\varphi_{1},\cdots,\varphi_{M}]^{T}$；一个视频特征由N个镜头特征组成，设镜头特征为 $\psi_{i}$（$i=1,\cdots,N$），则视频特征表示为 $\Psi=[\psi_{1},\cdots,\psi_{N}]^{T}$；设布尔分配矩阵 $Y\in\{0,1\}^{N\times M}$，用于将每个镜头分配给每个句子，其中 $y_{ij}=Y(i,j)=1$ 代表第i个镜头被分配给第j个句子，$y_{ij}=Y(i,j)=0$ 代表第i个镜头未被分配给第j个句子。
上述方案中,可选地,所述第一相似度的计算公式的约束条件包括:
每个镜头最多被分配给1个句子;
序号靠前的镜头被分配到的句子,相对于序号在后的镜头被分配到的句子,更靠前。
因此，可将计算第一相似度转化为求解如下公式（3）的优化目标，将优化目标和约束条件联合起来，可以得到如下优化公式：
$\max_{Y}\ \mathrm{tr}(\Phi\Psi^{T}Y)$　　　　式(3)
$\mathrm{s.t.}\ \ Y\mathbf{1}\le\mathbf{1}$　　　　式(4)
$\iota(y_{i})\le\iota(y_{i+1}),\ \forall i$　　　　式(5)
其中，公式（3）是优化目标；s.t.是such that的缩写，引出表示公式（3）约束条件的公式（4）和（5）；$y_{i}$ 表示Y的第i行向量，$\iota(\cdot)$ 表示一个布尔向量的第一个非零值的序号。公式（4）中，Y是一个矩阵，$\mathbf{1}$ 是一个向量（所有元素都是1的向量），$Y\mathbf{1}$ 是矩阵Y和向量 $\mathbf{1}$ 的乘积。
进一步地，通过传统的动态规划算法，可以得到该优化问题的解。具体地，通过动态规划相关算法，可以解得最优的Y，从而得到 $S_{\mathrm{efm}}(P,Q)$ 的值。
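The dynamic-programming solution referred to above is not spelled out in the text; the sketch below is one straightforward way to maximize tr(ΦΨᵀY) under the two constraints (each shot assigned to at most one sentence, assignments non-decreasing in sentence order). It assumes the sentence features Φ and shot features Ψ have already been embedded into a common space so their dot products are comparable, and it returns only the optimal objective value, omitting recovery of Y itself.

```python
import numpy as np

def event_flow_similarity(phi, psi):
    """
    phi: (M, d) sentence features; psi: (N, d) shot features, both assumed
    to be embedded into a common d-dimensional space.
    Maximizes tr(Phi Psi^T Y) over boolean assignments Y in which each shot
    is assigned to at most one sentence and assignments follow sentence order.
    """
    S = psi @ phi.T                      # (N, M): S[i, j] = <psi_i, phi_j>
    N, M = S.shape
    # dp[i][j]: best score using the first i shots and only the first j sentences
    dp = np.zeros((N + 1, M + 1))
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            dp[i, j] = max(dp[i - 1, j],                     # leave shot i unassigned
                           dp[i, j - 1],                     # do not use sentence j yet
                           dp[i - 1, j] + S[i - 1, j - 1])   # assign shot i to sentence j
    return dp[N, M]
```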
在其他实施例中,也可以对段落特征和视频特征进行其他类型的计算,例如多个段落特征和对应的多个视频特征进行加权或比例运算等,得到所述第一相似度。
步骤S102、确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图。
这里,人物互动图是用于表征人物之间的角色关系和动作关系的图,包括角色节点和动作节点。
在一些可选实施方式中,一个文本对应一个第一人物互动图,一个视频对应一个第二人物互动图。
在一些可选实施方式中,所述确定所述文本的第一人物互动图,包括:检测所述文本中包含的人名;在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;将每个所述人物对应的角色节点和动作节点连接。
其中,数据库是预先存储有大量的人名和肖像的对应关系的库,所述肖像是与该人名对应的人物的肖像。肖像数据可从网络上爬取,如可从imdb网站和tmdb网站上爬取到肖像数据。其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
在一些实施例中,解析确定所述文本的语义树,包括:通过依存句法算法解析确定文本的语义树。例如,利用依存句法算法将每句话分成一个一个的词,然后根据语言学的一些规则,把词作为节点,建一棵语义树。
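The text does not name a particular dependency parser; the sketch below uses spaCy purely as an illustration of how verbs (candidate action nodes) and the character mentions attached to them could be read off a dependency tree.

```python
import spacy

# spaCy is used here only as an example; any dependency parser would do.
nlp = spacy.load("en_core_web_sm")

def action_nodes(sentence):
    """Return (verb, [subject/object mentions]) pairs extracted from one sentence."""
    doc = nlp(sentence)
    actions = []
    for token in doc:
        if token.pos_ == "VERB":
            # nominal subjects / objects attached to this verb become candidate
            # character mentions for the corresponding action node
            args = [c.text for c in token.children
                    if c.dep_ in ("nsubj", "nsubjpass", "dobj", "obj")]
            actions.append((token.lemma_, args))
    return actions

print(action_nodes("Jack punches himself."))   # e.g. [('punch', ['Jack', 'himself'])]
```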
先将每个句子得到一个图,然后每一段有多个句子,就是多个图。但是,在数学上,我们可以把这几个图看成一个图(一个非连接图)。也就是说,在数学上图的定义不一定是要每个节点到另一个节点都有路径可以达到的,也可以是那种可分割成几个小图的图。
其中,如果多个人名指向同一个动作节点,则将所述多个人名的动作节点两两之间用边连接。
其中,边连接的两个节点特征拼接作为边的特征。
示例性地,可将边连接的两个节点特征分别表示为两个向量,将该两个向量进行拼接(例如维度相加),则得到边的特征。比如一个向量3维,另一个向量4维度,直接拼接成7维的向量。举例来说,若将[1,3,4]和[2,5,3,6]拼接,则拼接的结果是[1,3,4,2,5,3,6]。
在一些例子中,可以采用Word2Vec词向量经神经网络处理后的特征作为动作节点的表征,即作为人物的运动特征。
在一些例子中,检测文本中包含的人名时,将文本中的代词替换为所述代词所代表的人名。具体地,通过人名检测工具(如斯坦福人名检测工具包)检测出所有的人名(如“杰克”)。之后通过共指解析工具将代词替换成该词所代表的人名(如“杰克打了他自己一拳”中的“他”提取为“杰克”)。
在一些实施例中,基于人名在数据库中搜索到所述人名对应的人物的肖像,并通过神经网络提取所述肖像的图像特征;其中,所述图像特征包括人脸和身体特征。通过神经网络确定所述文本中每个句子的语义树以及所述语义树上每个词的词性,如名词、代词、动词等,所述语义树上每个节点是所述句子中的一个词,将句子中的动词作为人物的运动特征,即动作节点,将名词或代词对应的人名作为人物角色节点,将人物的肖像的图像特征附加到人物角色节点;根据所述语义树和所述人名,将每个所述人名对应的角色节点和所述人名的动作节点连接,如果多个人名指向同一个动作节点,则所述多个人名两两之间用边连接。
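Putting the steps above together, the following sketch builds the first character interaction graph with networkx. The `portrait_feature`, `verb_feature` and `action_nodes` helpers are hypothetical hooks (the portrait image network, the motion-feature network, and the dependency-parse step sketched earlier); edge features, described above as the concatenation of the two endpoint features, are omitted here for brevity.

```python
import itertools
import networkx as nx

def build_text_interaction_graph(sentences, portrait_feature, verb_feature, action_nodes):
    """Build the first character interaction graph from a list of sentences."""
    g = nx.Graph()
    for s_idx, sent in enumerate(sentences):
        for a_idx, (verb, names) in enumerate(action_nodes(sent)):
            a_node = ("action", s_idx, a_idx)
            g.add_node(a_node, feat=verb_feature(verb))          # action node: motion feature
            for name in names:
                c_node = ("char", name)
                if c_node not in g:
                    g.add_node(c_node, feat=portrait_feature(name))  # character node: portrait feature
                g.add_edge(c_node, a_node)                       # connect character to its action
            # characters sharing the same action node are connected pairwise
            for n1, n2 in itertools.combinations(set(names), 2):
                g.add_edge(("char", n1), ("char", n2))
    return g
```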
在一些可选实施方式中,所述确定所述至少一个视频的第二人物互动图,包括:
检测出所述至少一个视频的每个镜头中的人物;
提取所述人物的人体特征与运动特征;
将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的运动节点上;
将每个人物对应的角色节点和运动节点相连。
这里,一个镜头是指视频中同一摄像机在同一机位拍摄的连续画面,如果画面切换则是另一个镜头,按照现有的镜头切割算法来得到镜头的数量。
这里,所述人体特征是人物的人脸和身体特征,将镜头对应的图像通过训练好的模型可以得到图像中的人物的人体特征。
这里,所述运动特征是将镜头对应的图像输入训练好的模型得到的图像中的人物的运动特征,例如识别得到的人物在当前图像中的动作(如喝水)。
进一步地,所述确定所述至少一个视频的第二人物互动图时,还包括:如果一组人物同时出现在一个镜头中,则将同组人物中的人物的角色节点两两相连;将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
这里,所述相邻镜头是指当前镜头的前一个镜头和后一个镜头。
其中,如果多个角色节点指向同一个动作节点,则将所述多个角色节点的动作节点两两之间用 边连接。
其中,边连接的两个节点特征拼接作为边的特征。
上述边特征的确定过程可参考第一人物互动图中边特征的确定方法,此处不再赘述。
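Analogously, a rough sketch of building the second character interaction graph from per-shot person detections is given below; `detect_people` is a hypothetical detector returning, for each person in a shot, an identity plus the body and motion features mentioned above.

```python
import itertools
import networkx as nx

def build_video_interaction_graph(shots, detect_people):
    """Build the second character interaction graph from a list of shots."""
    g = nx.Graph()
    prev_chars = []
    for t, shot in enumerate(shots):
        chars = []
        for k, person in enumerate(detect_people(shot)):
            c_node = ("char", t, person["id"])
            a_node = ("action", t, k)
            g.add_node(c_node, feat=person["body"])     # character node: body feature
            g.add_node(a_node, feat=person["motion"])   # action node: motion feature
            g.add_edge(c_node, a_node)                  # connect each character to its action
            chars.append(c_node)
        # characters co-occurring in the same shot are connected pairwise
        for c1, c2 in itertools.combinations(chars, 2):
            g.add_edge(c1, c2)
        # each character is also connected to every character of the adjacent shot
        for c1 in chars:
            for c2 in prev_chars:
                g.add_edge(c1, c2)
        prev_chars = chars
    return g
```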
步骤S103、确定所述第一人物互动图和所述第二人物互动图之间的第二相似度。
这里,所述第二相似度是表征第一人物互动图和第二人物互动图二图进行匹配计算得到的相似度。
在一个例子中,电子设备将文本和视频输入到人物互动模块,由人物互动模块构建文本中的第一人物互动图和视频中的第二人物互动图,再通过图匹配算法衡量二图之间的相似度,输出该相似度,即第二相似度。
在一些可选实施方式中，所述第二相似度的计算公式为：
$S_{\mathrm{cim}}(P,Q)=\max_{u}\ u^{T}Ku$　　　　式(6)
其中，u是二值向量（布尔向量），$u_{ia}=1$ 代表 $V_{p}$ 里第i个节点和 $V_{q}$ 里第a个节点能匹配上，$u_{ia}=0$ 代表 $V_{p}$ 里第i个节点和 $V_{q}$ 里第a个节点不能匹配上。同理，$u_{jb}=1$ 代表 $V_{p}$ 里第j个节点和 $V_{q}$ 里第b个节点能匹配上，$u_{jb}=0$ 代表 $V_{p}$ 里第j个节点和 $V_{q}$ 里第b个节点不能匹配上；i,a,j,b都是索引符号；$k_{ia;ia}$ 代表 $V_{p}$ 里第i个节点和 $V_{q}$ 里第a个节点的相似度，$k_{ia;jb}$ 代表 $E_{p}$ 里的边(i,j)和 $E_{q}$ 里的边(a,b)的相似度。
设文本中的第一人物互动图为 $\mathcal{G}_{p}=(V_{p},E_{p})$，其中，$V_{p}$ 是节点的集合，$E_{p}$ 是边的集合；$V_{p}$ 由两种节点构成，$V_{p}^{a}$ 为第一人物互动图中的动作节点，$V_{p}^{c}$ 为第一人物互动图中的角色节点；
设视频中的第二人物互动图为 $\mathcal{G}_{q}=(V_{q},E_{q})$，其中，$V_{q}$ 是节点的集合，$E_{q}$ 是边的集合；$V_{q}$ 由两种节点构成，$V_{q}^{a}$ 为第二人物互动图中的动作节点，$V_{q}^{c}$ 为第二人物互动图中的角色节点；
$|V_{p}|=m=m_{a}+m_{c}$，其中 $m_{a}$ 为动作节点数量，$m_{c}$ 为角色节点数量；
$|V_{q}|=n=n_{a}+n_{c}$，其中 $n_{a}$ 为动作节点数量，$n_{c}$ 为角色节点数量；
给定布尔向量 $u\in\{0,1\}^{nm\times 1}$，如果 $u_{ia}=1$，则代表 $i\in V_{q}$ 被匹配到 $a\in V_{p}$；相似度矩阵 $K\in\mathbb{R}^{nm\times nm}$。
相似度矩阵K的对角线元素为节点的相似度 $k_{ia;ia}=K(ia,ia)$，衡量 $V_{q}$ 中第i个节点和 $V_{p}$ 中第a个节点的相似度；$k_{ia;jb}=K(ia,jb)$ 衡量边 $(i,j)\in E_{q}$ 和边 $(a,b)\in E_{p}$ 的相似度，相似度由节点或边对应的特征，通过点积处理可得。
在一些可选实施方式中,所述第二相似度的计算公式的约束条件包括:
一个节点只能被匹配到另一个集合的最多一个节点;
不同类型的节点不能被匹配。
也就是说,匹配必须是一对一匹配,即一个节点之内被匹配到另一个集合的最多一个节点。不同类型的节点不能被匹配,比如角色节点不能被另一集合的动作节点所匹配。
因此，计算上述第二相似度可转化为求解如下优化公式（7），最终的优化公式和上述约束条件结合起来，可以得到：
$\max_{u}\ u^{T}Ku$　　　　式(7)
$\mathrm{s.t.}\ \ \sum_{i}u_{ia}\le 1,\ \forall a$　　　　式(8)
$\sum_{a}u_{ia}\le 1,\ \forall i$　　　　式(9)
$u_{ia}=0$（第i个节点与第a个节点类型不同时）　　　　式(10)
$u\in\{0,1\}^{nm\times 1}$　　　　式(11)
在解优化的过程中，会得到u，将u带入公式（7）就能得到相似度。
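The text leaves the choice of solver open; the sketch below builds the node and edge affinities of K from feature dot products and approximately maximizes uᵀKu with a simple greedy scheme, respecting the one-to-one and same-type constraints. It assumes that node features of the two graphs (and likewise edge features) have already been embedded into spaces of equal dimension; a principled quadratic-assignment solver could replace the greedy loop.

```python
import numpy as np

def graph_match_similarity(nodes_q, edges_q, nodes_p, edges_p):
    """
    Greedy approximation of max_u u^T K u under the constraints above.
    nodes_*: dict node -> (type, feature vector); edges_*: dict (node, node) -> feature vector.
    The diagonal of K (node affinities) and its off-diagonal entries (edge affinities)
    are taken as dot products of the corresponding features.
    """
    # node affinities, only between nodes of the same type (character / action)
    node_aff = {(i, a): float(np.dot(nodes_q[i][1], nodes_p[a][1]))
                for i in nodes_q for a in nodes_p
                if nodes_q[i][0] == nodes_p[a][0]}
    matches, used_q, used_p, score = [], set(), set(), 0.0
    while True:
        best, best_gain = None, 0.0
        for (i, a), aff in node_aff.items():
            if i in used_q or a in used_p:
                continue
            gain = aff
            # add edge affinities that become active together with already chosen matches
            for (j, b) in matches:
                e_q = edges_q.get((i, j), edges_q.get((j, i)))
                e_p = edges_p.get((a, b), edges_p.get((b, a)))
                if e_q is not None and e_p is not None:
                    gain += float(np.dot(e_q, e_p))
            if gain > best_gain:
                best, best_gain = (i, a), gain
        if best is None:                 # no remaining match improves the score
            break
        matches.append(best)
        used_q.add(best[0]); used_p.add(best[1])
        score += best_gain
    return score
```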
在其他实施例中,也可以通过其他运算方式,例如对匹配的节点特征和动作特征进行加权平均等运算,得到所述第二相似度。
步骤S104、根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
在一些可选实施方式中,所述根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频,包括:对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
在一些实施例中,权重通过数据库中的验证集确定,在验证集上可以通过调权重方式,根据最终检索结果反馈得到一组最优的权重,进而可直接用到测试集上或直接用到实际检索中。
如此,利用文字本身的语法结构以及视频本身的事件结构等信息,进行视频检索,将相似度值最高的视频,确定为与所述检索条件相匹配的视频,能提高根据文本描述检索视频的准确率。
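As a small illustration of this final step, the sketch below scores each candidate video by the weighted sum of the two similarities and returns the best match; `s_efm` and `s_cim` stand for the scoring functions of the two modules, and the weights would in practice be tuned on a validation set as described above.

```python
def best_match(videos, text, s_efm, s_cim, w1=0.5, w2=0.5):
    """Rank candidate videos by w1*S_efm + w2*S_cim and return the top one."""
    scored = [(w1 * s_efm(text, v) + w2 * s_cim(text, v), v) for v in videos]
    return max(scored, key=lambda t: t[0])[1]
```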
当然,在其他实施例中,也可以直接将第一相似度和第二相似度相加,得到每个视频对应的相似度。
上述方案中,所述检索方法通过检索网络实现,该检索网络的训练方法,包括:确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;根据所述总损失值调整所述检索网络的权重参数。
本公开实施例中,所述检索网络对应的检索框架里有不同的组成模块,每个模块里可使用不同类型的神经网络。所述检索框架是事件流模块和人物关系模块共同组成的框架。
在一些可选实施方式中,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定文本的第一人物互动图和所述视频的第二人物互动图之间的相似度。
具体地,将文本和视频输入第一子网络,该第一子网络输出文本与视频的第一相似度预测值;将文本和视频输入第二子网络,该第二子网络输出文本的第一人物互动图和所述视频的第二人物互动图之间的相似度预测值;根据标注的真值,能够得到文本与视频的第一相似度真值,以及所述文本的第一人物互动图和所述视频的第二人物互动图之间的相似度真值,根据第一相似度预测值和第 一相似度真值的差异,可得到第一相似度的损失;根据第二相似度预测值和第二相似度真值得差异,可得到第二相似度的损失;根据第一相似度的损失和第二相似度的损失,再结合损失函数调整第一子网络和第二自网络的网络参数。
在一个例子中,构建了一个数据集,它包含了328部电影的概要,以及概要段落和电影片段之间的注释关联。具体地,该数据集不仅为每部电影提供了高质量的详细概要,而且还通过手动注释将概要的各个段落与电影片段相关联;在这里,每个电影片段可以持续到每个分钟和捕获完整事件。这些电影片段,再加上相关的概要段落,可以让人在更大的范围和更高的语义层次上进行分析。在这个数据集的基础上,本公开利用一个包括事件流模块和人物交互模块的框架来执行电影片段和概要段落之间的匹配。与传统的基于特征的匹配方法相比,该框架可显著提高匹配精度,同时也揭示了叙事结构和人物互动在电影理解中的重要性。
在一些可选实施方式中,所述根据所述总损失值调整所述检索网络的权重参数,包括:
基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
在一些可选实施方式中，所述损失函数表示为：
$\min_{\theta_{efm},\theta_{cim}}\ \mathcal{L}(\theta_{efm},\theta_{cim})$　　　　式(12)
其中，$\theta_{efm}$ 表示在事件流模块中嵌入网络的模型参数，$\theta_{cim}$ 表示在人物交互模块中嵌入网络的模型参数。
其中，Y是事件流模块定义的二值矩阵，u是人物互动模块的二值向量，公式（12）表示通过最小化函数 $\mathcal{L}(\theta_{efm},\theta_{cim})$ 来调整网络的参数，例如下面公式（13）所示得到新的网络参数 $(\theta_{efm}^{*},\theta_{cim}^{*})$：
$(\theta_{efm}^{*},\theta_{cim}^{*})=\arg\min_{\theta_{efm},\theta_{cim}}\ \mathcal{L}(\theta_{efm},\theta_{cim})$　　　　式(13)
其中，$\mathcal{L}(\theta_{efm},\theta_{cim})$ 表示为：
$\mathcal{L}(\theta_{efm},\theta_{cim})=\sum_{i}\sum_{j\neq i}\Big[\max\big(0,\ \alpha-S(Q_{i},P_{i})+S(Q_{i},P_{j})\big)+\max\big(0,\ \alpha-S(Q_{i},P_{i})+S(Q_{j},P_{i})\big)\Big]$　　　　式(14)
其中，$Y^{*}$ 是使得公式（3）的值最大的Y，也称之为最优解。
其中，$u^{*}$ 是使得公式（7）最大的u。
其中，$S(Q_{i},P_{j})$ 表示第i个视频 $Q_{i}$ 与第j个段落 $P_{j}$ 的相似度；$S(Q_{i},P_{i})$ 表示第i个视频 $Q_{i}$ 与第i个段落 $P_{i}$ 的相似度，$S(Q_{j},P_{i})$ 表示第j个视频 $Q_{j}$ 与第i个段落 $P_{i}$ 的相似度；α为损失函数的参数，表示最小相似度差值。
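A batch-style sketch of such a margin-based loss is given below (PyTorch is used only for illustration). The bidirectional max-margin form is a common instantiation consistent with the terms S(Q_i,P_i), S(Q_i,P_j), S(Q_j,P_i) and the margin α described above, but the exact summation is an assumption, as the text does not fully specify it.

```python
import torch

def pairwise_ranking_loss(sim, alpha=0.2):
    """
    sim: (B, B) matrix where sim[i, j] = S(Q_i, P_j) for a batch of B video/paragraph pairs;
    the diagonal holds the matched pairs.  alpha is the minimum similarity margin.
    """
    pos = sim.diag().unsqueeze(1)                                    # S(Q_i, P_i), shape (B, 1)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_q2p = torch.clamp(alpha - pos + sim, min=0)[off_diag].sum()      # wrong paragraphs for a video
    loss_p2q = torch.clamp(alpha - pos.t() + sim, min=0)[off_diag].sum()  # wrong videos for a paragraph
    return loss_q2p + loss_p2q
```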
本公开所述技术方案可用于各种检索任务中,对检索场景不做限定,比如检测场景包括电影片段检索场景、电视剧片段检索场景、短视频检索场景等。
本公开实施例提出的检索方法,确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。如此,相对于传统的基于特征的检索 算法,本公开通过确定文本和至少一个视频之间的第一相似度,所述文本的第一人物互动图和所述至少一个视频的第二人物互动图之间的第二相似度,解决了传统的基于特征的检索算法没有利用文字本身的语法结构以及视频本身的事件结构等信息的问题,采用事件流匹配的方法和基于人物互动图匹配的方法进行视频检索,能提高根据文本描述检索视频的准确率。
对应上述检索方法，本公开实施例提供了一种检索装置，如图3所示，所述装置包括：第一确定模块10，被配置为确定文本和至少一个视频之间的第一相似度，所述文本用于表征检索条件；第二确定模块20，被配置为确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图；确定所述第一人物互动图和所述第二人物互动图之间的第二相似度；处理模块30，被配置为根据所述第一相似度和所述第二相似度，从所述至少一个视频中确定出与所述检索条件相匹配的视频。
在一些实施例中,所述第一确定模块10,被配置为:确定所述文本的段落特征;确定所述至少一个视频的视频特征;根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
在一些实施例中,所述段落特征包括句子特征和句子的数量;所述视频特征包括镜头特征和镜头的数量。
在一些实施例中,所述第二确定模块20,被配置为:检测所述文本中包含的人名;在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;将每个所述人物对应的角色节点和动作节点连接;其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
在一些实施例中,所述第二确定模块20,还被配置为:将连接同一动作节点的角色节点相互连接。
在一些实施例中,所述第二确定模块20,被配置为:将所述文本中的代词替换为所述代词所代表的所述人名。
在一些实施例中,所述第二确定模块20,被配置为:检测出所述至少一个视频的每个镜头中的人物;提取所述人物的人体特征与运动特征;将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的动作节点上;将每个人物对应的角色节点和动作节点相连。
在一些实施例中,所述第二确定模块20,还被配置为:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
在一些实施例中,所述第二确定模块20,还被配置为:将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
在一些实施例中,所述处理模块30,被配置为:对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
在一些实施例中,所述检索装置通过检索网络实现,所述装置还包括:训练模块40,被配置为: 确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;根据所述总损失值调整所述检索网络的权重参数。
在一些实施例中,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定文本的第一人物互动图和所述视频的第二人物互动图之间的相似度;所述训练模块40,被配置为:基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
本领域技术人员应当理解,图3中所示的检索装置中的各处理模块的实现功能可参照前述检索方法的相关描述而理解。本领域技术人员应当理解,图3所示的检索装置中各处理单元的功能可通过运行于处理器上的程序而实现,也可通过具体的逻辑电路而实现。
实际应用中,上述第一确定模块10、第二确定模块20、处理模块30和训练模块40的具体结构均可对应于处理器。所述处理器具体的结构可以为中央处理器(CPU,Central Processing Unit)、微处理器(MCU,Micro Controller Unit)、数字信号处理器(DSP,Digital Signal Processing)或可编程逻辑器件(PLC,Programmable Logic Controller)等具有处理功能的电子元器件或电子元器件的集合。其中,所述处理器包括可执行代码,所述可执行代码存储在存储介质中,所述处理器可以通过总线等通信接口与所述存储介质中相连,在执行具体的各单元的对应功能时,从所述存储介质中读取并运行所述可执行代码。所述存储介质用于存储所述可执行代码的部分优选为非瞬间存储介质。
本公开实施例提供的检索装置,能提高根据文本检索视频的准确率。
本公开实施例还记载了一种检索装置,所述装置包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现前述任意一个技术方案提供的检索方法。
作为一种实施方式,所述处理器执行所述程序时实现:确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
作为一种实施方式,所述处理器执行所述程序时实现:所述确定文本和至少一个视频之间的第一相似度,包括:确定所述文本的段落特征;确定所述至少一个视频的视频特征;根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
作为一种实施方式,所述处理器执行所述程序时实现:检测所述文本中包含的人名;在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;将每个所述人物对应的角色节点和动作节点连接;其中,所述人物的角色节点用肖像的图像特征表 征;所述人物的动作节点采用语义树中的运动特征表征。
作为一种实施方式,所述处理器执行所述程序时实现:将连接同一动作节点的角色节点相互连接。
作为一种实施方式,所述处理器执行所述程序时实现:将所述文本中的代词替换为所述代词所代表的所述人名。
作为一种实施方式,所述处理器执行所述程序时实现:检测出所述至少一个视频的每个镜头中的人物;提取所述人物的人体特征与运动特征;将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的动作节点上;将每个人物对应的角色节点和动作节点相连。
作为一种实施方式,所述处理器执行所述程序时实现:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
作为一种实施方式,所述处理器执行所述程序时实现:将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
作为一种实施方式,所述处理器执行所述程序时实现:对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
作为一种实施方式,所述处理器执行所述程序时实现:确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;根据所述总损失值调整检索网络的权重参数。
作为一种实施方式,所述处理器执行所述程序时实现:基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
本公开实施例提供的检索装置,能提高根据文本描述检索视频的准确率。
本公开实施例还记载了一种计算机存储介质,所述计算机存储介质中存储有计算机可执行指令,所述计算机可执行指令用于执行前述各个实施例所述的检索方法。也就是说,所述计算机可执行指令被处理器执行之后,能够实现前述任意一个技术方案提供的检索方法。该计算机存储介质可以是易失性计算机可读存储介质或非易失性计算机可读存储介质。
本公开实施例还提供了一种计算机程序产品,包括计算机可读代码,当计算机可读代码在设备上运行时,设备中的处理器执行用于实现如上任一实施例提供的检索方法。
该上述计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
本领域技术人员应当理解,本实施例的计算机存储介质中各程序的功能,可参照前述各实施例所述的检索方法的相关描述而理解。
在本公开所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本公开各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本公开上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以所述权利要求的保护范围为准。
工业实用性
本公开实施例提供的技术方案,确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。如此,相对于传统的基于特征的检索算法,本公开通过确定文本和至少一个视频之间的第一相似度,所述文本的第一人物互动图和所述 至少一个视频的第二人物互动图之间的第二相似度,可以利用文字本身的语法结构以及视频本身的事件结构等信息,进行视频检索,从而能提高根据文本描述检索视频如电影的准确率。

Claims (27)

  1. 一种检索方法,所述方法包括:
    确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;
    确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;
    确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;
    根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
  2. 根据权利要求1所述的检索方法,其中,所述确定文本和至少一个视频之间的第一相似度,包括:
    确定所述文本的段落特征;
    确定所述至少一个视频的视频特征;
    根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
  3. 根据权利要求2所述的检索方法,其中,所述段落特征包括句子特征和句子的数量;所述视频特征包括镜头特征和镜头的数量。
  4. 根据权利要求1至3任一项所述的检索方法,其中,所述确定所述文本的第一人物互动图,包括:
    检测所述文本中包含的人名;
    在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;
    解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;
    将每个所述人物对应的角色节点和动作节点连接;
    其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
  5. 根据权利要求4所述的检索方法,其中,所述方法还包括:
    将连接同一动作节点的角色节点相互连接。
  6. 根据权利要求4或5所述的检索方法,其中,所述检测所述文本中包含的人名,包括:
    将所述文本中的代词替换为所述代词所代表的所述人名。
  7. 根据权利要求1至6任一项所述的检索方法,其中,所述确定所述至少一个视频的第二人物互动图,包括:
    检测出所述至少一个视频的每个镜头中的人物;
    提取所述人物的人体特征与运动特征;
    将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物 的动作节点上;
    将每个人物对应的角色节点和动作节点相连。
  8. 根据权利要求7所述的检索方法,其中,所述确定所述至少一个视频的第二人物互动图,还包括:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
  9. 根据权利要求7或8所述的检索方法,其中,所述确定所述至少一个视频的第二人物互动图,还包括:
    将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
  10. 根据权利要求1至9任一项所述的检索方法,其中,所述根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频,包括:
    对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;
    将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
  11. 根据权利要求1至10任一项所述的检索方法,其中,所述检索方法通过检索网络实现,所述方法还包括:
    确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;
    确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;
    根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;
    根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;
    根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;
    根据所述总损失值调整所述检索网络的权重参数。
  12. 根据权利要求11所述的检索方法,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定文本的第一人物互动图和所述视频的第二人物互动图之间的相似度;
    所述根据所述总损失值调整所述检索网络的权重参数,包括:
    基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
  13. 一种检索装置,所述装置包括:
    第一确定模块,被配置为确定文本和至少一个视频之间的第一相似度,所述文本用于表征检索条件;
    第二确定模块,被配置为确定所述文本的第一人物互动图和所述至少一个视频的第二人物互动图;确定所述第一人物互动图和所述第二人物互动图之间的第二相似度;
    处理模块,被配置为根据所述第一相似度和所述第二相似度,从所述至少一个视频中确定出与所述检索条件相匹配的视频。
  14. 根据权利要求13所述的检索装置,其中,所述第一确定模块,被配置为:
    确定所述文本的段落特征;
    确定所述至少一个视频的视频特征;
    根据所述文本的段落特征和所述至少一个视频的视频特征,确定所述文本和所述至少一个视频之间的第一相似度。
  15. 根据权利要求14所述的检索装置,其中,所述段落特征包括句子特征和句子的数量;所述视频特征包括镜头特征和镜头的数量。
  16. 根据权利要求13至15任一项所述的检索装置,其中,所述第二确定模块,被配置为:
    检测所述文本中包含的人名;
    在数据库中搜索到所述人名对应的人物的肖像,并提取所述肖像的图像特征,得到所述人物的角色节点;
    解析确定所述文本的语义树,基于所述语义树得到所述人物的运动特征,得到所述人物的动作节点;
    将每个所述人物对应的角色节点和动作节点连接;
    其中,所述人物的角色节点用肖像的图像特征表征;所述人物的动作节点采用语义树中的运动特征表征。
  17. 根据权利要求16所述的检索装置,其中,所述第二确定模块,还被配置为:
    将连接同一动作节点的角色节点相互连接。
  18. 根据权利要求16或17所述的检索装置,其中,所述第二确定模块,被配置为:
    将所述文本中的代词替换为所述代词所代表的所述人名。
  19. 根据权利要求13至18任一项所述的检索装置,其中,所述第二确定模块,被配置为:
    检测出所述至少一个视频的每个镜头中的人物;
    提取所述人物的人体特征与运动特征;
    将所述人物的人体特征附加到所述人物的角色节点上,将所述人物的运动特征附加到所述人物的动作节点上;
    将每个人物对应的角色节点和动作节点相连。
  20. 根据权利要求19所述的检索装置,其中,所述第二确定模块,还被配置为:将同时出现在一个镜头中的一组人物作为同组人物,将所述同组人物中的人物的角色节点两两相连。
  21. 根据权利要求19或20所述的检索装置,其中,所述第二确定模块,还被配置为:
    将一个镜头中的一位人物和其相邻镜头的每个人物的角色节点都相连。
  22. 根据权利要求13至21任一项所述的检索装置,其中,所述处理模块,被配置为:
    对每个视频的所述第一相似度和所述第二相似度加权求和,得到每个视频的相似度值;
    将相似度值最高的视频,确定为与所述检索条件相匹配的视频。
  23. 根据权利要求13至22任一项所述的检索装置,其中,所述检索装置通过检索网络实现,所述装置还包括:
    训练模块,被配置为:
    确定文本和训练样本集中的视频之间的第一相似度预测值,所述文本用于表征检索条件;
    确定所述文本的第一人物互动图和所述训练样本集中的视频的第二人物互动图之间的第二相似度;
    根据所述第一相似度预测值与所述第一相似度真值确定所述第一相似度的损失;
    根据所述第二相似度预测值与所述第二相似度真值确定所述第二相似度的损失;
    根据所述第一相似度的损失以及所述第二相似度的损失,结合损失函数确定总损失值;
    根据所述总损失值调整所述检索网络的权重参数。
  24. 根据权利要求23所述的检索装置,所述检索网络包括第一子网络以及第二子网络;所述第一子网络用于确定文本与视频的第一相似度,所述第二子网络用于确定文本的第一人物互动图和所述视频的第二人物互动图之间的相似度;
    所述训练模块,被配置为:
    基于所述总损失值调整所述第一子网络以及所述第二子网络的权重参数。
  25. 一种检索装置,所述装置包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现权利要求1至12任一项所述的检索方法。
  26. 一种存储介质,所述存储介质存储有计算机程序,所述计算机程序被处理器执行时,能够使得所述处理器执行权利要求1至12任一项所述的检索方法。
  27. 一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行用于实现权利要求1至12中的任一项所述的检索方法。
PCT/CN2019/118196 2019-09-29 2019-11-13 检索方法及装置、存储介质 WO2021056750A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
SG11202107151TA SG11202107151TA (en) 2019-09-29 2019-11-13 Search method and device, and storage medium
KR1020217011348A KR20210060563A (ko) 2019-09-29 2019-11-13 검색 방법 및 장치, 저장 매체
JP2021521293A JP7181999B2 (ja) 2019-09-29 2019-11-13 検索方法及び検索装置、記憶媒体
US17/362,803 US20210326383A1 (en) 2019-09-29 2021-06-29 Search method and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910934892.5A CN110659392B (zh) 2019-09-29 2019-09-29 检索方法及装置、存储介质
CN201910934892.5 2019-09-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/362,803 Continuation US20210326383A1 (en) 2019-09-29 2021-06-29 Search method and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021056750A1 true WO2021056750A1 (zh) 2021-04-01

Family

ID=69038407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118196 WO2021056750A1 (zh) 2019-09-29 2019-11-13 检索方法及装置、存储介质

Country Status (7)

Country Link
US (1) US20210326383A1 (zh)
JP (1) JP7181999B2 (zh)
KR (1) KR20210060563A (zh)
CN (1) CN110659392B (zh)
SG (1) SG11202107151TA (zh)
TW (1) TWI749441B (zh)
WO (1) WO2021056750A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259118B (zh) * 2020-05-06 2020-09-01 广东电网有限责任公司 一种文本数据检索方法及装置
CN112256913A (zh) * 2020-10-19 2021-01-22 四川长虹电器股份有限公司 一种基于图模型比对的视频搜索方法
CN113204674B (zh) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 基于局部-整体图推理网络的视频-段落检索方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060018516A1 (en) * 2004-07-22 2006-01-26 Masoud Osama T Monitoring activity using video information
CN103440274A (zh) * 2013-08-07 2013-12-11 北京航空航天大学 一种基于细节描述的视频事件概要图构造和匹配方法
CN105279495A (zh) * 2015-10-23 2016-01-27 天津大学 一种基于深度学习和文本总结的视频描述方法
CN106127803A (zh) * 2016-06-17 2016-11-16 北京交通大学 人体运动捕捉数据行为分割方法及系统
CN106462747A (zh) * 2014-06-17 2017-02-22 河谷控股Ip有限责任公司 活动识别系统和方法

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7877774B1 (en) * 1999-04-19 2011-01-25 At&T Intellectual Property Ii, L.P. Browsing and retrieval of full broadcast-quality video
JP4909200B2 (ja) * 2006-10-06 2012-04-04 日本放送協会 人間関係グラフ生成装置及びコンテンツ検索装置、並びに、人間関係グラフ生成プログラム及びコンテンツ検索プログラム
US8451292B2 (en) * 2009-11-23 2013-05-28 National Cheng Kung University Video summarization method based on mining story structure and semantic relations among concept entities thereof
JP5591670B2 (ja) * 2010-11-30 2014-09-17 株式会社東芝 電子機器、人物相関図出力方法、人物相関図出力システム
CN103365854A (zh) * 2012-03-28 2013-10-23 鸿富锦精密工业(深圳)有限公司 视频文件检索系统及检索方法
CN103200463A (zh) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 一种视频摘要生成方法和装置
JP6446987B2 (ja) * 2014-10-16 2019-01-09 日本電気株式会社 映像選択装置、映像選択方法、映像選択プログラム、特徴量生成装置、特徴量生成方法及び特徴量生成プログラム
JP2019008684A (ja) * 2017-06-28 2019-01-17 キヤノンマーケティングジャパン株式会社 情報処理装置、情報処理システム、情報処理方法およびプログラム
CN109783655B (zh) * 2018-12-07 2022-12-30 西安电子科技大学 一种跨模态检索方法、装置、计算机设备和存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060018516A1 (en) * 2004-07-22 2006-01-26 Masoud Osama T Monitoring activity using video information
CN103440274A (zh) * 2013-08-07 2013-12-11 北京航空航天大学 一种基于细节描述的视频事件概要图构造和匹配方法
CN106462747A (zh) * 2014-06-17 2017-02-22 河谷控股Ip有限责任公司 活动识别系统和方法
CN105279495A (zh) * 2015-10-23 2016-01-27 天津大学 一种基于深度学习和文本总结的视频描述方法
CN106127803A (zh) * 2016-06-17 2016-11-16 北京交通大学 人体运动捕捉数据行为分割方法及系统

Also Published As

Publication number Publication date
TW202113575A (zh) 2021-04-01
KR20210060563A (ko) 2021-05-26
CN110659392A (zh) 2020-01-07
JP2022505320A (ja) 2022-01-14
TWI749441B (zh) 2021-12-11
SG11202107151TA (en) 2021-07-29
CN110659392B (zh) 2022-05-06
US20210326383A1 (en) 2021-10-21
JP7181999B2 (ja) 2022-12-01

Similar Documents

Publication Publication Date Title
CN112131366B (zh) 训练文本分类模型及文本分类的方法、装置及存储介质
WO2022155994A1 (zh) 基于注意力的深度跨模态哈希检索方法、装置及相关设备
CN112270196B (zh) 实体关系的识别方法、装置及电子设备
WO2021051871A1 (zh) 文本抽取方法、装置、设备及存储介质
WO2020233380A1 (zh) 缺失语义补全方法及装置
CN113627447B (zh) 标签识别方法、装置、计算机设备、存储介质及程序产品
CN111143576A (zh) 一种面向事件的动态知识图谱构建方法和装置
US8577882B2 (en) Method and system for searching multilingual documents
TWI749441B (zh) 檢索方法及裝置、儲存介質
CN111159485B (zh) 尾实体链接方法、装置、服务器及存储介质
CN111597314A (zh) 推理问答方法、装置以及设备
WO2023179429A1 (zh) 一种视频数据的处理方法、装置、电子设备及存储介质
WO2018232699A1 (zh) 一种信息处理的方法及相关装置
Nian et al. Learning explicit video attributes from mid-level representation for video captioning
CN110234018A (zh) 多媒体内容描述生成方法、训练方法、装置、设备及介质
JP6729095B2 (ja) 情報処理装置及びプログラム
WO2022134793A1 (zh) 视频帧语义信息的提取方法、装置及计算机设备
CN113704460A (zh) 一种文本分类方法、装置、电子设备和存储介质
WO2021012958A1 (zh) 原创文本甄别方法、装置、设备与计算机可读存储介质
CN110852066A (zh) 一种基于对抗训练机制的多语言实体关系抽取方法及系统
JP2023002690A (ja) セマンティックス認識方法、装置、電子機器及び記憶媒体
CN112529743B (zh) 合同要素抽取方法、装置、电子设备及介质
CN118133839A (zh) 基于语义信息推理和跨模态交互的图文检索方法及系统
CN114417823A (zh) 一种基于句法和图卷积网络的方面级情感分析方法及装置
CN113342944A (zh) 一种语料泛化方法、装置、设备及存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021521293

Country of ref document: JP

Kind code of ref document: A

Ref document number: 20217011348

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946811

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946811

Country of ref document: EP

Kind code of ref document: A1