CN113343029A - Social relationship enhanced complex video character retrieval method - Google Patents

Social relationship enhanced complex video character retrieval method

Info

Publication number
CN113343029A
CN113343029A
Authority
CN
China
Prior art keywords
nodes
node
candidate
query
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110677925.XA
Other languages
Chinese (zh)
Other versions
CN113343029B (en)
Inventor
徐童
陈恩红
李丹
周培伦
何伟栋
郝艳宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110677925.XA priority Critical patent/CN113343029B/en
Publication of CN113343029A publication Critical patent/CN113343029A/en
Application granted granted Critical
Publication of CN113343029B publication Critical patent/CN113343029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a social relationship enhanced complex video person retrieval method. On the one hand, by using social relationships as a link, semantic information related to the target person can be fully mined and revealed, so that all segments in which the target person appears can be accurately retrieved, providing a foundation for other related applications; on the other hand, a stronger complex video person retrieval model is obtained, achieving better results on subjective and objective metrics such as precision, recall and fluency.

Description

Social relationship enhanced complex video character retrieval method
Technical Field
The invention relates to the technical field of computer vision, in particular to a complex video character retrieval method with enhanced social relationship.
Background
Person retrieval in complex videos is an important problem in video analysis: it aims to extract, from a complete video, all segments that contain a particular person. With the rise of a large number of emerging video media platforms, its real-world application scenarios are increasingly broad. Some movie enthusiasts, or viewers fond of certain characters, may wish to produce person-oriented summaries, such as a compilation of all segments in which a particular star appears in a given film. Using computer technology to automatically and effectively extract and understand video content, and thereby retrieve the target person, can better help people grasp video content quickly and accurately, and has clear application value. Mainstream video media platforms already offer many intelligent video analysis functions that let users query and understand video content more conveniently. For example, the "watch only him/her" function introduced by Youku and iQiyi can automatically generate clips of a specific video character according to the user's preference, thereby condensing massive video data and extracting its key information.
However, conventional person retrieval methods are based almost entirely on visual features and rarely exploit the rich high-level semantic information available in images and text. In fact, besides the visual features formed by video frames, a video also contains a large amount of heterogeneous text information, such as subtitles and bullet-screen comments; visual and textual information can jointly reveal the scene information of the current segment and can provide reliable high-level semantic clues when the visual information is of low quality. Therefore, if these high-level semantic clues can be described formally and used to assist person retrieval in complex videos, the task of person-oriented video summarization can be completed better. More importantly, when such semantic information is combined with the persons' social relationships, the resulting social context provides a very important clue for retrieving a person. For example, if it can be judged from the "classmate" relationship between two persons appearing in the current frame that the current scene is related to school, the other persons appearing are, with high probability, in a teacher-student relationship with them, which narrows the candidate set of persons. Therefore, enhancement with social cues has great application potential for the person retrieval task.
Disclosure of Invention
The invention aims to provide a social relationship enhanced complex video character retrieval method, which combines visual information and multi-source text information to abstract out the social relationship in a video situation and provides a reliable semantic clue for character retrieval in the complex scene, thereby improving the accuracy of character retrieval results.
The purpose of the invention is realized by the following technical scheme:
a social relationship enhanced complex video person retrieval method comprises the following steps:
sampling a video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing character detection and scene segmentation on the video frame sequence, and establishing a character area set contained in each scene by combining a timestamp of a video to be retrieved; constructing a corresponding relation propagation graph for each character region set, wherein the character regions are used as nodes and are called candidate nodes, the candidate nodes are connected in a complete graph mode, a category vector is initialized for each candidate node, and each candidate node is connected with all query nodes in the relation graph corresponding to the given target query character set;
for each relation propagation graph, predicting the relation and the category probability between two candidate nodes by combining corresponding text information; calculating the similarity between candidate nodes and between the candidate nodes and the query node by using the visual features corresponding to the candidate nodes and the query node as well as the relation and the category probability between the candidate nodes, and updating the category vector of each candidate node by combining the feature aggregation mode of the neighbor nodes until convergence; each dimension in the category vector represents the probability of belonging to one target query figure, the dimension number is the number of the target query figures, the maximum value of the probability in the converged category vector represents that the candidate node is the corresponding target query figure, and the image of the figure region corresponding to the candidate node and the category vector after iteration of the candidate node are used as the retrieval result of the corresponding target query figure.
According to the technical scheme provided by the invention, on one hand, the social relationship is used as a link, so that semantic information related to the target person can be fully mined and disclosed, all the field segments of the target person can be accurately searched, and a foundation can be provided for supporting other related applications; on the other hand, a more excellent complex character retrieval model can be provided, and better results can be obtained on subjective and objective indexes such as accuracy, recall rate and fluency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a model framework diagram of a complex video character retrieval method with enhanced social relationships according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a complex video character retrieval method with enhanced social relationship, which mainly comprises the following steps:
step 1, sampling a video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved.
In this step, the complete video to be retrieved can be sampled at equal intervals to obtain a sequence of video frames; then, corresponding text information is extracted from the video to be retrieved and subjected to denoising and time-axis correction; the text information includes bullet-screen text information and subtitle text information.
Step 2, carrying out character detection and scene segmentation on the video frame sequence, and establishing a character region set contained in each scene by combining a timestamp of a video to be retrieved; and constructing a corresponding relation propagation graph for each character region set, wherein the character regions are used as nodes and are called candidate nodes, the candidate nodes are connected in a complete graph mode, a category vector is initialized for each candidate node, and each candidate node is connected with all query nodes in the relation graph corresponding to the given target query character set.
The method mainly comprises three parts:
the first part is human detection: and detecting the character areas of each frame in the video frame sequence, extracting the visual characteristics of each character area and recording the time stamp of each character area corresponding to the video to be retrieved.
The second part is scene segmentation: based on visual style transformation, a video frame sequence is divided into a plurality of video frame segments, each video frame segment is used as a scene, and the starting and ending time stamps of the video to be retrieved corresponding to each scene are recorded.
The third part is to establish the relationship propagation graph. First, a person region set is established for each scene based on the scene's start and stop timestamps and the timestamp of each person region in the video to be retrieved; the person region set of a scene contains all person regions whose timestamps fall within that scene's start-stop range. Then, a relationship propagation graph is established for each scene from its person region set: the nodes of the graph are person regions, called candidate nodes, the edges carry the similarity between candidate nodes, and all candidate nodes are connected to one another to form a complete graph. In addition, each candidate node maintains an iteratively updated category vector identifying the person category to which it belongs. Meanwhile, an edge pointing to each query node is also generated for every candidate node.
Step 3, predicting the relation and the category probability between the two candidate nodes by combining the corresponding text information for each relation propagation graph; calculating the similarity between candidate nodes and between the candidate nodes and the query node by using the visual features corresponding to the candidate nodes and the query node as well as the relation and the category probability between the candidate nodes, and updating the category vector of each candidate node by combining the feature aggregation mode of the neighbor nodes until convergence; each dimension in the category vector represents the probability of belonging to one target query figure, the dimension number is the number of the target query figures, the maximum value of the probability in the converged category vector represents that the candidate node is the corresponding target query figure, and the image of the figure region corresponding to the candidate node and the category vector after iteration of the candidate node are used as the retrieval result of the corresponding target query figure.
In this step, relation-aware label propagation is executed scene by scene: the category vector of each node in the relationship propagation graph is computed iteratively, finally yielding the person category to which each node (i.e., each person region) belongs. Relation-aware label propagation includes the following components: first, the person relationships between different nodes in the current scene are detected; then, based on the detected relationships, the similarity between nodes is computed following the idea of seeking the maximum social relationship network matching; finally, the category vectors of neighboring nodes are aggregated according to the inter-node similarity. Seeking the maximum social relationship network follows two principles:
1) different people with specific social relationships often co-occur or alternate in a short time, and thus one person can be used to assist in identifying other people.
2) The characters in the video are correlated with the scenes, and the social relationship can be adopted as a link between the scenes and the characters so as to narrow down the candidate set of the characters which possibly appear in the current scene.
During relation-aware label propagation, the label information in the pre-labeled relation graph among query nodes is introduced, and finally all scene segments in which the persons corresponding to the query nodes appear are found among all scenes of the video to be retrieved.
FIG. 1 is a model framework diagram of the above method of the present invention. The top left corner shows the input data, mainly the video frame sequence (Video Frames) and text information (Textual Documents) obtained in step 1, together with a given target query person set and its corresponding relation graph (Relation Graph), where each target query person corresponds to a clear frontal person region image and the relation graph is labeled with the social relationships between the target query persons. The top and upper right parts correspond to step 2: after person detection and scene segmentation, person regions (Detected Regions) and scene segments (Scene Segments) are obtained, yielding the person region set contained in each scene (e.g., Scene Units 1-n in FIG. 1), and the related text information can be located directly by timestamp. Then a relationship propagation graph (Graph for Scene Unit) is established from the person region set of each scene, and each candidate node (Gallery Node) in the graph generates an edge pointing to each query node (Query Node). Finally, according to the relation-aware label propagation method introduced in step 3, the retrieval result is obtained, i.e., the segments, shown in the lower left corner, in which the target query persons appear across all scenes of the video to be retrieved. It should be noted that the various person images and text messages shown in FIG. 1 are for illustration only and are not limiting; the person images come from an existing public research data set and raise no personal privacy concerns.
For ease of understanding, the following description will be directed to preferred embodiments of the above steps.
First, data preprocessing.
In the embodiment of the invention, the processed data mainly refers to video data, text data and data related to a target query person.
For video data, a sequence of video frames can be obtained by equally spaced sampling. For example, sampling at a frequency of 0.5 frames/second yields the video frame sequence; the specific sampling frequency may be set based on a trade-off between performance and efficiency.
For text data, corresponding text information is extracted from the video to be retrieved and then subjected to denoising and time-axis correction; the text information here mainly covers subtitle text and bullet-screen text.
In particular, for high-noise barrage text information, in order to filter out irrelevant text, regular rules may be adopted to filter symbolic characters and the like, and meanwhile, the transmission time of the barrage is corrected according to the typing speed (which is set according to actual data, generally about 30 words/minute), so as to obtain a video frame sequence and corresponding corrected text information, i.e., preprocessed data.
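For illustration, a minimal sketch of this preprocessing step is given below; it assumes OpenCV for frame sampling, and the regular expression, typing-speed correction and function names are illustrative assumptions rather than part of the claimed method.

```python
import cv2
import re

def sample_frames(video_path, fps_out=0.5):
    """Sample a video at equal intervals (default 0.5 frames/second)."""
    cap = cv2.VideoCapture(video_path)
    fps_in = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps_in / fps_out)))  # keep one frame every `step` frames
    frames, timestamps = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            timestamps.append(idx / fps_in)      # timestamp in seconds
        idx += 1
    cap.release()
    return frames, timestamps

def clean_danmaku(items, typing_speed_wpm=30):
    """Filter symbol-only bullet-screen comments and shift their timestamps
    backwards by an estimated typing delay (illustrative correction)."""
    cleaned = []
    for text, send_time in items:
        text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)  # drop symbols/emoticons
        if not text:
            continue
        delay = len(text) / typing_speed_wpm * 60.0      # seconds spent typing
        cleaned.append((text, max(0.0, send_time - delay)))
    return cleaned
```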
For the data related to the target query persons, a corresponding target query person set can be generated from the specified target query persons; a clear frontal person region image is selected for each target query person, and the social relationships among the target query persons are manually labeled.
Second, establishing the relationship propagation graph.
As previously described, this step includes three parts: person detection, scene segmentation, and graph creation. The preferred embodiment of each section is as follows:
1. and (5) detecting people.
In the embodiment of the invention, relevant text information is screened with a time window of a certain range, taking the time of the video frame to which the person region belongs as the reference point (0 s). Empirically, the time window for bullet-screen text is [-10s, 15s] around the current time, and the time window for subtitle text is [-45s, 45s]. The specific window lengths can be adjusted as desired. These pairs of person regions and filtered relevant text are used as the input for person detection.
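A minimal sketch of this time-window filtering follows, using the window bounds given above; the data structures (lists of (text, timestamp) pairs) are assumptions.

```python
def texts_in_window(region_time, texts, lo, hi):
    """Return all (text, timestamp) pairs whose timestamp lies in
    [region_time + lo, region_time + hi] (seconds)."""
    return [(t, ts) for (t, ts) in texts
            if region_time + lo <= ts <= region_time + hi]

def collect_context(region_time, danmaku, subtitles):
    """Bullet-screen window [-10s, 15s], subtitle window [-45s, 45s]."""
    return {
        "danmaku": texts_in_window(region_time, danmaku, -10, 15),
        "subtitles": texts_in_window(region_time, subtitles, -45, 45),
    }
```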
Person detection can use a Faster R-CNN based person detector to locate, frame by frame and without distinction, all person regions appearing in the video frame sequence. Each person region output by Faster R-CNN is then fed into two channels to obtain visual features of different levels. The first channel is a pedestrian re-identification network based on multi-scale convolution (e.g., Cross-Level Semantic Alignment); through this network the person region obtains an overall visual feature description, which can be understood as a whole-body feature covering dimensions such as clothing, posture and stature, and is called the re-identification feature. The second channel is a convolutional network based on face recognition (e.g., FaceNet), which extracts facial features (if any) from the input person region. The features of the two channels correspond to visual features of a person region at two different levels and are collectively referred to as its visual features.
Illustratively, Faster R-CNN can be initialized with the VGG-16 network, after which a simple binary classifier (person or not) is built on top of it and retrained on an image data set containing only persons to achieve more accurate detection. The Cross-Level Semantic Alignment network can be initialized with a ResNet-50 backbone to obtain its 1024-dimensional re-identification feature output. For facial feature extraction, MTCNN can be used for face localization, after which 1024-dimensional facial features are obtained through FaceNet.
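The following sketch outlines the detection-plus-two-channel feature extraction described above. It substitutes a torchvision Faster R-CNN and the facenet-pytorch MTCNN/FaceNet models for the networks named in the embodiment, and leaves the multi-scale re-identification encoder as a placeholder argument, so it illustrates the data flow rather than the exact models used.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from facenet_pytorch import MTCNN, InceptionResnetV1

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # person detector (stand-in)
mtcnn = MTCNN(image_size=160)                                  # face localization
facenet = InceptionResnetV1(pretrained="vggface2").eval()      # facial features

def detect_persons(frame_tensor, score_thr=0.8):
    """Return person bounding boxes from one CHW float frame (COCO class 1 = person)."""
    with torch.no_grad():
        out = detector([frame_tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thr)
    return out["boxes"][keep]

def extract_features(person_crop_pil, reid_encoder):
    """Two channels: whole-body re-id features and (optional) facial features.
    `reid_encoder` stands in for the multi-scale re-identification network."""
    reid_feat = reid_encoder(person_crop_pil)          # e.g. a 1024-d re-id vector
    face = mtcnn(person_crop_pil)                      # None if no face is found
    face_feat = None
    if face is not None:
        with torch.no_grad():
            face_feat = facenet(face.unsqueeze(0))[0]  # 512-d embedding of this stand-in model
    return reid_feat, face_feat
```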
2. Scene segmentation.
PySceneDetect may be selected as the scene segmenter: it divides the preprocessed video frame sequence into different segments according to visual style, so that video frames within the same segment are similar in visual style (e.g., background color and image elements).
Although scenes in a complex video vary greatly, so that different scenes of the same person can show large visual differences, once the video is divided into a series of scenes at visual boundaries, the visual characteristics of the same person within a single scene remain relatively stable, which makes good use of the temporal information in the complex video. Meanwhile, segmenting scenes separates the social relationship semantics of different scenes and improves the purity of the semantic information. Finally, the start and stop timestamps of each scene output by the scene segmentation are recorded.
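As an illustration, the current PySceneDetect API can produce such start/stop timestamps and the subsequent per-scene grouping of person regions; the threshold value and the dictionary field names are assumptions.

```python
from scenedetect import detect, ContentDetector

def segment_scenes(video_path, threshold=27.0):
    """Split the video into scenes by visual-style change and return
    a list of (start_seconds, end_seconds) pairs."""
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]

def group_regions_by_scene(regions, scenes):
    """Assign each detected person region (with a `timestamp` in seconds)
    to the scene whose start/stop range contains it."""
    buckets = [[] for _ in scenes]
    for region in regions:
        for i, (start, end) in enumerate(scenes):
            if start <= region["timestamp"] < end:
                buckets[i].append(region)
                break
    return buckets
```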
3. Graph construction.
Combining the results of the first two parts, organizing all the human regions into human region sets in different scenes according to the scene starting and stopping time, wherein each scene corresponding human region set comprises all human regions with time stamps within the starting and stopping time range of the scene.
A relationship propagation graph is established for the person regions in each scene: the nodes of the graph are person regions, the edges carry the feature similarity between nodes, and the node features comprise re-identification vectors and face vectors. Meanwhile, each node maintains an iteratively updated category vector identifying the person category to which it belongs; the nodes built from person regions in the relationship propagation graph are referred to as candidate nodes.
Denote the target query person set Q = {q_1, q_2, ..., q_n}, where q_i represents the person region of the i-th query person, whose re-identification feature and facial feature are obtained in the way introduced in the person detection part; each node q_i is regarded as a query node, and n represents the number of query nodes in the target query person set. Denote the candidate node set in a single relationship propagation graph G = {g_1, g_2, ..., g_m}, where m represents the number of candidate nodes in the graph. Edges are generated as follows: all candidate nodes in G are connected to one another as a complete graph, and every candidate node g_j also generates an edge to each of the n query nodes q_1, q_2, ..., q_n; the graph thus finally contains
m(m-1)/2 + m·n
edges and m + n nodes.
At the same time, a category vector (label vector) is initialized for each node to identify the person identity of its corresponding person region. For a query node q_i ∈ Q, its label vector is initialized to p_i (kept fixed and never updated), the n-dimensional one-hot encoding whose i-th dimension is 1; for all candidate nodes, the category vectors are initialized to zero vectors since their identities are unknown. In the following description, q denotes a query node, g denotes a candidate node, and the indices of q and g distinguish different query nodes or candidate nodes.
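A minimal sketch of this graph construction and label-vector initialization, assuming a simple list/NumPy representation of nodes and edges (the edge count matches m(m-1)/2 + m·n):

```python
import numpy as np

def build_propagation_graph(candidates, queries):
    """candidates: list of m person regions in one scene;
    queries: list of n target query persons.
    Returns candidate-candidate edges, candidate-query edges,
    and the initial category (label) vectors."""
    m, n = len(candidates), len(queries)

    # Complete graph over candidates: m(m-1)/2 undirected edges.
    candidate_edges = [(j, k) for j in range(m) for k in range(j + 1, m)]
    # Every candidate is also linked to all n query nodes: m*n edges.
    query_edges = [(j, i) for j in range(m) for i in range(n)]

    # Query node q_i keeps a fixed one-hot label vector p_i;
    # candidate labels start as zero vectors (identity unknown).
    query_labels = np.eye(n)
    candidate_labels = np.zeros((m, n))

    return candidate_edges, query_edges, query_labels, candidate_labels
```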
Third, relation-aware label propagation.
Relation-aware label propagation aims to aggregate, for every node in the graph, the label information of its neighbor nodes; it exploits the temporal information within a scene and lets relationship clues and visual features complement each other. This phase is divided into three parts: relationship detection, relation-aware node similarity calculation, and label propagation.
1. Relationship detection.
For any two candidate nodes g_k and g_j in a single relationship propagation graph, the social relationship between them is judged by fusing bullet-screen and image information. All bullet-screen comments and subtitles whose timestamps fall within the scene are collected to obtain a tf-idf text representation e_t; at the same time, RudeCarnie is adopted to extract the fc7 outputs of g_k and g_j as their gender and age representations e^g_k, e^a_k, e^g_j, e^a_j. These vectors are then mapped, concatenated, and passed through a calibrated support vector machine (calibrated SVM) to obtain the relationship prediction result r_kj and class probability p_kj:
(r_kj, p_kj) = SVM(e_t, e^g_k, e^a_k, e^g_j, e^a_j)
where e^g_k and e^g_j are the gender representations of the two candidate nodes g_k and g_j, and e^a_k and e^a_j are their age representations.
In the embodiment of the invention, the text information within a scene is fused into one document, and the tf-idf (term frequency-inverse document frequency) vector of this document is extracted as the text representation of the whole scene, so that different candidate node pairs share the same e_t when computing the relationship and class probability.
RudeCarnie is a prior-art model for extracting gender and age features; fc7 is the name of a specific layer in RudeCarnie, referring to the fully connected layer labeled 7.
In the embodiment of the invention, the relationship mainly refers to a social relationship, such as relatives, couples, friends, enemies or colleagues; each dimension of the class probability represents the probability that the two candidate nodes belong to the corresponding social relationship category, and the social relationship of the two candidate nodes is the category with the highest probability.
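A sketch of this relationship-detection step using scikit-learn is given below; it assumes that the tf-idf vectorizer and the calibrated SVM have already been fitted on labelled data, and that a separate (here unspecified) extractor supplies the RudeCarnie-style gender/age vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

vectorizer = TfidfVectorizer()                       # fitted beforehand on scene documents
relation_clf = CalibratedClassifierCV(LinearSVC())   # calibrated SVM, fitted beforehand

def scene_text_vector(texts):
    """Fuse all subtitles/bullet-screen comments of a scene into one
    document and take its tf-idf vector as the scene representation e_t."""
    doc = " ".join(texts)
    return vectorizer.transform([doc]).toarray()[0]

def predict_relation(e_t, age_gender_k, age_gender_j):
    """Concatenate the scene text vector with the two candidates'
    gender/age representations and predict (relation, class probabilities)."""
    x = np.concatenate([e_t, age_gender_k, age_gender_j])[None, :]
    probs = relation_clf.predict_proba(x)[0]             # p_kj: one probability per relation class
    relation = relation_clf.classes_[np.argmax(probs)]    # r_kj: most likely social relationship
    return relation, probs
```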
2. Relation-aware node similarity calculation.
In the embodiment of the invention, person relationships are integrated into the node similarity calculation, so that high-level semantic information corrects and complements the visual information. When computing node similarity, the social relationship serves as guidance in two respects: 1) persons with a specific social relationship should co-occur, or appear repeatedly, in the same scene; 2) the social relationship can act as a tie between persons and scenes, reducing the candidate set of persons that may appear in the current scene.
In the embodiment of the invention, the similarity is divided into a visual similarity part and a relationship-guided similarity part; the similarity between nodes is computed as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r · Sim_r(n_x, n_y)
where α_r is the weight of the relationship-guided similarity (it can be set to 1.2, for example); Sim_v denotes the visual similarity, computed from the visual features of the two nodes; Sim_r denotes the relationship-guided similarity, computed from the relationships and class probabilities among candidate nodes; n_x and n_y denote two nodes, where, when one node is a candidate node, the other is either a candidate node or a query node. That is, the similarity calculation covers two cases: similarity between two candidate nodes, and similarity between a candidate node and a query node.
In both cases, the visual similarity Sim_v is computed in the same way, i.e., by fusing the cosine similarity of the re-identification features e^id with the cosine similarity of the facial features e^face:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b · Cos(e^face_x, e^face_y)
where α_b is the feature weight (it may be set to 0.2), e^id_x and e^id_y denote the re-identification features of the two nodes, e^face_x and e^face_y denote their facial features, and Cos(·) denotes the cosine similarity between features; the re-identification feature and the facial feature together form the visual features of a node.
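The fused similarity above could be computed as in the following sketch, with α_r = 1.2 and α_b = 0.2 as suggested; the dictionary layout of a node and the handling of missing faces are assumptions, and the exact weighting follows the reconstruction above.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def visual_sim(node_x, node_y, alpha_b=0.2):
    """Sim_v: cosine similarity of re-id features plus weighted cosine
    similarity of facial features (skipped when a face is missing)."""
    sim = cos(node_x["reid"], node_y["reid"])
    if node_x.get("face") is not None and node_y.get("face") is not None:
        sim += alpha_b * cos(node_x["face"], node_y["face"])
    return sim

def node_sim(node_x, node_y, relation_sim, alpha_r=1.2):
    """Sim = Sim_v + alpha_r * Sim_r, with Sim_r supplied by the
    relationship-guided computation described below."""
    return visual_sim(node_x, node_y) + alpha_r * relation_sim
```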
The calculation of the relationship-guided similarity Sim_r is described in detail below. As mentioned above, there are two cases: the similarity between a query node q_i and a candidate node g_j, and the similarity between two candidate nodes g_s and g_j.
1) For a query node q_i and a candidate node g_j, the relationship-guided similarity is computed as follows:
Extract the relationships and class probabilities (obtained by the relationship detection part) between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relationship propagation graph:
Rel_j = {(r_jk, p_jk) = SVM(e_t, e^g_j, e^a_j, e^g_k, e^a_k) | g_k ∈ κ(g_j)}
where r_jk and p_jk denote the relationship prediction result and class probability between the candidate node g_j and its neighbor node g_k; SVM denotes the calibrated support vector machine; e_t denotes the text representation vector of the bullet-screen text in the scene where the candidate node is located; e^g_j, e^a_j and e^g_k, e^a_k denote the gender and age representation vectors of g_j and its neighbor g_k; and κ(g_j) denotes the set of neighbor candidate nodes of g_j.
According to the pre-labeled relation graph G_r between query nodes, find all query nodes that form, with the node q_i, one of the relationships appearing in Rel_j:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relationship of the query node pair (q_i, q_u) obtained by querying the relation graph G_r.
The relationship-guided similarity between the candidate node g_j and the query node q_i is then computed as:
Sim_r(q_i, g_j) = max{p_jk · Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij}.
2) For two candidate nodes g_s and g_j, the relationship-guided similarity is computed in a similar way, as follows:
Extract the relationship r_sj and class probability p_sj between the candidate nodes g_s and g_j; by querying the pre-labeled relation graph G_r between query nodes, obtain the set of query node pairs having the same relationship:
Q′_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relationship of the query node pair (q_t, q_l) obtained by querying the relation graph G_r.
From the query node pair set Q′_sj, select the query node pair (q_t′, q_l′) corresponding to the candidate nodes g_s and g_j.
The relationship-guided similarity between the candidate nodes g_s and g_j is then computed as:
Sim_r(g_s, g_j) = p_sj · Sim_v(q_t′, q_l′).
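A sketch of the query-to-candidate case of this relationship-guided similarity is given below; the data layout and helper names are assumptions, and visual_sim refers to the function from the earlier similarity sketch.

```python
def relation_guided_sim_qg(q_i, g_j, neighbors, relation_graph, query_nodes):
    """Sim_r(q_i, g_j): among g_j's neighbours g_k whose predicted relation
    r_jk also links q_i to some query node q_u in the pre-labelled relation
    graph, take the maximum of p_jk * Sim_v(q_u, g_k)."""
    best = 0.0
    for g_k, (r_jk, p_jk) in neighbors:          # Rel_j from relationship detection
        for q_u in query_nodes:
            # relation_graph: dict mapping (query_id, query_id) -> relationship label
            if relation_graph.get((q_i["id"], q_u["id"])) == r_jk:
                best = max(best, p_jk * visual_sim(q_u, g_k))
    return best
```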
3. Label propagation.
After the social-relationship-aware similarity has been computed, label propagation is performed within each scene; this exploits the temporal information within a scene and avoids the loss of semantic purity that mixing different scenes would cause. A relatively accurate person retrieval result is obtained by iteratively updating the category vector of each candidate node.
In the embodiment of the invention, the category vector of each candidate node is updated by combining a neighbor node characteristic aggregation mode; the following two updating modes are mainly provided, and one of the two updating modes can be selected:
the first update formula is:
Figure BDA0003121552720000101
Figure BDA0003121552720000102
wherein the content of the first and second substances,
Figure BDA0003121552720000103
represents a candidate node gjThe category vector at iteration t +1,
Figure BDA0003121552720000104
represents a candidate node gjSet of neighbor candidate nodes k (g)j) Middle candidate node gsCategory vectors at the t-th iteration; omegajsRepresenting a relationship weight calculated using a similarity between nodes;
in the second mode, for
Figure BDA0003121552720000105
And the value of the c dimension is updated by the maximum confidence coefficient to reduce the influence caused by the noise, and the formula is as follows:
Figure BDA0003121552720000106
Figure BDA0003121552720000107
wherein the content of the first and second substances,
Figure BDA0003121552720000108
represents a candidate node gjSet of neighbor candidate nodes k (g)j) Middle candidate node gsClass vector at the t-th iteration
Figure BDA0003121552720000109
Value of the c-th dimension.
Generally, after about 10 to 20 rounds of iteration, convergence is reached. The dimension with the largest value in each candidate node's category vector is finally selected as its person identity: each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of query nodes (target query persons), and the maximum probability indicates that the candidate node is the corresponding target query person. The image of the person region corresponding to the candidate node, together with its category vector after iteration, is used as the retrieval result for the corresponding target query person.
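An illustrative NumPy sketch of the first (weighted-aggregation) update mode follows; it assumes a precomputed similarity matrix between candidates and all nodes, and lets the fixed query-node labels take part in the aggregation so that identity information can flow into the candidates.

```python
import numpy as np

def propagate_labels(sim, query_labels, n_iter=20):
    """sim: (m, m+n) similarity matrix between the m candidates and all
    m+n nodes (candidates first, then the n query nodes);
    query_labels: (n, n) fixed one-hot label vectors of the query nodes."""
    m = sim.shape[0]
    n = query_labels.shape[0]
    labels = np.zeros((m, n))                            # candidate category vectors
    for _ in range(n_iter):
        all_labels = np.vstack([labels, query_labels])   # (m+n, n)
        weights = sim / (sim.sum(axis=1, keepdims=True) + 1e-8)
        labels = weights @ all_labels                    # weighted neighbour aggregation
    return labels.argmax(axis=1), labels                 # predicted identities + vectors
```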
According to the scheme of the embodiment of the invention, the existing data can be fully utilized to learn the character retrieval model with good effect (referred to as the scheme of the invention), and the target character in the specific complex video is searched on the basis, so that the social relationship is utilized to improve the effect of the pure visual character retrieval model.
Those skilled in the art will appreciate that the person identities in the training data are known; the training mainly involves the pedestrian re-identification network and the face recognition convolutional network (i.e., the two network models for extracting re-identification features and facial features), and the training adopts a conventional classification method, which is not repeated here.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for searching a complex video character with enhanced social relationship is characterized by comprising the following steps:
sampling a video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing character detection and scene segmentation on the video frame sequence, and establishing a character area set contained in each scene by combining a timestamp of a video to be retrieved; constructing a corresponding relation propagation graph for each character region set, wherein the character regions are used as nodes and are called candidate nodes, the candidate nodes are connected in a complete graph mode, a category vector is initialized for each candidate node, and each candidate node is connected with all query nodes in the relation graph corresponding to the given target query character set;
for each relation propagation graph, predicting the relation and the category probability between two candidate nodes by combining corresponding text information; calculating the similarity between candidate nodes and between the candidate nodes and the query node by using the visual features corresponding to the candidate nodes and the query node as well as the relation and the category probability between the candidate nodes, and updating the category vector of each candidate node by combining the feature aggregation mode of the neighbor nodes until convergence; each dimension in the category vector represents the probability of belonging to one target query figure, the dimension number is the number of the target query figures, the maximum value of the probability in the converged category vector represents that the candidate node is the corresponding target query figure, and the image of the figure region corresponding to the candidate node and the category vector after iteration of the candidate node are used as the retrieval result of the corresponding target query figure.
2. The method of claim 1, wherein the step of sampling the video to be retrieved to obtain a sequence of video frames and extracting corresponding text information from the video to be retrieved comprises:
for a video to be retrieved, obtaining a video frame sequence by adopting an equidistant sampling mode;
extracting corresponding text information from a video to be retrieved, and carrying out denoising and time axis correction processing; wherein the text information includes: bullet screen text information and subtitle text information.
3. The method as claimed in claim 1, wherein the step of performing character detection and scene segmentation on the video frame sequence, and establishing a character region set included in each scene by combining the time stamp of the video to be retrieved comprises:
detecting a person region for each frame in a video frame sequence, extracting visual features of each person region and recording a time stamp of each person region corresponding to a video to be retrieved;
based on visual style transformation, dividing a video frame sequence into a plurality of video frame segments, taking each video frame segment as a scene, and recording a start-stop timestamp of a video to be retrieved corresponding to each scene;
and establishing a character area set corresponding to each scene based on the starting and stopping time stamp of the video to be retrieved corresponding to each scene and the time stamp of the video to be retrieved corresponding to each character area.
4. The method of claim 1, wherein the visual features corresponding to the candidate nodes and the query nodes comprise two types of features: re-identification features and facial features;
the figure region image corresponding to the candidate node and the target query figure image corresponding to the query node are respectively input to a pedestrian re-identification network based on multi-scale convolution and a convolution network based on face identification, and the two types of features are extracted.
5. The method of claim 1, wherein the relationship and class probability between two candidate nodes are predicted from the corresponding text information by:
(r_kj, p_kj) = SVM(e_t, e^g_k, e^a_k, e^g_j, e^a_j)
where r_kj and p_kj denote the relationship prediction result and the class probability between the two candidate nodes g_k and g_j; SVM denotes the calibrated support vector machine; e_t denotes the text representation vector of the text information in the scene where the candidate nodes are located; e^g_k and e^g_j are the gender representations of the two candidate nodes g_k and g_j; and e^a_k and e^a_j are their age representations.
6. The method of claim 1, wherein the similarity between candidate nodes, and between a candidate node and a query node, is calculated as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r · Sim_r(n_x, n_y)
where n_x and n_y denote two nodes, one of which is a candidate node and the other is a candidate node or a query node; α_r denotes the weight of the relationship-guided similarity; Sim_v denotes the visual similarity, computed from the visual features of the two nodes; and Sim_r denotes the relationship-guided similarity, computed from the relationships and class probabilities among candidate nodes.
7. The method of claim 6, wherein the visual similarity is calculated as:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b · Cos(e^face_x, e^face_y)
where α_b denotes the feature weight; e^id_x and e^id_y denote the re-identification features of the two nodes; e^face_x and e^face_y denote the facial features of the two nodes; Cos(·) denotes the cosine similarity between features; and the re-identification features and the facial features serve as the visual features of the nodes.
8. The method of claim 6, wherein, if the two nodes are a candidate node and a query node, denoted g_j and q_i respectively, the step of calculating the relationship-guided similarity comprises:
extracting the relationships and class probabilities between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relationship propagation graph:
Rel_j = {(r_jk, p_jk) = SVM(e_t, e^g_j, e^a_j, e^g_k, e^a_k) | g_k ∈ κ(g_j)}
where r_jk and p_jk denote the relationship prediction result and class probability between the candidate node g_j and its neighbor node g_k; SVM denotes the calibrated support vector machine; e_t denotes the text representation vector of the bullet-screen text in the scene where the candidate node is located; e^g_j, e^a_j and e^g_k, e^a_k denote the gender and age representation vectors of g_j and g_k; and κ(g_j) denotes the set of neighbor candidate nodes of g_j;
according to the pre-labeled relation graph G_r between query nodes, finding all query nodes that form, with the node q_i, one of the relationships appearing in Rel_j:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relationship of the query node pair (q_i, q_u) obtained by querying the relation graph G_r;
calculating the relationship-guided similarity between the candidate node g_j and the query node q_i as:
Sim_r(q_i, g_j) = max{p_jk · Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij}.
9. The method as claimed in claim 6, wherein, if both nodes are candidate nodes, denoted g_s and g_j, the step of calculating the relationship-guided similarity comprises:
extracting the relationship r_sj and class probability p_sj between the candidate nodes g_s and g_j, and, by querying the pre-labeled relation graph G_r between query nodes, obtaining the set of query node pairs having the same relationship:
Q′_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relationship of the query node pair (q_t, q_l) obtained by querying the relation graph G_r;
from the query node pair set Q′_sj, selecting the query node pair (q_t′, q_l′) corresponding to the candidate nodes g_s and g_j;
calculating the relationship-guided similarity between the candidate nodes g_s and g_j as:
Sim_r(g_s, g_j) = p_sj · Sim_v(q_t′, q_l′).
10. The method for searching for a complex video character with enhanced social relationship as claimed in claim 6, wherein the category vector of each candidate node is updated in combination with neighbor node feature aggregation, using either of the following ways:
in the first way, the update formula is:
l^(t+1)_{g_j} = Σ_{g_s ∈ κ(g_j)} ω_js · l^(t)_{g_s},  with  ω_js = Sim(g_j, g_s) / Σ_{g_k ∈ κ(g_j)} Sim(g_j, g_k)
where l^(t+1)_{g_j} denotes the category vector of the candidate node g_j at iteration t+1, l^(t)_{g_s} denotes the category vector at iteration t of the candidate node g_s in the neighbor candidate node set κ(g_j), and ω_js denotes the relationship weight calculated from the similarity between the nodes;
in the second way, the value of the c-th dimension of l^(t+1)_{g_j} is updated with the maximum confidence:
l^(t+1)_{g_j}[c] = max{ω_js · l^(t)_{g_s}[c] | g_s ∈ κ(g_j)}
where l^(t)_{g_s}[c] denotes the value of the c-th dimension of the category vector l^(t)_{g_s}.
CN202110677925.XA 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship Active CN113343029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Publications (2)

Publication Number Publication Date
CN113343029A true CN113343029A (en) 2021-09-03
CN113343029B CN113343029B (en) 2024-04-02

Family

ID=77477338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677925.XA Active CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Country Status (1)

Country Link
CN (1) CN113343029B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676776A (en) * 2021-09-22 2021-11-19 维沃移动通信有限公司 Video playing method and device and electronic equipment
CN117201873A (en) * 2023-11-07 2023-12-08 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140193048A1 (en) * 2011-09-27 2014-07-10 Tong Zhang Retrieving Visual Media
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140193048A1 (en) * 2011-09-27 2014-07-10 Tong Zhang Retrieving Visual Media
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qi Jinwei; Peng Yuxin; Yuan Yuxin: "Hierarchical recurrent attention network model for cross-media retrieval" (面向跨媒体检索的层级循环注意力网络模型), Journal of Image and Graphics (中国图象图形学报), no. 11

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676776A (en) * 2021-09-22 2021-11-19 维沃移动通信有限公司 Video playing method and device and electronic equipment
CN113676776B (en) * 2021-09-22 2023-12-26 维沃移动通信有限公司 Video playing method and device and electronic equipment
CN117201873A (en) * 2023-11-07 2023-12-08 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image
CN117201873B (en) * 2023-11-07 2024-01-02 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image

Also Published As

Publication number Publication date
CN113343029B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US9176987B1 (en) Automatic face annotation method and system
Lei et al. Detecting moments and highlights in videos via natural language queries
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
CN114342353B (en) Method and system for video segmentation
US9253511B2 (en) Systems and methods for performing multi-modal video datastream segmentation
US7860347B2 (en) Image-based face search
US9589205B2 (en) Systems and methods for identifying a user's demographic characteristics based on the user's social media photographs
CN113158023B (en) Public digital life accurate classification service method based on mixed recommendation algorithm
US20110243529A1 (en) Electronic apparatus, content recommendation method, and program therefor
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
Sreeja et al. Towards genre-specific frameworks for video summarisation: A survey
US20150019206A1 (en) Metadata extraction of non-transcribed video and audio streams
CN113343029B (en) Complex video character retrieval method with enhanced social relationship
Jou et al. Structured exploration of who, what, when, and where in heterogeneous multimedia news sources
Lv et al. Storyrolenet: Social network construction of role relationship in video
Chen et al. Name-face association in web videos: A large-scale dataset, baselines, and open issues
Acar et al. Breaking down violence detection: Combining divide-et-impera and coarse-to-fine strategies
Baghel et al. Image conditioned keyframe-based video summarization using object detection
Narwal et al. A comprehensive survey and mathematical insights towards video summarization
Li et al. Social context-aware person search in videos via multi-modal cues
Röthlingshöfer et al. Self-supervised face-grouping on graphs
CN110287376A (en) A method of the important vidclip of extraction based on drama and caption analysis
Dai et al. Two-stage model for social relationship understanding from videos
Kaushal et al. Demystifying multi-faceted video summarization: tradeoff between diversity, representation, coverage and importance
Qu et al. Semantic movie summarization based on string of IE-RoleNets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant