CN113343029B - Complex video character retrieval method with enhanced social relationship - Google Patents

Complex video character retrieval method with enhanced social relationship

Info

Publication number
CN113343029B
CN113343029B
Authority
CN
China
Prior art keywords
nodes
candidate
node
query
video
Prior art date
Legal status
Active
Application number
CN202110677925.XA
Other languages
Chinese (zh)
Other versions
CN113343029A (en)
Inventor
徐童
陈恩红
李丹
周培伦
何伟栋
郝艳宾
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110677925.XA priority Critical patent/CN113343029B/en
Publication of CN113343029A publication Critical patent/CN113343029A/en
Application granted granted Critical
Publication of CN113343029B publication Critical patent/CN113343029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention discloses a social-relationship-enhanced person retrieval method for complex videos. On one hand, with social relationships as the tie, semantic information related to the target person can be fully mined and exploited, so that all segments in which the target person appears can be retrieved accurately, providing a foundation for other related applications; on the other hand, the method yields a stronger person retrieval model for complex videos, achieving better results on subjective and objective metrics such as precision, recall and fluency.

Description

Complex video character retrieval method with enhanced social relationship
Technical Field
The invention relates to the technical field of computer vision, and in particular to a social-relationship-enhanced person retrieval method for complex videos.
Background
Person retrieval in complex videos is an important problem in video analysis: it aims to extract from a complete video all segments in which a specific person appears. With the rise of numerous emerging video media platforms, its real-world application scenarios are becoming ever wider. Some movie fans, or viewers fond of a particular character, may wish to make person-oriented summaries, such as clips of a particular star's appearances in a given film or TV work. Using computer technology to automatically and effectively extract and understand the information carried by video content, and thereby retrieve the target person, helps people grasp video content quickly and accurately and has clear application value. Mainstream video media platforms already offer many intelligent video analysis functions that let users query and understand video content more conveniently. For example, the "watch only him/her" feature offered by Youku and iQiyi can automatically generate clips of specific video characters according to user preference, condensing massive video data and extracting its key information.
However, conventional person retrieval methods rely on visual features alone and rarely exploit the rich high-level semantic information in videos formed jointly by images and text. In fact, besides the visual features of the video frames, a video also contains a large amount of heterogeneous text, such as subtitles and bullet-screen comments (danmaku); together, the visual and textual information can reveal the scene context of the current segment and provide reliable high-level semantic clues when the visual information is of low quality. If these high-level semantic clues can be described formally and used to assist person retrieval in complex videos, the task of person-oriented video summarization can be accomplished better. More importantly, when such semantic information is combined with the persons' social relationships, the resulting social context offers a crucial clue for person retrieval. For example, if the current scene can be determined to be school-related from the "classmate" relationship between two persons appearing in the current frame, then other persons present are very likely to have a teacher-student relationship with them, which narrows the candidate set of persons. It follows that social-clue enhancement holds great potential for the person retrieval task.
Disclosure of Invention
The invention aims to provide a social-relationship-enhanced person retrieval method for complex videos, which abstracts the social relationships in the video context by combining visual information with multi-source textual information, and provides reliable semantic clues for person retrieval in complex scenes, thereby improving the accuracy of the retrieval results.
The object of the invention is achieved through the following technical solution:
A social-relationship-enhanced person retrieval method for complex videos comprises the following steps:
sampling the video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to a given set of target query persons;
for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighboring nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person.
According to the technical solution provided by the invention, on the one hand, with social relationships as the tie, semantic information related to the target person can be fully mined and exploited, so that all segments in which the target person appears can be retrieved accurately, providing a foundation for other related applications; on the other hand, the method yields a stronger person retrieval model for complex videos, achieving better results on subjective and objective metrics such as precision, recall and fluency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a model framework diagram of the social-relationship-enhanced complex-video person retrieval method provided by an embodiment of the invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
The embodiment of the invention provides a social-relationship-enhanced person retrieval method for complex videos, which mainly comprises the following steps:
step 1, sampling a video to be searched to obtain a video frame sequence, and extracting corresponding text information from the video to be searched.
In the step, the complete video to be searched can be sampled at equal intervals to obtain a sequence consisting of video frames; then, extracting corresponding text information from the video to be retrieved, and carrying out denoising and time axis correction processing; wherein, the text information includes: bullet screen text information and subtitle text information.
Step 2, performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to the given set of target query persons.
The method mainly comprises three parts:
the first part is the detection of human beings: and detecting the person region for each frame in the video frame sequence, extracting the visual characteristics of each person region and recording the time stamp of each person region corresponding to the video to be searched.
The second part is scene segmentation: based on changes of visual style, the video frame sequence is divided into several segments, each segment is taken as one scene, and the start and end timestamps of each scene in the video to be retrieved are recorded.
The third part is building the relation propagation graph: first, the set of person regions of each scene is established from the start and end timestamps of the scene and the timestamps of the person regions; the set of a scene contains all person regions whose timestamps fall within the start-to-end range of that scene. Then, a relation propagation graph is built for each scene from its set of person regions: the nodes of the graph are the person regions, called candidate nodes, the edges carry the similarity between candidate nodes, and all candidate nodes are connected with each other to form a complete graph. In addition, each candidate node maintains an iteratively updated category vector identifying the person category it belongs to. At the same time, edges pointing to every query node are additionally generated for each candidate node.
Step 3, for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighbor nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person.
In this step, relation-aware label propagation is performed scene by scene, iteratively computing the category vector of every node in the relation propagation graph and finally obtaining the person category of each node (i.e., each person region). Relation-aware label propagation involves the following parts: first, the person relations between different nodes in the current scene are detected; then, following the idea of finding the maximum social-relationship network matching, the similarity between nodes is computed based on the detected relations; finally, the category vectors of neighboring nodes are aggregated according to the similarity between nodes. Finding the maximum social-relationship network follows two principles:
1) Different people with a specific social relationship often co-occur or appear alternately within a short time, so one person can be used to assist in identifying the others.
2) The characters in a video are related to the scene, and social relationships can serve as links between the scene and the characters, reducing the candidate set of characters that may appear in the current scene.
During relation-aware label propagation, label information from the pre-labeled relation graph between the query nodes is introduced; finally, all scene segments in which the persons corresponding to the query nodes appear are located among the scenes of the video to be retrieved.
FIG. 1 shows the model framework of the above method. The upper-left corner is the input data, mainly the video frame sequence (Video Frames) and the text information (Textual Documents) obtained in step 1, together with the given set of target query persons and the corresponding relation graph (Relation Graph); each target query person in the set corresponds to a clear frontal person-region image, and the social relations between the target query persons are annotated in the relation graph. The top and upper-right parts correspond to step 2: after person detection and scene segmentation, person regions (Detected Regions) and scene segments (Scene Segments) are obtained, yielding the set of person regions contained in each scene (scene units 1-n in FIG. 1), and the related text information can be located directly by timestamp. A relation propagation graph (Graph for Scene Unit) is then built from the set of person regions in each scene, and each candidate node in the graph generates an edge pointing to each query node (Query Node). Finally, the relation-aware label propagation (Relation-aware Label Propagation) introduced in step 3 produces the retrieval result, i.e., the segments of each target query person across all scenes of the video to be retrieved, shown in the lower-left corner. It should be noted that the character images and text shown in FIG. 1 are for illustration only and not limitation; the character images come from existing public datasets, and their use for scientific research does not involve personal privacy.
For ease of understanding, preferred embodiments of the steps described above are described below.
1. Data preprocessing.
In the embodiment of the invention, the data to be processed mainly includes video data, text data, and data related to the target query persons.
For video data, a sequence of video frames can be obtained by equidistant sampling, for example at a rate of 0.5 frames per second; the specific sampling rate is set by trading off performance against efficiency.
For text data, the corresponding text information is extracted from the video to be retrieved and then subjected to denoising and time-axis correction; the text information here mainly covers subtitle text and bullet-screen text.
In particular, for the highly noisy bullet-screen text, irrelevant text is filtered out: regular-expression rules can be used to remove purely symbolic characters and the like, while the sending time of each bullet comment is corrected according to typing speed (typically about 30 words per minute, set according to the actual data). This yields the video frame sequence and the corrected text information, i.e., the preprocessed data.
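For illustration, a minimal Python sketch of this preprocessing step follows; the sampling rate, typing-speed constant and regular expression are example values assumed here rather than prescribed by the method, and OpenCV is used only as one possible frame-sampling backend.

```python
import re
import cv2  # OpenCV, used here only as one possible frame-sampling backend

def sample_frames(video_path, sample_rate=0.5):
    """Sample the video at roughly `sample_rate` frames per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / sample_rate)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx / fps, frame))  # (timestamp in seconds, frame)
        idx += 1
    cap.release()
    return frames

TYPING_SPEED = 30 / 60.0  # assumed ~30 words per minute, expressed per second

def clean_danmaku(danmaku):
    """Drop symbol-only bullet comments and shift send times back by the typing time."""
    cleaned = []
    for ts, text in danmaku:  # (send time in seconds, raw comment text)
        text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)  # crude filter for symbols/emoticons
        if not text:
            continue
        corrected_ts = max(ts - len(text) / TYPING_SPEED, 0.0)
        cleaned.append((corrected_ts, text))
    return cleaned
```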
For the data related to the target query persons, the corresponding set of target query persons is generated from the designated targets; a clear frontal person-region image is selected for each target query person, and the social relations between the target query persons are annotated manually.
2. Building the relation propagation graph.
As described previously, this step includes three parts: person detection, scene segmentation, and graph creation. The preferred embodiments of the various parts are as follows:
1. Person detection.
In the embodiment of the invention, taking the time of the video frame containing the person region as the reference (0 s), a time window of a certain range is used to select the related text information. Empirically, the time window for bullet-screen text is [-10 s, 15 s] around the current frame, and the window for subtitle text is [-45 s, 45 s]; the specific window lengths can be adjusted as needed. These person regions and the selected related text form the input pairs for person detection.
For person detection, all person regions appearing in the video frame sequence can be located frame by frame, without distinction, using a Faster R-CNN-based detector. The person regions output by Faster R-CNN are then fed into two channels to obtain visual features at different levels. The first channel is a multi-scale-convolution-based pedestrian re-identification network (e.g., Cross-Level Semantic Alignment), through which a person region obtains an overall visual feature description; this can be understood as a whole-body feature covering clothing, posture, stature and other dimensions, and is referred to as the re-identification feature. The second channel is a face-recognition convolutional network (e.g., FaceNet), which extracts facial features from the input person region if a face is present. The features of the two channels are visual feature descriptions of a person region at two different levels and are collectively referred to as its visual features.
For example, Faster R-CNN can be initialized with a VGG-16 network; a simple classifier (person or not) is then built on top of it and retrained on an image dataset containing only persons, in the hope of more accurate detection. The Cross-Level Semantic Alignment network can be initialized with a ResNet-50 backbone, and its 1024-dimensional re-identification feature output is taken. For facial features, MTCNN can be used for face localization, and 1024-dimensional facial features are obtained through FaceNet.
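As a concrete illustration, the following Python sketch uses the torchvision implementation of Faster R-CNN to locate person regions (whereas the text above retrains a person-only classifier, which is omitted here); the re-identification and face networks are left as placeholder callables (reid_net, face_net) standing in for Cross-Level Semantic Alignment and MTCNN + FaceNet, since their exact interfaces depend on the chosen implementation.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf Faster R-CNN; the method above retrains a person-only classifier instead.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
PERSON_LABEL = 1  # "person" class index in the COCO label map

@torch.no_grad()
def detect_person_regions(frame_rgb, score_thresh=0.8):
    """Return cropped person regions found in one RGB frame (H x W x 3, uint8)."""
    out = detector([to_tensor(frame_rgb)])[0]
    regions = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if label == PERSON_LABEL and score >= score_thresh:
            x1, y1, x2, y2 = [int(v) for v in box]
            regions.append(frame_rgb[y1:y2, x1:x2])
    return regions

def extract_visual_features(region, reid_net, face_net):
    """Two feature channels per region: whole-body re-ID embedding and facial embedding."""
    e_id = reid_net(region)    # stands in for the Cross-Level Semantic Alignment network
    e_face = face_net(region)  # stands in for MTCNN + FaceNet; may return None if no face is found
    return e_id, e_face
```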
2. Scene segmentation.
PySceneDetect can be chosen as the scene splitter; it divides the preprocessed video frame sequence into different segments according to visual style, so that frames within the same segment are visually similar (e.g., in background color and image elements).
Although the scenes in a complex video vary greatly, so that different appearances of the same person can differ widely in visual appearance, once a series of scenes is obtained by splitting at visual boundaries, the visual features of the same person within one scene remain relatively stable, which makes better use of the temporal information in the complex video. At the same time, splitting into scenes separates the social-relation semantics of different scenes, improving the purity of the semantic information. Finally, the start and end timestamps of each scene output by the scene segmentation are recorded.
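A minimal sketch of this step with PySceneDetect follows (the v0.6+ API is assumed, and the threshold value is an example rather than a prescribed setting):

```python
from scenedetect import detect, ContentDetector

def segment_scenes(video_path, threshold=27.0):
    """Split the video at visual-style changes; return (start_sec, end_sec) per scene."""
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]
```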
3. Graph construction.
Combining the results of the previous two parts, all person regions are organized, according to the start and end times of the scenes, into sets of person regions for the different scenes; the set for a scene contains all person regions whose timestamps fall within the start-to-end range of that scene.
A relation propagation graph is built for the person regions in each scene: the nodes of the graph are person regions, the edges are the feature similarities between nodes, and the features of a node include its re-identification vector and face vector. At the same time, each node maintains an iteratively updated category vector identifying the person category it belongs to; the nodes in the relation propagation graph are referred to as candidate nodes.
Denote the set of target query persons as Q = {q_1, q_2, ..., q_n}, where q_i is the person region of the i-th query person; its re-identification feature and facial feature are obtained in the way introduced in the person-detection part, each q_i is treated as a query node, and n is the number of query nodes in the target query person set. Denote the set of candidate nodes in a single relation propagation graph as G = {g_1, g_2, ..., g_m}, where m is the number of candidate nodes in that graph. Edges are generated as follows: all candidate nodes in G are connected with each other as a complete graph, and for each candidate node g_j, n additional edges pointing to q_1, q_2, ..., q_n are generated; this finally yields m(m-1)/2 + m*n edges and m + n nodes.
At the same time, a category vector (label vector) is initialized for each node to identify the person identity of its person region. For a query node q_i ∈ Q, its label vector is initialized to p_i (kept fixed and never updated), an n-dimensional one-hot vector whose i-th dimension is 1; for all candidate nodes, the category vectors are initialized to the zero vector because their identities are unknown. In the description that follows, q denotes a query node, g denotes a candidate node, and the subscripts of q and g distinguish different query or candidate nodes.
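The graph construction can be illustrated with the following sketch; the node naming and the data structures (plain Python dictionaries and edge lists) are assumptions made for the example, not part of the method itself.

```python
import numpy as np
from itertools import combinations

def build_propagation_graph(num_candidates, num_queries):
    """Build one relation propagation graph for a single scene.

    num_candidates: m person regions detected inside the scene
    num_queries:    n target query persons
    Returns node names, the edge list and the initial class (label) vectors.
    """
    m, n = num_candidates, num_queries
    candidates = [f"g{j}" for j in range(m)]
    queries = [f"q{i}" for i in range(n)]

    edges = list(combinations(candidates, 2))                # complete graph over candidates
    edges += [(g, q) for g in candidates for q in queries]   # every candidate -> every query node
    # total number of edges: m*(m-1)/2 + m*n

    class_vec = {g: np.zeros(n) for g in candidates}         # unknown identity -> zero vector
    for i, q in enumerate(queries):
        class_vec[q] = np.eye(n)[i]                          # fixed one-hot label for query nodes
    return candidates, queries, edges, class_vec
```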
3. Relation-aware label propagation.
Relation-aware label propagation aims to aggregate, for each node in the graph, the label information of its neighbor nodes, thereby exploiting the temporal locality of a scene and letting relation clues and visual features complement each other. This stage is also divided into three parts: relation detection, relation-aware node similarity computation, and label propagation.
1. Relation detection.
For a single relation propagation graph, the social relation between any two candidate nodes g_k and g_j is judged by fusing the bullet-screen and image information. All time-stamped bullet comments and subtitles within the scene are gathered into one document, and its tf-idf vector is taken as the text representation e_t; meanwhile, RudeCarnie is used to extract the gender and age features of g_k and g_j, taking its fc7 outputs as the gender representation and age representation. These vectors are then mapped, concatenated, and fed into a calibrated support vector machine (calibrated SVM) to obtain the relation prediction result r_kj and the class probability p_kj:
(r_kj, p_kj) = SVM([e_t; e^g_k; e^g_j; e^a_k; e^a_j])
where e^g_k and e^g_j are the gender representations of the two candidate nodes g_k and g_j, and e^a_k and e^a_j are their age representations.
In the embodiment of the invention, the text information within a scene is merged into one document, and its tf-idf (term frequency-inverse document frequency) vector is extracted as the text representation of the whole scene, so that different candidate nodes share the same e_t when computing the relations and class probabilities.
RudeCarnie is an existing model for extracting gender and age features; fc7 is the name of a particular layer in RudeCarnie, namely the fully connected layer with index 7.
In the embodiment of the invention, the relations are mainly social relations, such as kinship, couple, friend, enemy, colleague and so on; each dimension of the class probability represents the probability that the two candidate nodes belong to the corresponding social-relation category, and the social relation of the two candidate nodes is the category with the highest probability.
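The following sketch illustrates one way the relation detector could be realized with scikit-learn, using a tf-idf text vector plus gender and age representations fed to a probability-calibrated SVM; the relation label set, the feature ordering and the classifier settings are assumptions for illustration, and relation_clf would first have to be fitted on labeled person-pair samples.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# A probability-calibrated linear SVM plays the role of the calibrated SVM;
# it must first be fitted on labeled person-pair samples (not shown here).
relation_clf = CalibratedClassifierCV(SVC(kernel="linear"), cv=3)

def scene_text_vector(texts, vectorizer: TfidfVectorizer):
    """Merge all subtitles/bullet comments of a scene into one document; return its tf-idf vector."""
    return vectorizer.transform([" ".join(texts)]).toarray()[0]  # vectorizer assumed already fitted

def predict_relation(e_t, gender_k, gender_j, age_k, age_j):
    """Concatenate scene text, gender and age representations; return (relation, class probabilities)."""
    x = np.concatenate([e_t, gender_k, gender_j, age_k, age_j]).reshape(1, -1)
    probs = relation_clf.predict_proba(x)[0]
    return relation_clf.classes_[int(np.argmax(probs))], probs
```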
2. Relation-aware node similarity computation.
In the embodiment of the invention, the person relations are integrated into the node similarity computation, so that high-level semantic information corrects and complements the visual information. When computing node similarity, social relations provide guidance in two respects: 1) people with a particular social relationship should co-occur or appear repeatedly in the same scene; 2) social relationships can act as a tie between people and the scene, reducing the candidate set of people that may appear in the current scene.
In the embodiment of the invention, the similarity is divided into a visual similarity part and a relation-guided similarity part, and the similarity between nodes is computed as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r * Sim_r(n_x, n_y)
where α_r is the weight of the relation-guided similarity (e.g., it can be set to 1.2); Sim_v is the visual similarity, computed from the visual features of the two nodes; Sim_r is the relation-guided similarity, computed from the relations and class probabilities between candidate nodes; n_x and n_y are two nodes, where at least one is a candidate node and the other is either a candidate node or a query node. That is, the similarity computation covers two cases: similarity between two candidate nodes, and similarity between a candidate node and a query node.
In both cases the visual similarity Sim_v is computed in the same way, namely by fusing the cosine similarity of the re-identification features e^id with the cosine similarity of the facial features e^face:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b * Cos(e^face_x, e^face_y)
where α_b is the weight of the facial term (it can be set to 0.2); e^id_x and e^id_y are the re-identification features of the two nodes, e^face_x and e^face_y are their facial features, and Cos(·) denotes the cosine similarity between features; the re-identification feature together with the facial feature constitutes the visual feature of a node.
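A minimal numeric sketch of the fused similarity follows; the dictionary-based node representation and the handling of missing faces are assumptions for the example, while the weight values mirror the example settings above.

```python
import numpy as np

ALPHA_R = 1.2  # example weight of the relation-guided term
ALPHA_B = 0.2  # example weight of the facial term inside the visual similarity

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sim_visual(node_x, node_y):
    """Fuse re-identification and facial cosine similarities; skip the face term if a face is missing."""
    s = cos(node_x["e_id"], node_y["e_id"])
    if node_x.get("e_face") is not None and node_y.get("e_face") is not None:
        s += ALPHA_B * cos(node_x["e_face"], node_y["e_face"])
    return s

def sim_total(node_x, node_y, sim_relation):
    """Overall similarity = visual similarity + alpha_r * relation-guided similarity."""
    return sim_visual(node_x, node_y) + ALPHA_R * sim_relation
```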
The relation-guided similarity Sim_r is described in detail below. As noted, there are two cases: similarity between a query node q_i and a candidate node g_j, and similarity between two candidate nodes g_s and g_j.
1) For a query node q_i and a candidate node g_j, the relation-guided similarity is computed as follows:
First, the relations and class probabilities between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relation propagation graph are extracted (i.e., the relation-detection part above):
Rel_j = {(r_jk, p_jk) = SVM([e_t; e^g_j; e^g_k; e^a_j; e^a_k]) | g_k ∈ κ(g_j)}
where r_jk and p_jk are the relation prediction result and class probability between candidate node g_j and its neighbor g_k; SVM denotes the calibrated support vector machine; e_t is the text representation vector of the bullet-screen text within the scene containing the candidate nodes; e^g_j, e^g_k and e^a_j, e^a_k are the gender and age representation vectors of g_j and g_k; κ(g_j) denotes the set of neighbor candidate nodes of g_j.
Then, according to the pre-labeled relation graph G_r between the query nodes, all query nodes having, with q_i, one of the relations in Rel_j are found, forming the set:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relation between the query node pair (q_i, q_u) obtained by looking it up in the relation graph G_r.
The relation-guided similarity between candidate node g_j and query node q_i is then computed as:
Sim_r(q_i, g_j) = max{ p_jk * Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij }.
2) For two candidate nodes g_s and g_j, the relation-guided similarity is computed in a similar way:
First, the relation r_sj and class probability p_sj between g_s and g_j are extracted, and the pre-labeled relation graph G_r between the query nodes is queried to obtain the set of query-node pairs having the same relation:
Q'_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relation between the query node pair (q_t, q_l) obtained by looking it up in the relation graph G_r.
Then, using the query-node pair set Q'_sj, the pair (q_t', q_l') ∈ Q'_sj whose two members are most visually similar to g_s and g_j, respectively, is selected among the query nodes, and the relation-guided similarity between the candidate nodes g_s and g_j is computed as:
Sim_r(g_s, g_j) = p_sj * Sim_v(q_t', q_l').
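As an illustration of the first case (query node versus candidate node), the following sketch computes Sim_r with plain Python data structures; the representation of Rel_j and of the labeled relation graph G_r as dictionaries is an assumption made for the example.

```python
def sim_relation_query(q_i, g_j, rel_j, relation_graph, sim_visual):
    """Relation-guided similarity Sim_r between a query node q_i and a candidate node g_j.

    rel_j:          list of (g_k, r_jk, p_jk) for every neighbour g_k of g_j in the scene graph
    relation_graph: dict of dicts, relation_graph[q_i][q_u] = labeled relation between query persons
    sim_visual:     callable returning the visual similarity Sim_v of two nodes
    """
    best = 0.0
    for g_k, r_jk, p_jk in rel_j:
        # query nodes q_u that share relation r_jk with q_i mirror the (g_j, g_k) pair
        for q_u, rel in relation_graph.get(q_i, {}).items():
            if rel == r_jk:
                best = max(best, p_jk * sim_visual(q_u, g_k))
    return best
```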
3. Label propagation.
After the social-relation-aware similarities are obtained, label propagation within the same scene makes it possible to exploit the temporal information of that scene while avoiding the loss of semantic purity that mixing different scenes would cause, and a more accurate person retrieval result is obtained by iteratively updating the category vector of each candidate node.
In the embodiment of the invention, the category vector of each candidate node is updated by aggregating the features of its neighbor nodes; the following two update schemes are provided, either of which may be used:
In the first scheme, the update formula is:
y_j^(t+1) = ( Σ_{g_s ∈ κ(g_j)} ω_js * y_s^(t) ) / ( Σ_{g_s ∈ κ(g_j)} ω_js )
where y_j^(t+1) is the category vector of candidate node g_j at iteration round t+1, y_s^(t) is the category vector at round t of a candidate node g_s in the neighbor set κ(g_j) of g_j, and ω_js is the relationship weight computed from the similarity between the nodes.
In the second scheme, the value of each dimension c of y_j^(t+1) is updated by taking the maximum confidence, which reduces the influence of noise:
y_{j,c}^(t+1) = max_{g_s ∈ κ(g_j)} { ω_js * y_{s,c}^(t) }
where y_{s,c}^(t) is the value of dimension c of the category vector, at round t, of a candidate node g_s in the neighbor set κ(g_j) of g_j.
In general, the iteration lasts about 10-20 rounds, and the dimension with the largest value in the category vector of each candidate node is finally taken as its person identity. Each dimension of the category vector represents the probability of belonging to one target query person, and the number of dimensions equals the number of query nodes (target query persons); that is, the largest probability indicates that the candidate node is that target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of the corresponding target query person.
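The iterative update can be sketched as follows; the normalized weighted-average and per-dimension maximum rules below are one plausible reading of the two schemes above, and the dictionary-based graph representation is assumed for illustration.

```python
import numpy as np

def propagate_labels(class_vec, neighbors, weight, n_iters=15, mode="average"):
    """Iteratively update candidate class vectors from their neighbours.

    class_vec: {node: np.ndarray of length n}; query nodes keep fixed one-hot vectors
    neighbors: {candidate: list of neighbouring nodes (other candidates and all query nodes)}
    weight:    {(node_a, node_b): relationship weight derived from the similarity Sim}
    """
    for _ in range(n_iters):
        updated = {}
        for g, neigh in neighbors.items():
            msgs = np.stack([weight[(g, s)] * class_vec[s] for s in neigh])
            if mode == "average":  # first scheme: normalized weighted mean of neighbour labels
                total_w = sum(weight[(g, s)] for s in neigh) + 1e-8
                updated[g] = msgs.sum(axis=0) / total_w
            else:                  # second scheme: per-dimension maximum-confidence vote
                updated[g] = msgs.max(axis=0)
        class_vec.update(updated)  # synchronous update after each full round
    # final identity of each candidate = arg max of its converged class vector
    return {g: int(np.argmax(v)) for g, v in class_vec.items() if g in neighbors}
```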
According to the scheme provided by the embodiment of the invention (i.e., the scheme described above), a well-performing person retrieval model can be learned by making full use of existing data, and target persons in a specific complex video are then retrieved on this basis, so that social relationships are used to improve upon purely visual person retrieval models.
As will be appreciated by those skilled in the art, the identities of the persons in the training data are known; the training process adopts a conventional classification method and is not repeated here.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. A social-relationship-enhanced complex video person retrieval method, comprising:
sampling the video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to a given set of target query persons;
for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighboring nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person;
wherein the relation and class probability between two candidate nodes are predicted from the corresponding text information by:
(r_kj, p_kj) = SVM([e_t; e^g_k; e^g_j; e^a_k; e^a_j])
where r_kj and p_kj are the relation prediction result and class probability for the two candidate nodes g_k and g_j; SVM denotes a calibrated support vector machine; e_t is the text representation vector of the text information within the scene containing the candidate nodes; e^g_k and e^g_j are the gender representations of g_k and g_j; e^a_k and e^a_j are their age representations;
the similarity between candidate nodes, and between a candidate node and a query node, is computed as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r * Sim_r(n_x, n_y)
where n_x and n_y are two nodes, of which at least one is a candidate node and the other is a candidate node or a query node; α_r is the weight of the relation-guided similarity; Sim_v is the visual similarity, computed from the visual features of the two nodes; Sim_r is the relation-guided similarity, computed from the relations and class probabilities between candidate nodes;
the visual similarity is computed as:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b * Cos(e^face_x, e^face_y)
where α_b is a weight; e^id_x and e^id_y are the re-identification features of the two nodes; e^face_x and e^face_y are their facial features; Cos(·) denotes the cosine similarity between features; the re-identification feature and the facial feature serve as the visual features of a node;
if the two nodes are a candidate node and a query node, denoted candidate node g_j and query node q_i, the relation-guided similarity is computed as follows:
the relations and class probabilities between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relation propagation graph are extracted:
Rel_j = {(r_jk, p_jk) = SVM([e_t; e^g_j; e^g_k; e^a_j; e^a_k]) | g_k ∈ κ(g_j)}
where r_jk and p_jk are the relation prediction result and class probability between candidate node g_j and its neighbor g_k; SVM denotes the calibrated support vector machine; e_t is the text representation vector of the bullet-screen text within the scene containing the candidate nodes; e^g_j, e^g_k and e^a_j, e^a_k are the gender and age representation vectors of g_j and g_k; κ(g_j) denotes the set of neighbor candidate nodes of g_j;
according to the pre-labeled relation graph G_r between the query nodes, all query nodes having, with q_i, one of the relations in Rel_j are found, forming the set:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relation between the query node pair (q_i, q_u) obtained by looking it up in the relation graph G_r;
the relation-guided similarity between candidate node g_j and query node q_i is computed as:
Sim_r(q_i, g_j) = max{ p_jk * Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij };
if both nodes are candidate nodes, denoted g_s and g_j, the relation-guided similarity is computed as follows:
the relation r_sj and class probability p_sj between g_s and g_j are extracted, and the pre-labeled relation graph G_r between the query nodes is queried to obtain the set of query-node pairs having the same relation:
Q'_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relation between the query node pair (q_t, q_l) obtained by looking it up in the relation graph G_r;
using the query-node pair set Q'_sj, the pair (q_t', q_l') ∈ Q'_sj whose two members are most visually similar to g_s and g_j, respectively, is selected among the query nodes, and the relation-guided similarity between the candidate nodes g_s and g_j is computed as:
Sim_r(g_s, g_j) = p_sj * Sim_v(q_t', q_l').
2. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein sampling the video to be retrieved to obtain a video frame sequence and extracting the corresponding text information from the video to be retrieved comprise:
for the video to be retrieved, obtaining the video frame sequence by equidistant sampling;
extracting the corresponding text information from the video to be retrieved and subjecting it to denoising and time-axis correction, wherein the text information includes bullet-screen text and subtitle text.
3. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein performing person detection and scene segmentation on the video frame sequence and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved comprise:
detecting person regions in each frame of the video frame sequence, extracting the visual features of each person region, and recording the timestamp of each person region in the video to be retrieved;
dividing the video frame sequence into a plurality of segments based on changes of visual style, taking each segment as one scene, and recording the start and end timestamps of each scene in the video to be retrieved;
establishing the set of person regions for each scene based on the start and end timestamps of the scene and the timestamps of the person regions.
4. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein the visual features corresponding to the candidate nodes and the query nodes comprise two types of features: re-identification features and facial features;
the person-region images corresponding to the candidate nodes and the target-query-person images corresponding to the query nodes are respectively input into a multi-scale-convolution-based pedestrian re-identification network and a face-recognition-based convolutional network to extract the two types of features.
5. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein the category vector of each candidate node is updated by aggregating the features of its neighbor nodes in either of the following ways:
in the first way, the update formula is:
y_j^(t+1) = ( Σ_{g_s ∈ κ(g_j)} ω_js * y_s^(t) ) / ( Σ_{g_s ∈ κ(g_j)} ω_js )
where y_j^(t+1) is the category vector of candidate node g_j at iteration round t+1, y_s^(t) is the category vector at round t of a candidate node g_s in the neighbor set κ(g_j) of g_j, and ω_js is the relationship weight computed from the similarity between the nodes;
in the second way, the value of each dimension c of y_j^(t+1) is updated by taking the maximum confidence:
y_{j,c}^(t+1) = max_{g_s ∈ κ(g_j)} { ω_js * y_{s,c}^(t) }
where y_{s,c}^(t) is the value of dimension c of the category vector y_s^(t).
CN202110677925.XA 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship Active CN113343029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Publications (2)

Publication Number Publication Date
CN113343029A CN113343029A (en) 2021-09-03
CN113343029B (en) 2024-04-02

Family

ID=77477338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677925.XA Active CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Country Status (1)

Country Link
CN (1) CN113343029B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676776B (en) * 2021-09-22 2023-12-26 维沃移动通信有限公司 Video playing method and device and electronic equipment
CN117201873B (en) * 2023-11-07 2024-01-02 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9229958B2 (en) * 2011-09-27 2016-01-05 Hewlett-Packard Development Company, L.P. Retrieving visual media

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
綦金玮; 彭宇新; 袁玉鑫. 面向跨媒体检索的层级循环注意力网络模型 (Hierarchical recurrent attention network model for cross-media retrieval). 中国图象图形学报 (Journal of Image and Graphics), 2018(11), full text. *

Also Published As

Publication number Publication date
CN113343029A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN111143610B (en) Content recommendation method and device, electronic equipment and storage medium
CN111428088B (en) Video classification method and device and server
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
US9253511B2 (en) Systems and methods for performing multi-modal video datastream segmentation
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
US20080052312A1 (en) Image-Based Face Search
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN113343029B (en) Complex video character retrieval method with enhanced social relationship
WO2021047532A1 (en) Method and system for video segmentation
CN106446015A (en) Video content access prediction and recommendation method based on user behavior preference
Ul Haq et al. Personalized movie summarization using deep cnn-assisted facial expression recognition
CN113766299B (en) Video data playing method, device, equipment and medium
CN113010701A (en) Video-centered fused media content recommendation method and device
Dogan et al. A neural multi-sequence alignment technique (neumatch)
CN106529492A (en) Video topic classification and description method based on multi-image fusion in view of network query
Lv et al. Storyrolenet: Social network construction of role relationship in video
CN116361510A (en) Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario
Acar et al. Breaking down violence detection: Combining divide-et-impera and coarse-to-fine strategies
Röthlingshöfer et al. Self-supervised face-grouping on graphs
Li et al. Social context-aware person search in videos via multi-modal cues
Narwal et al. A comprehensive survey and mathematical insights towards video summarization
CN110287376A (en) A method of the important vidclip of extraction based on drama and caption analysis
Bianco et al. Aesthetics assessment of images containing faces
Dai et al. Two-stage model for social relationship understanding from videos
Aly et al. Axes at trecvid 2013

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant