CN113343029B - Complex video character retrieval method with enhanced social relationship - Google Patents

Complex video character retrieval method with enhanced social relationship

Info

Publication number
CN113343029B
CN113343029B
Authority
CN
China
Prior art keywords
nodes
candidate
node
query
video
Prior art date
Legal status
Active
Application number
CN202110677925.XA
Other languages
Chinese (zh)
Other versions
CN113343029A (en)
Inventor
徐童
陈恩红
李丹
周培伦
何伟栋
郝艳宾
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110677925.XA priority Critical patent/CN113343029B/en
Publication of CN113343029A publication Critical patent/CN113343029A/en
Application granted granted Critical
Publication of CN113343029B publication Critical patent/CN113343029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention discloses a social-relationship-enhanced person retrieval method for complex videos. On one hand, with social relationships as the tie, semantic information related to the target person can be fully mined and exploited, so that all segments in which the target person appears can be retrieved accurately, providing a foundation for other related applications; on the other hand, the method yields a stronger person retrieval model for complex videos, achieving better results on subjective and objective metrics such as precision, recall and fluency.

Description

Complex video character retrieval method with enhanced social relationship
Technical Field
The invention relates to the technical field of computer vision, and in particular to a social-relationship-enhanced person retrieval method for complex videos.
Background
Person retrieval in complex videos is an important problem in video analysis: it aims to extract from a complete video all segments in which a specific person appears. With the rise of numerous emerging video media platforms, its real-world application scenarios are becoming ever wider. Some movie fans, or viewers fond of a particular character, may wish to make person-oriented summaries, such as clips of a particular star's appearances in a given film or TV work. Using computer technology to automatically and effectively extract and understand the information carried by video content, and thereby retrieve the target person, helps people grasp video content quickly and accurately and has clear application value. Mainstream video media platforms already offer many intelligent video analysis functions that let users query and understand video content more conveniently. For example, the "watch only him/her" feature offered by Youku and iQiyi can automatically generate clips of specific video characters according to user preference, condensing massive video data and extracting its key information.
However, conventional person retrieval methods rely on visual features alone and rarely exploit the rich high-level semantic information in videos formed jointly by images and text. In fact, besides the visual features of the video frames, a video also contains a large amount of heterogeneous text, such as subtitles and bullet-screen comments (danmaku); together, the visual and textual information can reveal the scene context of the current segment and provide reliable high-level semantic clues when the visual information is of low quality. If these high-level semantic clues can be described formally and used to assist person retrieval in complex videos, the task of person-oriented video summarization can be accomplished better. More importantly, when such semantic information is combined with the persons' social relationships, the resulting social context offers a crucial clue for person retrieval. For example, if the current scene can be determined to be school-related from the "classmate" relationship between two persons appearing in the current frame, then other persons present are very likely to have a teacher-student relationship with them, which narrows the candidate set of persons. It follows that social-clue enhancement holds great potential for the person retrieval task.
Disclosure of Invention
The invention aims to provide a social-relationship-enhanced person retrieval method for complex videos, which abstracts the social relationships in the video context by combining visual information with multi-source textual information, and provides reliable semantic clues for person retrieval in complex scenes, thereby improving the accuracy of the retrieval results.
The object of the invention is achieved through the following technical solution:
A social-relationship-enhanced person retrieval method for complex videos comprises the following steps:
sampling the video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to a given set of target query persons;
for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighboring nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person.
According to the technical solution provided by the invention, on the one hand, with social relationships as the tie, semantic information related to the target person can be fully mined and exploited, so that all segments in which the target person appears can be retrieved accurately, providing a foundation for other related applications; on the other hand, the method yields a stronger person retrieval model for complex videos, achieving better results on subjective and objective metrics such as precision, recall and fluency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a model framework diagram of the social-relationship-enhanced complex-video person retrieval method provided by an embodiment of the invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
The embodiment of the invention provides a social-relationship-enhanced person retrieval method for complex videos, which mainly comprises the following steps:
step 1, sampling a video to be searched to obtain a video frame sequence, and extracting corresponding text information from the video to be searched.
In the step, the complete video to be searched can be sampled at equal intervals to obtain a sequence consisting of video frames; then, extracting corresponding text information from the video to be retrieved, and carrying out denoising and time axis correction processing; wherein, the text information includes: bullet screen text information and subtitle text information.
Step 2, performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to the given set of target query persons.
The method mainly comprises three parts:
the first part is the detection of human beings: and detecting the person region for each frame in the video frame sequence, extracting the visual characteristics of each person region and recording the time stamp of each person region corresponding to the video to be searched.
The second part is scene segmentation: based on changes of visual style, the video frame sequence is divided into several segments, each segment is taken as one scene, and the start and end timestamps of each scene in the video to be retrieved are recorded.
The third part is building the relation propagation graph: first, the set of person regions of each scene is established from the start and end timestamps of the scene and the timestamps of the person regions; the set of a scene contains all person regions whose timestamps fall within the start-to-end range of that scene. Then, a relation propagation graph is built for each scene from its set of person regions: the nodes of the graph are the person regions, called candidate nodes, the edges carry the similarity between candidate nodes, and all candidate nodes are connected with each other to form a complete graph. In addition, each candidate node maintains an iteratively updated category vector identifying the person category it belongs to. At the same time, edges pointing to every query node are additionally generated for each candidate node.
Step 3, for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighbor nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person.
In this step, relation-aware label propagation is performed scene by scene, iteratively computing the category vector of every node in the relation propagation graph and finally obtaining the person category of each node (i.e., each person region). Relation-aware label propagation involves the following parts: first, the person relations between different nodes in the current scene are detected; then, following the idea of finding the maximum social-relationship network matching, the similarity between nodes is computed based on the detected relations; finally, the category vectors of neighboring nodes are aggregated according to the similarity between nodes. Finding the maximum social-relationship network follows two principles:
1) Different people with a specific social relationship often co-occur or appear alternately within a short time, so one person can be used to assist in identifying the others.
2) The characters in a video are related to the scene, and social relationships can serve as links between the scene and the characters, reducing the candidate set of characters that may appear in the current scene.
During relation-aware label propagation, label information from the pre-labeled relation graph between the query nodes is introduced; finally, all scene segments in which the persons corresponding to the query nodes appear are located among the scenes of the video to be retrieved.
FIG. 1 shows the model framework of the above method. The upper-left corner is the input data, mainly the video frame sequence (Video Frames) and the text information (Textual Documents) obtained in step 1, together with the given set of target query persons and the corresponding relation graph (Relation Graph); each target query person in the set corresponds to a clear frontal person-region image, and the social relations between the target query persons are annotated in the relation graph. The top and upper-right parts correspond to step 2: after person detection and scene segmentation, person regions (Detected Regions) and scene segments (Scene Segments) are obtained, yielding the set of person regions contained in each scene (scene units 1-n in FIG. 1), and the related text information can be located directly by timestamp. A relation propagation graph (Graph for Scene Unit) is then built from the set of person regions in each scene, and each candidate node in the graph generates an edge pointing to each query node (Query Node). Finally, the relation-aware label propagation (Relation-aware Label Propagation) introduced in step 3 produces the retrieval result, i.e., the segments of each target query person across all scenes of the video to be retrieved, shown in the lower-left corner. It should be noted that the character images and text shown in FIG. 1 are for illustration only and not limitation; the character images come from existing public datasets, and their use for scientific research does not involve personal privacy.
For ease of understanding, preferred embodiments of the steps described above are described below.
1. Data preprocessing.
In the embodiment of the invention, the data to be processed mainly includes video data, text data, and data related to the target query persons.
For video data, a sequence of video frames can be obtained by equidistant sampling, for example at a rate of 0.5 frames per second; the specific sampling rate is set by trading off performance against efficiency.
For text data, the corresponding text information is extracted from the video to be retrieved and then subjected to denoising and time-axis correction; the text information here mainly covers subtitle text and bullet-screen text.
In particular, for the highly noisy bullet-screen text, irrelevant text is filtered out: regular-expression rules can be used to remove purely symbolic characters and the like, while the sending time of each bullet comment is corrected according to typing speed (typically about 30 words per minute, set according to the actual data). This yields the video frame sequence and the corrected text information, i.e., the preprocessed data.
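For illustration, a minimal Python sketch of this preprocessing step follows; the sampling rate, typing-speed constant and regular expression are example values assumed here rather than prescribed by the method, and OpenCV is used only as one possible frame-sampling backend.

```python
import re
import cv2  # OpenCV, used here only as one possible frame-sampling backend

def sample_frames(video_path, sample_rate=0.5):
    """Sample the video at roughly `sample_rate` frames per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / sample_rate)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx / fps, frame))  # (timestamp in seconds, frame)
        idx += 1
    cap.release()
    return frames

TYPING_SPEED = 30 / 60.0  # assumed ~30 words per minute, expressed per second

def clean_danmaku(danmaku):
    """Drop symbol-only bullet comments and shift send times back by the typing time."""
    cleaned = []
    for ts, text in danmaku:  # (send time in seconds, raw comment text)
        text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)  # crude filter for symbols/emoticons
        if not text:
            continue
        corrected_ts = max(ts - len(text) / TYPING_SPEED, 0.0)
        cleaned.append((corrected_ts, text))
    return cleaned
```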
For the data related to the target query persons, the corresponding set of target query persons is generated from the designated targets; a clear frontal person-region image is selected for each target query person, and the social relations between the target query persons are annotated manually.
2. Building the relation propagation graph.
As described previously, this step includes three parts: person detection, scene segmentation, and graph creation. The preferred embodiments of the various parts are as follows:
1. Person detection.
In the embodiment of the invention, taking the time of the video frame containing the person region as the reference (0 s), a time window of a certain range is used to select the related text information. Empirically, the time window for bullet-screen text is [-10 s, 15 s] around the current frame, and the window for subtitle text is [-45 s, 45 s]; the specific window lengths can be adjusted as needed. These person regions and the selected related text form the input pairs for person detection.
For person detection, all person regions appearing in the video frame sequence can be located frame by frame, without distinction, using a Faster R-CNN-based detector. The person regions output by Faster R-CNN are then fed into two channels to obtain visual features at different levels. The first channel is a multi-scale-convolution-based pedestrian re-identification network (e.g., Cross-Level Semantic Alignment), through which a person region obtains an overall visual feature description; this can be understood as a whole-body feature covering clothing, posture, stature and other dimensions, and is referred to as the re-identification feature. The second channel is a face-recognition convolutional network (e.g., FaceNet), which extracts facial features from the input person region if a face is present. The features of the two channels are visual feature descriptions of a person region at two different levels and are collectively referred to as its visual features.
For example, Faster R-CNN can be initialized with a VGG-16 network; a simple classifier (person or not) is then built on top of it and retrained on an image dataset containing only persons, in the hope of more accurate detection. The Cross-Level Semantic Alignment network can be initialized with a ResNet-50 backbone, and its 1024-dimensional re-identification feature output is taken. For facial features, MTCNN can be used for face localization, and 1024-dimensional facial features are obtained through FaceNet.
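As a concrete illustration, the following Python sketch uses the torchvision implementation of Faster R-CNN to locate person regions (whereas the text above retrains a person-only classifier, which is omitted here); the re-identification and face networks are left as placeholder callables (reid_net, face_net) standing in for Cross-Level Semantic Alignment and MTCNN + FaceNet, since their exact interfaces depend on the chosen implementation.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf Faster R-CNN; the method above retrains a person-only classifier instead.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
PERSON_LABEL = 1  # "person" class index in the COCO label map

@torch.no_grad()
def detect_person_regions(frame_rgb, score_thresh=0.8):
    """Return cropped person regions found in one RGB frame (H x W x 3, uint8)."""
    out = detector([to_tensor(frame_rgb)])[0]
    regions = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if label == PERSON_LABEL and score >= score_thresh:
            x1, y1, x2, y2 = [int(v) for v in box]
            regions.append(frame_rgb[y1:y2, x1:x2])
    return regions

def extract_visual_features(region, reid_net, face_net):
    """Two feature channels per region: whole-body re-ID embedding and facial embedding."""
    e_id = reid_net(region)    # stands in for the Cross-Level Semantic Alignment network
    e_face = face_net(region)  # stands in for MTCNN + FaceNet; may return None if no face is found
    return e_id, e_face
```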
2. Scene segmentation.
PySceneDetect can be chosen as the scene splitter; it divides the preprocessed video frame sequence into different segments according to visual style, so that frames within the same segment are visually similar (e.g., in background color and image elements).
Although the scenes in a complex video vary greatly, so that different appearances of the same person can differ widely in visual appearance, once a series of scenes is obtained by splitting at visual boundaries, the visual features of the same person within one scene remain relatively stable, which makes better use of the temporal information in the complex video. At the same time, splitting into scenes separates the social-relation semantics of different scenes, improving the purity of the semantic information. Finally, the start and end timestamps of each scene output by the scene segmentation are recorded.
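A minimal sketch of this step with PySceneDetect follows (the v0.6+ API is assumed, and the threshold value is an example rather than a prescribed setting):

```python
from scenedetect import detect, ContentDetector

def segment_scenes(video_path, threshold=27.0):
    """Split the video at visual-style changes; return (start_sec, end_sec) per scene."""
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]
```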
3. Graph construction.
Combining the results of the previous two parts, all person regions are organized, according to the start and end times of the scenes, into sets of person regions for the different scenes; the set for a scene contains all person regions whose timestamps fall within the start-to-end range of that scene.
A relation propagation graph is built for the person regions in each scene: the nodes of the graph are person regions, the edges are the feature similarities between nodes, and the features of a node include its re-identification vector and face vector. At the same time, each node maintains an iteratively updated category vector identifying the person category it belongs to; the nodes in the relation propagation graph are referred to as candidate nodes.
Denote the set of target query persons as Q = {q_1, q_2, ..., q_n}, where q_i is the person region of the i-th query person; its re-identification feature and facial feature are obtained in the way introduced in the person-detection part, each q_i is treated as a query node, and n is the number of query nodes in the target query person set. Denote the set of candidate nodes in a single relation propagation graph as G = {g_1, g_2, ..., g_m}, where m is the number of candidate nodes in that graph. Edges are generated as follows: all candidate nodes in G are connected with each other as a complete graph, and for each candidate node g_j, n additional edges pointing to q_1, q_2, ..., q_n are generated; this finally yields m(m-1)/2 + m*n edges and m + n nodes.
At the same time, a category vector (label vector) is initialized for each node to identify the person identity of its person region. For a query node q_i ∈ Q, its label vector is initialized to p_i (kept fixed and never updated), an n-dimensional one-hot vector whose i-th dimension is 1; for all candidate nodes, the category vectors are initialized to the zero vector because their identities are unknown. In the description that follows, q denotes a query node, g denotes a candidate node, and the subscripts of q and g distinguish different query or candidate nodes.
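The graph construction can be illustrated with the following sketch; the node naming and the data structures (plain Python dictionaries and edge lists) are assumptions made for the example, not part of the method itself.

```python
import numpy as np
from itertools import combinations

def build_propagation_graph(num_candidates, num_queries):
    """Build one relation propagation graph for a single scene.

    num_candidates: m person regions detected inside the scene
    num_queries:    n target query persons
    Returns node names, the edge list and the initial class (label) vectors.
    """
    m, n = num_candidates, num_queries
    candidates = [f"g{j}" for j in range(m)]
    queries = [f"q{i}" for i in range(n)]

    edges = list(combinations(candidates, 2))                # complete graph over candidates
    edges += [(g, q) for g in candidates for q in queries]   # every candidate -> every query node
    # total number of edges: m*(m-1)/2 + m*n

    class_vec = {g: np.zeros(n) for g in candidates}         # unknown identity -> zero vector
    for i, q in enumerate(queries):
        class_vec[q] = np.eye(n)[i]                          # fixed one-hot label for query nodes
    return candidates, queries, edges, class_vec
```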
3. Relation-aware label propagation.
Relation-aware label propagation aims to aggregate, for each node in the graph, the label information of its neighbor nodes, thereby exploiting the temporal locality of a scene and letting relation clues and visual features complement each other. This stage is also divided into three parts: relation detection, relation-aware node similarity computation, and label propagation.
1. Relation detection.
For a single relation propagation graph, the social relation between any two candidate nodes g_k and g_j is judged by fusing the bullet-screen and image information. All time-stamped bullet comments and subtitles within the scene are gathered into one document, and its tf-idf vector is taken as the text representation e_t; meanwhile, RudeCarnie is used to extract the gender and age features of g_k and g_j, taking its fc7 outputs as the gender representation and age representation. These vectors are then mapped, concatenated, and fed into a calibrated support vector machine (calibrated SVM) to obtain the relation prediction result r_kj and the class probability p_kj:
(r_kj, p_kj) = SVM([e_t; e^g_k; e^g_j; e^a_k; e^a_j])
where e^g_k and e^g_j are the gender representations of the two candidate nodes g_k and g_j, and e^a_k and e^a_j are their age representations.
In the embodiment of the invention, the text information within a scene is merged into one document, and its tf-idf (term frequency-inverse document frequency) vector is extracted as the text representation of the whole scene, so that different candidate nodes share the same e_t when computing the relations and class probabilities.
RudeCarnie is an existing model for extracting gender and age features; fc7 is the name of a particular layer in RudeCarnie, namely the fully connected layer with index 7.
In the embodiment of the invention, the relations are mainly social relations, such as kinship, couple, friend, enemy, colleague and so on; each dimension of the class probability represents the probability that the two candidate nodes belong to the corresponding social-relation category, and the social relation of the two candidate nodes is the category with the highest probability.
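The following sketch illustrates one way the relation detector could be realized with scikit-learn, using a tf-idf text vector plus gender and age representations fed to a probability-calibrated SVM; the relation label set, the feature ordering and the classifier settings are assumptions for illustration, and relation_clf would first have to be fitted on labeled person-pair samples.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# A probability-calibrated linear SVM plays the role of the calibrated SVM;
# it must first be fitted on labeled person-pair samples (not shown here).
relation_clf = CalibratedClassifierCV(SVC(kernel="linear"), cv=3)

def scene_text_vector(texts, vectorizer: TfidfVectorizer):
    """Merge all subtitles/bullet comments of a scene into one document; return its tf-idf vector."""
    return vectorizer.transform([" ".join(texts)]).toarray()[0]  # vectorizer assumed already fitted

def predict_relation(e_t, gender_k, gender_j, age_k, age_j):
    """Concatenate scene text, gender and age representations; return (relation, class probabilities)."""
    x = np.concatenate([e_t, gender_k, gender_j, age_k, age_j]).reshape(1, -1)
    probs = relation_clf.predict_proba(x)[0]
    return relation_clf.classes_[int(np.argmax(probs))], probs
```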
2. Relation-aware node similarity computation.
In the embodiment of the invention, the person relations are integrated into the node similarity computation, so that high-level semantic information corrects and complements the visual information. When computing node similarity, social relations provide guidance in two respects: 1) people with a particular social relationship should co-occur or appear repeatedly in the same scene; 2) social relationships can act as a tie between people and the scene, reducing the candidate set of people that may appear in the current scene.
In the embodiment of the invention, the similarity is divided into a visual similarity part and a relation-guided similarity part, and the similarity between nodes is computed as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r * Sim_r(n_x, n_y)
where α_r is the weight of the relation-guided similarity (e.g., it can be set to 1.2); Sim_v is the visual similarity, computed from the visual features of the two nodes; Sim_r is the relation-guided similarity, computed from the relations and class probabilities between candidate nodes; n_x and n_y are two nodes, where at least one is a candidate node and the other is either a candidate node or a query node. That is, the similarity computation covers two cases: similarity between two candidate nodes, and similarity between a candidate node and a query node.
In both cases the visual similarity Sim_v is computed in the same way, namely by fusing the cosine similarity of the re-identification features e^id with the cosine similarity of the facial features e^face:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b * Cos(e^face_x, e^face_y)
where α_b is the weight of the facial term (it can be set to 0.2); e^id_x and e^id_y are the re-identification features of the two nodes, e^face_x and e^face_y are their facial features, and Cos(·) denotes the cosine similarity between features; the re-identification feature together with the facial feature constitutes the visual feature of a node.
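A minimal numeric sketch of the fused similarity follows; the dictionary-based node representation and the handling of missing faces are assumptions for the example, while the weight values mirror the example settings above.

```python
import numpy as np

ALPHA_R = 1.2  # example weight of the relation-guided term
ALPHA_B = 0.2  # example weight of the facial term inside the visual similarity

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sim_visual(node_x, node_y):
    """Fuse re-identification and facial cosine similarities; skip the face term if a face is missing."""
    s = cos(node_x["e_id"], node_y["e_id"])
    if node_x.get("e_face") is not None and node_y.get("e_face") is not None:
        s += ALPHA_B * cos(node_x["e_face"], node_y["e_face"])
    return s

def sim_total(node_x, node_y, sim_relation):
    """Overall similarity = visual similarity + alpha_r * relation-guided similarity."""
    return sim_visual(node_x, node_y) + ALPHA_R * sim_relation
```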
The relation-guided similarity Sim_r is described in detail below. As noted, there are two cases: similarity between a query node q_i and a candidate node g_j, and similarity between two candidate nodes g_s and g_j.
1) For a query node q_i and a candidate node g_j, the relation-guided similarity is computed as follows:
First, the relations and class probabilities between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relation propagation graph are extracted (i.e., the relation-detection part above):
Rel_j = {(r_jk, p_jk) = SVM([e_t; e^g_j; e^g_k; e^a_j; e^a_k]) | g_k ∈ κ(g_j)}
where r_jk and p_jk are the relation prediction result and class probability between candidate node g_j and its neighbor g_k; SVM denotes the calibrated support vector machine; e_t is the text representation vector of the bullet-screen text within the scene containing the candidate nodes; e^g_j, e^g_k and e^a_j, e^a_k are the gender and age representation vectors of g_j and g_k; κ(g_j) denotes the set of neighbor candidate nodes of g_j.
Then, according to the pre-labeled relation graph G_r between the query nodes, all query nodes having, with q_i, one of the relations in Rel_j are found, forming the set:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relation between the query node pair (q_i, q_u) obtained by looking it up in the relation graph G_r.
The relation-guided similarity between candidate node g_j and query node q_i is then computed as:
Sim_r(q_i, g_j) = max{ p_jk * Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij }.
2) For two candidate nodes g_s and g_j, the relation-guided similarity is computed in a similar way:
First, the relation r_sj and class probability p_sj between g_s and g_j are extracted, and the pre-labeled relation graph G_r between the query nodes is queried to obtain the set of query-node pairs having the same relation:
Q'_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relation between the query node pair (q_t, q_l) obtained by looking it up in the relation graph G_r.
Then, using the query-node pair set Q'_sj, the pair (q_t', q_l') ∈ Q'_sj whose two members are most visually similar to g_s and g_j, respectively, is selected among the query nodes, and the relation-guided similarity between the candidate nodes g_s and g_j is computed as:
Sim_r(g_s, g_j) = p_sj * Sim_v(q_t', q_l').
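As an illustration of the first case (query node versus candidate node), the following sketch computes Sim_r with plain Python data structures; the representation of Rel_j and of the labeled relation graph G_r as dictionaries is an assumption made for the example.

```python
def sim_relation_query(q_i, g_j, rel_j, relation_graph, sim_visual):
    """Relation-guided similarity Sim_r between a query node q_i and a candidate node g_j.

    rel_j:          list of (g_k, r_jk, p_jk) for every neighbour g_k of g_j in the scene graph
    relation_graph: dict of dicts, relation_graph[q_i][q_u] = labeled relation between query persons
    sim_visual:     callable returning the visual similarity Sim_v of two nodes
    """
    best = 0.0
    for g_k, r_jk, p_jk in rel_j:
        # query nodes q_u that share relation r_jk with q_i mirror the (g_j, g_k) pair
        for q_u, rel in relation_graph.get(q_i, {}).items():
            if rel == r_jk:
                best = max(best, p_jk * sim_visual(q_u, g_k))
    return best
```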
3. Label propagation.
After the social-relation-aware similarities are obtained, label propagation within the same scene makes it possible to exploit the temporal information of that scene while avoiding the loss of semantic purity that mixing different scenes would cause, and a more accurate person retrieval result is obtained by iteratively updating the category vector of each candidate node.
In the embodiment of the invention, the category vector of each candidate node is updated by aggregating the features of its neighbor nodes; the following two update schemes are provided, either of which may be used:
In the first scheme, the update formula is:
y_j^(t+1) = ( Σ_{g_s ∈ κ(g_j)} ω_js * y_s^(t) ) / ( Σ_{g_s ∈ κ(g_j)} ω_js )
where y_j^(t+1) is the category vector of candidate node g_j at iteration round t+1, y_s^(t) is the category vector at round t of a candidate node g_s in the neighbor set κ(g_j) of g_j, and ω_js is the relationship weight computed from the similarity between the nodes.
In the second scheme, the value of each dimension c of y_j^(t+1) is updated by taking the maximum confidence, which reduces the influence of noise:
y_{j,c}^(t+1) = max_{g_s ∈ κ(g_j)} { ω_js * y_{s,c}^(t) }
where y_{s,c}^(t) is the value of dimension c of the category vector, at round t, of a candidate node g_s in the neighbor set κ(g_j) of g_j.
In general, the iteration lasts about 10-20 rounds, and the dimension with the largest value in the category vector of each candidate node is finally taken as its person identity. Each dimension of the category vector represents the probability of belonging to one target query person, and the number of dimensions equals the number of query nodes (target query persons); that is, the largest probability indicates that the candidate node is that target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of the corresponding target query person.
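The iterative update can be sketched as follows; the normalized weighted-average and per-dimension maximum rules below are one plausible reading of the two schemes above, and the dictionary-based graph representation is assumed for illustration.

```python
import numpy as np

def propagate_labels(class_vec, neighbors, weight, n_iters=15, mode="average"):
    """Iteratively update candidate class vectors from their neighbours.

    class_vec: {node: np.ndarray of length n}; query nodes keep fixed one-hot vectors
    neighbors: {candidate: list of neighbouring nodes (other candidates and all query nodes)}
    weight:    {(node_a, node_b): relationship weight derived from the similarity Sim}
    """
    for _ in range(n_iters):
        updated = {}
        for g, neigh in neighbors.items():
            msgs = np.stack([weight[(g, s)] * class_vec[s] for s in neigh])
            if mode == "average":  # first scheme: normalized weighted mean of neighbour labels
                total_w = sum(weight[(g, s)] for s in neigh) + 1e-8
                updated[g] = msgs.sum(axis=0) / total_w
            else:                  # second scheme: per-dimension maximum-confidence vote
                updated[g] = msgs.max(axis=0)
        class_vec.update(updated)  # synchronous update after each full round
    # final identity of each candidate = arg max of its converged class vector
    return {g: int(np.argmax(v)) for g, v in class_vec.items() if g in neighbors}
```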
According to the scheme provided by the embodiment of the invention (i.e., the scheme described above), a well-performing person retrieval model can be learned by making full use of existing data, and target persons in a specific complex video are then retrieved on this basis, so that social relationships are used to improve upon purely visual person retrieval models.
As will be appreciated by those skilled in the art, the identities of the persons in the training data are known; the training process adopts a conventional classification method and is not repeated here.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. A social-relationship-enhanced complex video person retrieval method, comprising:
sampling the video to be retrieved to obtain a video frame sequence, and extracting corresponding text information from the video to be retrieved;
performing person detection and scene segmentation on the video frame sequence, and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved; constructing a corresponding relation propagation graph for each set of person regions, in which the person regions serve as nodes, called candidate nodes, connected with each other as a complete graph; initializing a category vector for each candidate node, and connecting each candidate node with every query node in the relation graph corresponding to a given set of target query persons;
for each relation propagation graph, predicting the relation and class probability between every two candidate nodes by combining the corresponding text information; computing the similarity between candidate nodes, and between candidate nodes and query nodes, from the visual features of the candidate nodes together with the relations and class probabilities between candidate nodes, and updating the category vector of each candidate node by aggregating the features of its neighboring nodes until convergence; each dimension of the category vector represents the probability of belonging to one target query person, the number of dimensions equals the number of target query persons, the largest probability in the converged category vector indicates that the candidate node is the corresponding target query person, and the person-region image of the candidate node together with its converged category vector is taken as the retrieval result of that target query person;
wherein the relation and class probability between two candidate nodes are predicted from the corresponding text information by:
(r_kj, p_kj) = SVM([e_t; e^g_k; e^g_j; e^a_k; e^a_j])
where r_kj and p_kj are the relation prediction result and class probability for the two candidate nodes g_k and g_j; SVM denotes a calibrated support vector machine; e_t is the text representation vector of the text information within the scene containing the candidate nodes; e^g_k and e^g_j are the gender representations of g_k and g_j; e^a_k and e^a_j are their age representations;
the similarity between candidate nodes, and between a candidate node and a query node, is computed as:
Sim(n_x, n_y) = Sim_v(n_x, n_y) + α_r * Sim_r(n_x, n_y)
where n_x and n_y are two nodes, of which at least one is a candidate node and the other is a candidate node or a query node; α_r is the weight of the relation-guided similarity; Sim_v is the visual similarity, computed from the visual features of the two nodes; Sim_r is the relation-guided similarity, computed from the relations and class probabilities between candidate nodes;
the visual similarity is computed as:
Sim_v(n_x, n_y) = Cos(e^id_x, e^id_y) + α_b * Cos(e^face_x, e^face_y)
where α_b is a weight; e^id_x and e^id_y are the re-identification features of the two nodes; e^face_x and e^face_y are their facial features; Cos(·) denotes the cosine similarity between features; the re-identification feature and the facial feature serve as the visual features of a node;
if the two nodes are a candidate node and a query node, denoted candidate node g_j and query node q_i, the relation-guided similarity is computed as follows:
the relations and class probabilities between the candidate node g_j and all of its neighbor candidate nodes g_k ∈ κ(g_j) in the relation propagation graph are extracted:
Rel_j = {(r_jk, p_jk) = SVM([e_t; e^g_j; e^g_k; e^a_j; e^a_k]) | g_k ∈ κ(g_j)}
where r_jk and p_jk are the relation prediction result and class probability between candidate node g_j and its neighbor g_k; SVM denotes the calibrated support vector machine; e_t is the text representation vector of the bullet-screen text within the scene containing the candidate nodes; e^g_j, e^g_k and e^a_j, e^a_k are the gender and age representation vectors of g_j and g_k; κ(g_j) denotes the set of neighbor candidate nodes of g_j;
according to the pre-labeled relation graph G_r between the query nodes, all query nodes having, with q_i, one of the relations in Rel_j are found, forming the set:
Q_ij = {(q_u, r_jk) | G_r(q_i, q_u) = r_jk, (r_jk, p_jk) ∈ Rel_j}
where G_r(q_i, q_u) denotes the relation between the query node pair (q_i, q_u) obtained by looking it up in the relation graph G_r;
the relation-guided similarity between candidate node g_j and query node q_i is computed as:
Sim_r(q_i, g_j) = max{ p_jk * Sim_v(q_u, g_k) | (r_jk, p_jk) ∈ Rel_j, (q_u, r_jk) ∈ Q_ij };
if both nodes are candidate nodes, denoted g_s and g_j, the relation-guided similarity is computed as follows:
the relation r_sj and class probability p_sj between g_s and g_j are extracted, and the pre-labeled relation graph G_r between the query nodes is queried to obtain the set of query-node pairs having the same relation:
Q'_sj = {(q_t, q_l) | G_r(q_t, q_l) = r_sj}
where G_r(q_t, q_l) denotes the relation between the query node pair (q_t, q_l) obtained by looking it up in the relation graph G_r;
using the query-node pair set Q'_sj, the pair (q_t', q_l') ∈ Q'_sj whose two members are most visually similar to g_s and g_j, respectively, is selected among the query nodes, and the relation-guided similarity between the candidate nodes g_s and g_j is computed as:
Sim_r(g_s, g_j) = p_sj * Sim_v(q_t', q_l').
2. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein sampling the video to be retrieved to obtain a video frame sequence and extracting the corresponding text information from the video to be retrieved comprise:
for the video to be retrieved, obtaining the video frame sequence by equidistant sampling;
extracting the corresponding text information from the video to be retrieved and subjecting it to denoising and time-axis correction, wherein the text information includes bullet-screen text and subtitle text.
3. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein performing person detection and scene segmentation on the video frame sequence and establishing the set of person regions contained in each scene by combining the timestamps of the video to be retrieved comprise:
detecting person regions in each frame of the video frame sequence, extracting the visual features of each person region, and recording the timestamp of each person region in the video to be retrieved;
dividing the video frame sequence into a plurality of segments based on changes of visual style, taking each segment as one scene, and recording the start and end timestamps of each scene in the video to be retrieved;
establishing the set of person regions for each scene based on the start and end timestamps of the scene and the timestamps of the person regions.
4. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein the visual features corresponding to the candidate nodes and the query nodes comprise two types of features: re-identification features and facial features;
the person-region images corresponding to the candidate nodes and the target-query-person images corresponding to the query nodes are respectively input into a multi-scale-convolution-based pedestrian re-identification network and a face-recognition-based convolutional network to extract the two types of features.
5. The social-relationship-enhanced complex video person retrieval method according to claim 1, wherein the category vector of each candidate node is updated by aggregating the features of its neighbor nodes in either of the following ways:
in the first way, the update formula is:
y_j^(t+1) = ( Σ_{g_s ∈ κ(g_j)} ω_js * y_s^(t) ) / ( Σ_{g_s ∈ κ(g_j)} ω_js )
where y_j^(t+1) is the category vector of candidate node g_j at iteration round t+1, y_s^(t) is the category vector at round t of a candidate node g_s in the neighbor set κ(g_j) of g_j, and ω_js is the relationship weight computed from the similarity between the nodes;
in the second way, the value of each dimension c of y_j^(t+1) is updated by taking the maximum confidence:
y_{j,c}^(t+1) = max_{g_s ∈ κ(g_j)} { ω_js * y_{s,c}^(t) }
where y_{s,c}^(t) is the value of dimension c of the category vector y_s^(t).
CN202110677925.XA 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship Active CN113343029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677925.XA CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Publications (2)

Publication Number Publication Date
CN113343029A CN113343029A (en) 2021-09-03
CN113343029B (en) 2024-04-02

Family

ID=77477338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677925.XA Active CN113343029B (en) 2021-06-18 2021-06-18 Complex video character retrieval method with enhanced social relationship

Country Status (1)

Country Link
CN (1) CN113343029B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676776B (en) * 2021-09-22 2023-12-26 维沃移动通信有限公司 Video playing method and device and electronic equipment
CN117201873B (en) * 2023-11-07 2024-01-02 湖南博远翔电子科技有限公司 Intelligent analysis method and device for video image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9229958B2 (en) * 2011-09-27 2016-01-05 Hewlett-Packard Development Company, L.P. Retrieving visual media

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061915A (en) * 2019-12-17 2020-04-24 中国科学技术大学 Video character relation identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
綦金玮; 彭宇新; 袁玉鑫. 面向跨媒体检索的层级循环注意力网络模型 (Hierarchical recurrent attention network model for cross-media retrieval). 中国图象图形学报 (Journal of Image and Graphics), 2018(11), full text. *

Also Published As

Publication number Publication date
CN113343029A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN111143610B (en) Content recommendation method and device, electronic equipment and storage medium
CN111428088B (en) Video classification method and device and server
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
US9253511B2 (en) Systems and methods for performing multi-modal video datastream segmentation
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
US20080052312A1 (en) Image-Based Face Search
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN113343029B (en) Complex video character retrieval method with enhanced social relationship
WO2021047532A1 (en) Method and system for video segmentation
CN106446015A (en) Video content access prediction and recommendation method based on user behavior preference
Ul Haq et al. Personalized movie summarization using deep cnn-assisted facial expression recognition
CN113766299B (en) Video data playing method, device, equipment and medium
CN113010701A (en) Video-centered fused media content recommendation method and device
Dogan et al. A neural multi-sequence alignment technique (neumatch)
CN106529492A (en) Video topic classification and description method based on multi-image fusion in view of network query
Lv et al. Storyrolenet: Social network construction of role relationship in video
CN116361510A (en) Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario
Acar et al. Breaking down violence detection: Combining divide-et-impera and coarse-to-fine strategies
Röthlingshöfer et al. Self-supervised face-grouping on graphs
Li et al. Social context-aware person search in videos via multi-modal cues
Narwal et al. A comprehensive survey and mathematical insights towards video summarization
CN110287376A (en) A method of the important vidclip of extraction based on drama and caption analysis
Bianco et al. Aesthetics assessment of images containing faces
Dai et al. Two-stage model for social relationship understanding from videos
Aly et al. Axes at trecvid 2013

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant