CN111008558A - Picture/video important person detection method combining deep learning and relational modeling - Google Patents

Picture/video important person detection method combining deep learning and relational modeling

Info

Publication number
CN111008558A
Authority
CN
China
Prior art keywords
picture
importance
relation
relationship
important
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911042034.6A
Other languages
Chinese (zh)
Other versions
CN111008558B (en)
Inventor
郑伟诗 (Wei-Shi Zheng)
洪发挺 (Fa-Ting Hong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat-sen University
2019-10-30 Priority to CN201911042034.6A
2020-04-14 Publication of CN111008558A
2023-05-30 Application granted
2023-05-30 Publication of CN111008558B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a picture/video important person detection method combining deep learning and relational modeling, which comprises the following steps: S1, extracting the appearance information and the geometric information of each person in the picture/video and fusing them into a personal feature representing high-level semantics; S2, by mining the relations between people and between people and the scene, computing relation features that the personal features alone cannot express, or cannot highly express; and S3, performing importance classification: performing an important/unimportant binary classification on the final feature expression of each person extracted from the relation computation model, and taking the probability of each person being classified into the important category as the importance score, where the person with the highest score is the important person identified by the relation computation model. By the method, the relationships between the people in a picture/video and between the people and the events in the picture can be constructed autonomously through learning, and the importance degree of each person can be inferred automatically.

Description

Picture/video important person detection method combining deep learning and relational modeling
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a picture/video important person detection method combining deep learning and relational modeling.
Background
Picture/video important person detection means that, given a picture containing multiple people, the important person in the picture/video is identified according to the people's clothing, actions, positions, interaction information and the scene they are in. The technology can assist scene understanding and the development of industries such as text live broadcasting, film and television shooting and security monitoring. For example, in text live broadcasting, what happens in a scene can be judged from the behavior of the video's central character, and a text description can be generated directly. In live sports broadcasting, the method can detect important people in a sports scene, such as the ball handler in a basketball or football match, who can then be tracked by a camera, reducing labor costs. In security work, important person detection in video surveillance can monitor the important protected objects in a scene, analyze persons with abnormally high importance scores, and carry out an appropriate prevention and control plan.
The existing picture/video important person detection methods mainly comprise the following three types:
1) Ranking based on pedestrian pairs: to automatically detect the important pedestrian in a picture, the most direct way is to form a pedestrian pair from every two pedestrians in the picture and predict the relative importance of the pair. It has therefore been proposed in the prior art to use a regression model to infer the importance relationship between two different people in a picture, and to infer the most important face in the picture from such pairwise importance relationships.
2) Ranking based on a perceptron: the most important people in a picture or video play a large role in the recognition and detection of events in the video. It has been proposed in the prior art to extract action features and appearance features of the different players in a basketball game and to compute each player's importance with a perceptron, thereby improving the accuracy of event recognition and detection in basketball games.
3) Ranking based on a multilayer hybrid relation graph: judging whether a person is the most important person in a scene does not depend only on that person's appearance and action information; more important is the interaction information between people. Therefore, the prior art constructs a hybrid relation graph over the pedestrians detected in a picture, using different features to model the relations between the pedestrians, and improves the well-known ranking algorithm PageRank to rank the pedestrians' importance on the multilayer hybrid relation graph, finally detecting the most important pedestrian in the picture.
However, the above existing important person detection methods have many disadvantages. The technology based on ranking pedestrian pairs extracts spatial features and salient features of the pedestrians' faces and ranks pedestrian pairs in order to rank the pedestrians' importance; when doing so, it ignores the importance of the other people and the influence of the relations among pedestrians on importance, and it also ignores the role of context information, motion information, appearance information and attention information in important pedestrian detection. The technology based on perceptron ranking uses a perceptron to compute each pedestrian's importance directly from that pedestrian's own features, which ignores the effect of the relations between pedestrians on the importance analysis, as well as the role of spatial information and attention information. The technology based on the multilayer hybrid relation graph adopts features pre-trained on other tasks, which cannot be well expressed at a high semantic level, and it only considers the relations between people while ignoring the relations between people and the scene.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a picture/video important person detection method combining deep learning and relational modeling, which can autonomously construct the relationships between the people in a picture/video and between the people and the events in the picture through learning, and automatically infer the importance degree of each person.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention discloses a picture/video important person detection method combining deep learning and relational modeling, which comprises the following steps:
S1, extracting the appearance information and the geometric information of each person in the picture/video, fusing them into a personal feature representing high-level semantics, and extracting the information of the whole picture/video as a global feature;
S2, computing, by mining the relations between people and between people and the scene, relation features that the personal features alone cannot express, or cannot highly express, and fusing the relation feature rfeat with the personal feature pfeat to generate an importance feature that highly expresses the importance of the individual in the scene, wherein the relation feature rfeat contains the information of the relations between people and between people and the scene;
and S3, performing importance classification: performing an important/unimportant binary classification on the final feature expression of each person extracted from the relation computation model, and taking the probability of each person being classified into the important category as the importance score, wherein the person with the highest score is the important person identified by the relation computation model.
As a preferred technical solution, step S1 specifically includes:
inputting the picture into a face detector or a pedestrian detector to extract the detection boxes of the faces or pedestrians in the picture:

$\{p_i = [x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]\}_{i=1}^{n}$

wherein $[x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]$ is the detection box of $p_i$, $[x_{p_i}, y_{p_i}]$ is the position of pedestrian $p_i$ in the picture, and $[w_{p_i}, h_{p_i}]$ are the width and height of the box detected for $p_i$ in the picture.
As a preferable embodiment, in step S1, the personal information of the person is characterized by the following method:
based on a pedestrian detection frame, personal information pfeat is automatically extracted from bottom to top by using a convolutional neural network, and meanwhile, in order to research the relationship between people and scenes, the global feature gfeat of the whole picture is also extracted:
Figure BDA0002253112630000041
wherein ,
Figure BDA0002253112630000042
which represents the global feature gfeat,
Figure BDA0002253112630000043
representing personal characteristics pfeat, foA representative feature extraction module, I represents the whole picture information, piRepresenting personal information, [ theta ]oAre parameters of the feature extraction module.
As a preferred technical solution, in step S1, the method for fusing the personal features representing high-level semantics includes:
in the feature space, a plurality of features are concatenated and then convolved together, thereby generating a personal feature with high-level semantics.
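To make this fusion step concrete, the following is a minimal PyTorch sketch (the feature shapes, channel counts and class names are illustrative assumptions, not the patent's exact implementation): several per-person feature maps are concatenated along the channel dimension and convolved together into one personal feature vector with high-level semantics.

    # Minimal sketch (assumed shapes and names): fuse several per-person
    # feature maps by concatenation along channels, then convolve them
    # together into a single high-level-semantics personal feature.
    import torch
    import torch.nn as nn

    class FeatureFusion(nn.Module):
        def __init__(self, in_channels=3 * 256, out_dim=512):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(in_channels, out_dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),  # collapse spatial dims to a vector
            )

        def forward(self, interior, exterior, location):
            # each input: (batch, 256, H, W); concatenate along the channel dim
            x = torch.cat([interior, exterior, location], dim=1)
            return self.fuse(x).flatten(1)  # (batch, out_dim)

    if __name__ == "__main__":
        fusion = FeatureFusion()
        feats = [torch.randn(4, 256, 14, 14) for _ in range(3)]
        print(fusion(*feats).shape)  # torch.Size([4, 512])

Concatenation keeps each information source intact, while the shared convolution learns how to weigh and mix them.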
Preferably, in step S2,
the relations between people and between people and the scene are modeled, which specifically comprises the following steps:
S21, computing the relation between people: the two personal features are each projected by a matrix, the projections are added, and the sum is projected by another matrix to obtain a scalar representing the strength of the connection between the two people; finally, a truncation operation forcibly sets values less than 0 to 0;
S22, computing the relation between a person and the scene: the personal feature and the scene feature are added and then projected by a matrix to obtain a scalar representing the strength of the connection between the person and the scene; finally, a truncation operation forcibly sets values less than 0 to 0;
S23, fusing the relations obtained in steps S21 and S22 to obtain the pairwise importance relation: the two scalars are multiplied, so that if either value is small, the result is small;
S24, obtaining from step S23 an n×n matrix, wherein the ith row shows the relations of all persons to the ith person and is used to integrate the importance relations of all persons to the ith person;
and S25, computing the relation feature corresponding to each person.
As a preferred technical solution, the relation feature rfeat corresponding to each person is computed by the following formulas:
a) computing the relation between people:

$\varepsilon^{p2p}_{ij} = \max\big(0,\; W^{p2p}(W_a f^{p}_i + W_b f^{p}_j)\big)$

b) computing the relation between a person and the scene:

$\varepsilon^{p2e}_{i} = \max\big(0,\; W^{p2e}(f^{p}_i + f^{g})\big)$

c) fusing the multiple relations:

$\varepsilon_{ij} = \varepsilon^{p2p}_{ij} \cdot \varepsilon^{p2e}_{j}$

d) computing the importance relation, integrating the relations of all persons to the ith person by row-wise normalization:

$\alpha_{ij} = \varepsilon_{ij} \big/ \sum_{k=1}^{n} \varepsilon_{ik}$

e) computing the relation feature rfeat:

$f^{r}_i = \sum_{j=1}^{n} \alpha_{ij}\, W_v f^{p}_j$

f) constructing the importance feature ifeat for importance evaluation:

$f^{I}_i = \mathrm{Concat}\big(f^{p}_i,\; f^{r,1}_i, \ldots, f^{r,r}_i\big)$

All $W$ in the above formulas are matrices, every $f$ is a feature vector, and each relation $\varepsilon$ is a scalar; the superscripts $1, \ldots, r$ in f) indicate that there are $r$ relation computation modules, since the modules can be stacked, and Concat denotes the splicing operation. The whole relation computation module is modeled as:

$f^{I}_i = f_r\big(\{f^{p}_j\}_{j=1}^{n}, f^{g};\; \theta_r\big)$
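As an illustration, the following PyTorch sketch mirrors formulas a) through e) under stated assumptions: the layer names (Wa, Wb, Wp2p, Wp2e, Wv), the feature dimension and the simple row normalization used for d) are assumptions for illustration rather than the patent's exact implementation.

    # Sketch of one relation sub-module following formulas a)-e) above.
    # Assumptions: n persons with d-dimensional features; values below 0
    # are truncated with ReLU; d) normalizes each row of the n x n matrix.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelationSubModule(nn.Module):
        def __init__(self, d=512):
            super().__init__()
            self.Wa = nn.Linear(d, d, bias=False)    # projects person i
            self.Wb = nn.Linear(d, d, bias=False)    # projects person j
            self.Wp2p = nn.Linear(d, 1, bias=False)  # person-person edge scalar
            self.Wp2e = nn.Linear(d, 1, bias=False)  # person-event edge scalar
            self.Wv = nn.Linear(d, d, bias=False)    # value projection for e)

        def forward(self, fp, fg):
            # fp: (n, d) personal features; fg: (d,) global scene feature
            # a) project, add pairwise, project to a scalar, truncate at 0
            pair = self.Wa(fp).unsqueeze(1) + self.Wb(fp).unsqueeze(0)  # (n, n, d)
            eps_p2p = F.relu(self.Wp2p(pair)).squeeze(-1)               # (n, n)
            # b) add person and scene features, project to a scalar, truncate
            eps_p2e = F.relu(self.Wp2e(fp + fg)).squeeze(-1)            # (n,)
            # c) fuse by multiplication: a weak edge in either graph stays weak
            eps = eps_p2p * eps_p2e.unsqueeze(0)                        # (n, n)
            # d) integrate the relations of all persons to each person (row i)
            alpha = eps / (eps.sum(dim=1, keepdim=True) + 1e-8)         # (n, n)
            # e) relation feature rfeat: relation-weighted sum of projections
            return alpha @ self.Wv(fp)                                  # (n, d)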
as a preferred technical solution, in step S3, the relationship module specifically includes:
based on
Figure BDA0002253112630000057
And different characteristics and relation functions for constructing a relation graph of human and scene
Figure BDA0002253112630000058
And interpersonal relationship diagram
Figure BDA0002253112630000059
Through the relationship graphs, the relationship characteristics rfeat between persons and scenes are calculated, and then the relationship characteristics rfeat are added and fused with the original personal characteristics pfeat into the importance characteristics ifeat which the importance scores can be judged.
Preferably, in step S3,
the acquired importance feature ifeat is input into a neural network consisting of fully connected layers for classification, and the classification score of the important-person class is taken as the importance score, computed as:

$s_i = f_s(f^{I}_i;\, \theta_s)$

wherein $f_s$ is the importance classification module and $\theta_s$ the corresponding parameters; up to this point, the whole network framework can be formulated as:

$s_i = f_s\big(f_r(f_o(I, p_i;\, \theta_o);\, \theta_r);\, \theta_s\big)$
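A minimal sketch of this classification head, assuming a two-way fully connected network whose softmax probability for the important class serves as the importance score (the layer sizes are assumptions):

    # Sketch: two-class fully connected head; the softmax probability of
    # the "important" class is each person's importance score.
    import torch
    import torch.nn as nn

    class ImportanceClassifier(nn.Module):
        def __init__(self, d_ifeat=1536, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(d_ifeat, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, 2),  # logits: [not important, important]
            )

        def forward(self, ifeat):                 # ifeat: (n, d_ifeat)
            return self.mlp(ifeat).softmax(dim=-1)[:, 1]

    if __name__ == "__main__":
        scores = ImportanceClassifier()(torch.randn(5, 1536))
        print(scores, scores.argmax().item())  # highest score: important person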
compared with the prior art, the invention has the following advantages and beneficial effects:
1. By learning the parameters of the framework autonomously through a deep learning algorithm, a better parameter set can be selected.
2. The relation computation module can automatically learn the relation graphs between people and between people and the scene, adaptively encode the relation features, and understand the relations in the scene from a higher level.
3. The invention requires little additional manual annotation: it needs neither pedestrian pose annotations nor computation of each person's sharpness in the picture. Given pedestrians detected by a detector, fast training only requires annotating which pedestrian is the important one, which previous research does not offer.
4. The relation computation module of the invention is embeddable and stackable (iterable).
Drawings
FIG. 1 is a flow diagram of the present invention;
FIG. 2 is a feature representation diagram of the present invention;
FIG. 3 shows the relation graphs constructed by the present invention;
FIGS. 4(a) and 4(b) show important pedestrian detection results of the present invention;
FIG. 5 is a schematic diagram of the relation module of the model of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the method of the present invention is POINT (deeP impOrtance relatIon NeTwork). First, all pedestrians in the picture are detected with a face detector; then (a) a feature expression module extracts personal features and global features. These features are input into (b) a relation computation module, which consists of r sub-relation modules; in each sub-relation module, a person-to-person (p2p) relation graph and a person-to-event (p2e) relation graph are constructed, the importance relation is estimated from the two graphs, and the relation features are encoded and spliced with the original personal feature pfeat to obtain the importance feature. Finally, the importance features are input into (c) an importance classification module, which scores the importance of each person.
Specifically, the method for detecting important persons in pictures/videos by combining deep learning and relational modeling includes the following steps:
(1) detecting pedestrians and extracting important characteristics of the pedestrians in the picture;
given a value of oneFirstly, inputting the image into a face detector or a pedestrian detector to extract a detection frame of the face or the pedestrian in the image
Figure BDA0002253112630000071
wherein [xpi,ypi,wpi,hpi]Is piDetection frame, wherein [ x ]pi,ypi]Is a pedestrian piPosition in the picture, [ w ]pi,hpi]Is piThe width and height of the frame detected in the picture. In particular, the individual face or body information is not sufficient to characterize the overall information of the person, e.g., the position geometry information. The present application therefore references contextual information and location of a persona in order to better characterize the persona's personal information. Based on the pedestrian detection box, the embodiment utilizes the convolutional neural network to automatically extract personal information pfeat from bottom to top, and simultaneously, in order to research the relationship between people and scenes, the global feature gfeat of the whole picture is also extracted:
Figure BDA0002253112630000081
wherein ,
Figure BDA0002253112630000082
which represents the global feature gfeat,
Figure BDA0002253112630000083
representing personal characteristics pfeat, foA representative feature extraction module, I represents the whole picture information, piRepresenting personal information, [ theta ]oAre parameters of the feature extraction module. The specific feature extraction operation flow is shown in fig. 2, wherein the appearance information is divided into an internal part and an external part, the internal area extracts more appearance information inherent to the portrait, and the external area extracts more context information of the portrait appearance and the surrounding environment, so that diversification of the portrait information is ensured. At the same time, the map represented by a 01 value represents all the position information of the person, and the global scene information of the whole photoFeature extraction is achieved by a convolutional neural network.
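To make the 0/1 location map concrete, the following small sketch (the map resolution and function name are illustrative assumptions) rasterizes a detection box [x, y, w, h] into a binary mask that can be fed to the convolutional network alongside the appearance crops:

    # Sketch: rasterize a detection box [x, y, w, h] into a 0/1 location
    # map encoding the person's position for the feature extractor.
    import numpy as np

    def location_map(box, img_w, img_h, out=64):
        x, y, w, h = box
        m = np.zeros((out, out), dtype=np.float32)
        # scale box coordinates from image space to map resolution
        x0, y0 = int(x / img_w * out), int(y / img_h * out)
        x1, y1 = int((x + w) / img_w * out), int((y + h) / img_h * out)
        m[max(y0, 0):min(y1, out), max(x0, 0):min(x1, out)] = 1.0
        return m

    if __name__ == "__main__":
        m = location_map([320, 180, 128, 256], img_w=1280, img_h=720)
        print(m.shape, int(m.sum()))  # (64, 64) and the box area in map cells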
(2) Relation computation:
Relying on the pfeat and gfeat obtained in the previous step, an embeddable and stackable relation computation module is designed to model the relations between people and between people and the scene, and to compute the relation feature rfeat corresponding to each person:
S21, computing the relation between people: the two personal features are each projected by a matrix, the projections are added, and the sum is projected by another matrix to obtain a scalar representing the strength of the connection between the two people; finally, a truncation operation forcibly sets values less than 0 to 0;
S22, computing the relation between a person and the scene: the personal feature and the scene feature are added and then projected by a matrix to obtain a scalar representing the strength of the connection between the person and the scene; finally, a truncation operation forcibly sets values less than 0 to 0;
S23, fusing the relations obtained in steps S21 and S22 to obtain the pairwise importance relation: the two scalars are multiplied, so that if either value is small, the result is small;
S24, obtaining from step S23 an n×n matrix, wherein the ith row shows the relations of all persons to the ith person and is used to integrate the importance relations of all persons to the ith person;
and S25, computing the relation feature corresponding to each person.
The calculation process is as follows:
a) First, the relation between people is computed:

$\varepsilon^{p2p}_{ij} = \max\big(0,\; W^{p2p}(W_a f^{p}_i + W_b f^{p}_j)\big)$

b) the relation between a person and the scene is computed:

$\varepsilon^{p2e}_{i} = \max\big(0,\; W^{p2e}(f^{p}_i + f^{g})\big)$

c) the multiple relations are fused:

$\varepsilon_{ij} = \varepsilon^{p2p}_{ij} \cdot \varepsilon^{p2e}_{j}$

d) the importance relation is computed:

$\alpha_{ij} = \varepsilon_{ij} \big/ \sum_{k=1}^{n} \varepsilon_{ik}$

e) the relation feature rfeat is computed:

$f^{r}_i = \sum_{j=1}^{n} \alpha_{ij}\, W_v f^{p}_j$

f) the importance feature ifeat for importance evaluation is constructed:

$f^{I}_i = \mathrm{Concat}\big(f^{p}_i,\; f^{r,1}_i, \ldots, f^{r,r}_i\big)$

The whole process of constructing the importance feature ifeat is shown in FIG. 5, where Eq. 3 is process c) and Eq. 4 is process d). The constructed relation graphs are shown in FIG. 3: in each graph, all relations between people and all relations between people and the scene are presented as numerical values; it can be observed that non-important people point clearly toward important people, and the values pointing toward important people are clearly higher than those pointing toward others. The relation computation module computes the edge values of the different relation graphs from the features produced by the feature expression module and then fuses them into the importance relation. All $W$ in the above formulas are matrices, every $f$ is a feature vector, and each relation $\varepsilon$ is a scalar; the superscripts $1, \ldots, r$ in f) indicate that there are $r$ relation computation modules, since the modules can be stacked, and Concat denotes the splicing operation. The entire relation computation module can be modeled as:

$f^{I}_i = f_r\big(\{f^{p}_j\}_{j=1}^{n}, f^{g};\; \theta_r\big)$
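Continuing the illustrative sketch from the summary above, stacking the r relation sub-modules and splicing their outputs with the personal feature pfeat into the importance feature ifeat could look like this (RelationSubModule is the assumed class sketched earlier):

    # Sketch: stack r relation sub-modules and splice their relation
    # features with the personal feature into the importance feature ifeat.
    # RelationSubModule is the illustrative class sketched earlier.
    import torch
    import torch.nn as nn

    class RelationModule(nn.Module):
        def __init__(self, d=512, r=2):
            super().__init__()
            self.subs = nn.ModuleList(RelationSubModule(d) for _ in range(r))

        def forward(self, fp, fg):
            # fp: (n, d) personal features; fg: (d,) global feature
            rfeats = [sub(fp, fg) for sub in self.subs]  # r tensors of (n, d)
            return torch.cat([fp, *rfeats], dim=1)       # ifeat: (n, (r+1)*d)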
(3) Importance classification:
Based on the personal features $\{f^{p}_i\}_{i=1}^{n}$, the global feature $f^{g}$ and different feature and relation functions, a person-event relation graph $G^{p2e} = (V, E^{p2e})$ and a person-person relation graph $G^{p2p} = (V, E^{p2p})$ are constructed. Through these relation graphs, the relation features rfeat between persons and between person and scene are computed, and the relation features rfeat are then fused with the original personal features pfeat into the importance feature ifeat, from which the importance score can be judged. The importance feature ifeat acquired in the previous step is input into a neural network consisting of fully connected layers for classification, and the score of the important-person class is taken as the person's importance score:

$s_i = f_s(f^{I}_i;\, \theta_s)$

where $f_s$ is the importance classification module and $\theta_s$ the corresponding parameters. To this end, the entire network framework can be formulated as:

$s_i = f_s\big(f_r(f_o(I, p_i;\, \theta_o);\, \theta_r);\, \theta_s\big)$
as shown in fig. 4(a) -4 (b), the important pedestrian detection result based on the technology of the present invention. FIG. 4(a) shows the results on NCAABasketbill Image Dataset, and FIG. 4(b) shows the results on Multi-scene Image Dataset. The detection result of the invention is higher than the accuracy of the best algorithm (PersonRank) at present by more than 23.2 percent (NCAA)/7 percent (MS).
All parameters of the invention are deep network parameters, optimized autonomously by stochastic gradient descent.
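A schematic training step under the illustrative modules sketched above; the cross-entropy loss over important/unimportant labels and the SGD hyperparameters are assumptions for illustration:

    # Schematic end-to-end training step: every module is an ordinary
    # differentiable layer, so the whole framework is optimized by
    # stochastic gradient descent. RelationModule and ImportanceClassifier
    # are the illustrative classes sketched earlier; the loss and the
    # hyperparameters below are assumptions for illustration.
    import torch
    import torch.nn.functional as F

    relation_net = RelationModule(d=512, r=2)
    classifier = ImportanceClassifier(d_ifeat=3 * 512)  # (r + 1) * d
    params = list(relation_net.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

    def train_step(fp, fg, labels):
        # fp: (n, 512) personal features; fg: (512,) global feature;
        # labels: (n,) long tensor, 1 for the annotated important person
        ifeat = relation_net(fp, fg)
        logits = classifier.mlp(ifeat)       # raw class logits for the loss
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()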
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A picture/video important person detection method combining deep learning and relational modeling, characterized by comprising the following steps:
S1, extracting the appearance information and the geometric information of each person in the picture/video, fusing them into a personal feature representing high-level semantics, and extracting the information of the whole picture/video as a global feature;
S2, computing, by mining the relations between people and between people and the scene, relation features that the personal features alone cannot express, or cannot highly express, and fusing the relation feature rfeat with the personal feature pfeat to generate an importance feature that highly expresses the importance of the individual in the scene, wherein the relation feature rfeat contains the information of the relations between people and between people and the scene;
and S3, performing importance classification: performing an important/unimportant binary classification on the final feature expression of each person extracted from the relation computation model, and taking the probability of each person being classified into the important category as the importance score, wherein the person with the highest score is the important person identified by the relation computation model.
2. The method for detecting important persons in pictures/videos by combining deep learning and relational modeling as claimed in claim 1, wherein the step S1 is specifically as follows:
inputting the picture into a face detector or a pedestrian detector to extract the detection boxes of the faces or pedestrians in the picture:

$\{p_i = [x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]\}_{i=1}^{n}$

wherein $[x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]$ is the detection box of $p_i$, $[x_{p_i}, y_{p_i}]$ is the position of pedestrian $p_i$ in the picture, and $[w_{p_i}, h_{p_i}]$ are the width and height of the box detected for $p_i$ in the picture.
3. The picture/video important person detection method combining deep learning and relational modeling according to claim 2, wherein in step S1, the personal information of the person is characterized by the following method:
based on a pedestrian detection frame, personal information pfeat is automatically extracted from bottom to top by using a convolutional neural network, and meanwhile, in order to research the relationship between people and scenes, the global feature gfeat of the whole picture is also extracted:
Figure FDA0002253112620000012
wherein ,
Figure FDA0002253112620000013
which represents the global feature gfeat,
Figure FDA0002253112620000014
representing personal characteristics pfeat, foA representative feature extraction module, I represents the whole picture information, piRepresenting personal information, [ theta ]oAre parameters of the feature extraction module.
4. The method for detecting important persons in pictures/videos by combining deep learning and relational modeling as claimed in claim 1, wherein in step S1, the method for fusing the personal features representing high-level semantics is as follows:
in the feature space, a plurality of features are concatenated and then convolved together, thereby generating a personal feature with high-level semantics.
5. The picture/video important person detection method combining deep learning and relational modeling according to claim 1, wherein in step S2,
the relations between people and between people and the scene are modeled, which specifically comprises the following steps:
S21, computing the relation between people: the two personal features are each projected by a matrix, the projections are added, and the sum is projected by another matrix to obtain a scalar representing the strength of the connection between the two people; finally, a truncation operation forcibly sets values less than 0 to 0;
S22, computing the relation between a person and the scene: the personal feature and the scene feature are added and then projected by a matrix to obtain a scalar representing the strength of the connection between the person and the scene; finally, a truncation operation forcibly sets values less than 0 to 0;
S23, fusing the relations obtained in steps S21 and S22 to obtain the pairwise importance relation: the two scalars are multiplied, so that if either value is small, the result is small;
S24, obtaining from step S23 an n×n matrix, wherein the ith row shows the relations of all persons to the ith person and is used to integrate the importance relations of all persons to the ith person;
and S25, computing the relation feature corresponding to each person.
6. The picture/video important person detection method combining deep learning and relational modeling according to claim 5, wherein the relation feature rfeat corresponding to each person is computed by the following formulas:
a) computing the relation between people:

$\varepsilon^{p2p}_{ij} = \max\big(0,\; W^{p2p}(W_a f^{p}_i + W_b f^{p}_j)\big)$

b) computing the relation between a person and the scene:

$\varepsilon^{p2e}_{i} = \max\big(0,\; W^{p2e}(f^{p}_i + f^{g})\big)$

c) fusing the multiple relations:

$\varepsilon_{ij} = \varepsilon^{p2p}_{ij} \cdot \varepsilon^{p2e}_{j}$

d) computing the importance relation:

$\alpha_{ij} = \varepsilon_{ij} \big/ \sum_{k=1}^{n} \varepsilon_{ik}$

e) computing the relation feature rfeat:

$f^{r}_i = \sum_{j=1}^{n} \alpha_{ij}\, W_v f^{p}_j$

f) constructing the importance feature ifeat for importance evaluation:

$f^{I}_i = \mathrm{Concat}\big(f^{p}_i,\; f^{r,1}_i, \ldots, f^{r,r}_i\big)$

wherein all $W$ in the above formulas are matrices, every $f$ is a feature vector, and each $\varepsilon$ is a scalar; the superscripts $1, \ldots, r$ in f) indicate that there are $r$ relation computation modules, since the modules can be stacked, and Concat denotes the splicing operation; the whole relation computation module is modeled as:

$f^{I}_i = f_r\big(\{f^{p}_j\}_{j=1}^{n}, f^{g};\; \theta_r\big)$
7. The method for detecting important persons in pictures/videos by combining deep learning and relational modeling as claimed in claim 1, wherein in step S3, the relation module is specifically:
based on the personal features $\{f^{p}_i\}_{i=1}^{n}$, the global feature $f^{g}$ and different feature and relation functions, constructing a person-event relation graph $G^{p2e} = (V, E^{p2e})$ and a person-person relation graph $G^{p2p} = (V, E^{p2p})$; through these relation graphs, the relation features rfeat between persons and between person and scene are computed, and the relation features rfeat are then fused with the original personal features pfeat into the importance feature ifeat, from which the importance score can be judged.
8. The picture/video important person detection method combining deep learning and relational modeling according to claim 7, wherein in step S3,
the acquired importance feature ifeat is input into a neural network consisting of fully connected layers for classification, and the classification score of the important-person class is taken as the importance score, computed as:

$s_i = f_s(f^{I}_i;\, \theta_s)$

wherein $f_s$ is the importance classification module and $\theta_s$ the corresponding parameters; up to this point, the whole network framework can be formulated as:

$s_i = f_s\big(f_r(f_o(I, p_i;\, \theta_o);\, \theta_r);\, \theta_s\big)$
CN201911042034.6A 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling Active CN111008558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911042034.6A CN111008558B (en) 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911042034.6A CN111008558B (en) 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling

Publications (2)

Publication Number Publication Date
CN111008558A (en) 2020-04-14
CN111008558B CN111008558B (en) 2023-05-30

Family

ID=70111750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911042034.6A Active CN111008558B (en) 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling

Country Status (1)

Country Link
CN (1) CN111008558B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967312A (en) * 2020-07-06 2020-11-20 中央民族大学 Method and system for identifying important persons in picture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551916A (en) * 2009-04-16 2009-10-07 浙江大学 Method and system of three-dimensional scene modeling based on ontology
CN101957835A (en) * 2010-08-16 2011-01-26 无锡市浏立方科技有限公司 Complicated relationship and context information-oriented semantic data model
CN108416314A (en) * 2018-03-16 2018-08-17 中山大学 The important method for detecting human face of picture
CN108446625A (en) * 2018-03-16 2018-08-24 中山大学 The important pedestrian detection method of picture based on graph model
CN110232330A (en) * 2019-05-23 2019-09-13 复钧智能科技(苏州)有限公司 A kind of recognition methods again of the pedestrian based on video detection

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551916A (en) * 2009-04-16 2009-10-07 浙江大学 Method and system of three-dimensional scene modeling based on ontology
CN101957835A (en) * 2010-08-16 2011-01-26 无锡市浏立方科技有限公司 Complicated relationship and context information-oriented semantic data model
CN108416314A (en) * 2018-03-16 2018-08-17 中山大学 The important method for detecting human face of picture
CN108446625A (en) * 2018-03-16 2018-08-24 中山大学 The important pedestrian detection method of picture based on graph model
CN110232330A (en) * 2019-05-23 2019-09-13 复钧智能科技(苏州)有限公司 A kind of recognition methods again of the pedestrian based on video detection

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967312A (en) * 2020-07-06 2020-11-20 中央民族大学 Method and system for identifying important persons in picture

Also Published As

Publication number Publication date
CN111008558B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
Kumar et al. Multimodal gait recognition with inertial sensor data and video using evolutionary algorithm
Wang et al. Binge watching: Scaling affordance learning from sitcoms
Ma et al. Depth-based human fall detection via shape features and improved extreme learning machine
CN110569772B (en) Method for detecting state of personnel in swimming pool
CN108154075A (en) The population analysis method learnt via single
CN111753747B (en) Violent motion detection method based on monocular camera and three-dimensional attitude estimation
WO2021203667A1 (en) Method, system and medium for identifying human behavior in a digital video using convolutional neural networks
Asif et al. Privacy preserving human fall detection using video data
CN111191667A (en) Crowd counting method for generating confrontation network based on multiple scales
CN114582030A (en) Behavior recognition method based on service robot
CN113111767A (en) Fall detection method based on deep learning 3D posture assessment
CN112200176B (en) Method and system for detecting quality of face image and computer equipment
Xu et al. Group activity recognition by using effective multiple modality relation representation with temporal-spatial attention
CN106548194A (en) The construction method and localization method of two dimensional image human joint pointses location model
Asif et al. Sshfd: Single shot human fall detection with occluded joints resilience
Hua et al. Falls prediction based on body keypoints and seq2seq architecture
CN117218709A (en) Household old man real-time state monitoring method based on time deformable attention mechanism
Taghvaei et al. Autoregressive-moving-average hidden Markov model for vision-based fall prediction—An application for walker robot
Khraief et al. Convolutional neural network based on dynamic motion and shape variations for elderly fall detection
CN116402811B (en) Fighting behavior identification method and electronic equipment
CN111008558B (en) Picture/video important person detection method combining deep learning and relational modeling
CN113158791A (en) Human-centered image description labeling method, system, terminal and medium
CN116958769A (en) Method and related device for detecting crossing behavior based on fusion characteristics
Zhan et al. Pictorial structures model based human interaction recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant