CN111008558B - Picture/video important person detection method combining deep learning and relational modeling


Info

Publication number
CN111008558B
Authority
CN
China
Prior art keywords: relation, importance, picture, features, person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911042034.6A
Other languages
Chinese (zh)
Other versions
CN111008558A (en)
Inventor
郑伟诗 (Zheng Weishi)
洪发挺 (Hong Fating)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911042034.6A
Publication of CN111008558A
Application granted
Publication of CN111008558B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 40/10: Recognition of biometric, human-related or animal-related patterns in image or video data; human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; classification techniques
    • G06N 3/02, G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/454: Extraction of image or video features; local feature extraction integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 40/168: Human faces, e.g. facial parts, sketches or expressions; feature extraction; face representation
    • G06V 40/172: Human faces, e.g. facial parts, sketches or expressions; classification, e.g. identification


Abstract

The invention discloses a picture/video important person detection method combining deep learning and relational modeling, which comprises the following steps: S1, extract features from the appearance information and geometric information of each person in the picture/video and fuse them into a personal feature representing high-level semantics; S2, compute relation features, i.e. information that individual features cannot express or cannot express well, by mining the relations among people in the scene and between people and the scene; S3, perform importance classification: the final feature representation of each person extracted by the relation computation model is classified as important or unimportant, the probability of the important class is taken as the importance score, and the person with the highest score is the important person identified by the relation computation model. Through learning, the invention can autonomously construct the relations among the people in a picture/video and between the people and the events in the picture, and automatically infer each person's degree of importance.

Description

Picture/video important person detection method combining deep learning and relational modeling
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a picture/video important person detection method combining deep learning and relational modeling.
Background
Important person detection in pictures/videos refers to identifying, in a given picture containing multiple people, the important person according to clothing, actions, positions, interaction information, and the scene the people are in. The technique supports scene understanding and industries such as live text commentary, film and television shooting, and security monitoring. For example, in live text commentary, what is happening in a scene can be judged from the behavior of the central person in the video, and a textual description can be generated directly. In live sports broadcasting, the method can detect the important person in a sports scene, such as the ball handler in a basketball or football game, who can then be tracked with a camera, reducing manual effort. In security work, important person detection in video surveillance can monitor key protection targets in a scene and analyze people with abnormally high importance scores so that appropriate prevention and control plans can be made.
Existing important person detection for pictures/videos mainly comprises the following three types:
1) Ranking based on pedestrian pairs: to detect important pedestrians in a picture automatically, the most direct approach is to form a pair from every two pedestrians in the picture and predict the relative importance within each pair. The prior art therefore proposes using a regression model to infer the importance relation between two different people in a picture, and from these pairwise importance relations the most important face in the picture is inferred.
2) Ranking based on perceptrons: the most important people in a picture or video strongly affect the recognition and detection of events in the video. The prior art extracts the action features and appearance features of the different players in a basketball game and computes the importance of each player with a perceptron, thereby improving the accuracy of event recognition and detection in basketball games.
3) Judgment based on a multi-layer hybrid relation graph: whether a person is the most important one in the scene depends more on the interaction information between people than on appearance and action information alone. The prior art therefore also proposes constructing, from different features, a hybrid relation graph over the pedestrians detected in a picture to model the relations between them, and improving the well-known ranking algorithm PageRank so that it can rank the importance of pedestrians over the multi-layer hybrid relation graph, finally detecting the most important pedestrian in the picture.
However, the above important person detection methods have many shortcomings. The technique based on pedestrian-pair ranking extracts spatial features and saliency features of pedestrian faces and ranks pedestrian pairs to order the pedestrians' importance; when doing so, it ignores the importance of the other people and the influence of the relations among pedestrians on importance, and it also ignores the effects of context information, motion information, appearance information, and attention information on important pedestrian detection. The technique based on perceptron ranking computes each pedestrian's importance directly with a perceptron from that pedestrian's own features, ignoring the effect of the relations between pedestrians on the importance analysis as well as the effects of spatial information and attention information. The features adopted in the technique based on the multi-layer hybrid relation graph are pre-trained on other tasks and cannot express high-level semantic information well; moreover, that technique considers only the relations between people, not the relations between people and the scene.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a picture/video important person detection method combining deep learning and relational modeling which, through learning, can autonomously construct the relations among the people in a picture/video and between the people and the events in the picture, and automatically infer each person's degree of importance.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The invention discloses a picture/video important person detection method combining deep learning and relational modeling, which comprises the following steps:
S1, extract features from the appearance information and geometric information of each person in the picture/video, fuse them into a personal feature representing high-level semantics, and extract information of the whole picture/video as a global feature;
S2, compute relation features, i.e. information that individual features cannot express or cannot express well, by mining the relations among people in the scene and between people and the scene, and fuse the relation feature rfeat with the personal feature pfeat to generate an importance feature that strongly expresses the importance of each individual in the scene, where the relation feature rfeat contains information on the relations among people and between people and the scene;
S3, perform importance classification: the final feature representation of each person extracted by the relation computation model is classified as important or unimportant, the probability of the important class is taken as the importance score, and the person with the highest score is the important person identified by the relation computation model.
As a preferred technical solution, step S1 specifically includes:
inputting the picture into a face detector or a pedestrian detector to extract the faces or pedestrians in the picture:

$$\mathcal{P} = \{p_i\}_{i=1}^{n}, \qquad p_i = [x_{p_i},\ y_{p_i},\ w_{p_i},\ h_{p_i}]$$

where $[x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]$ is the detection box of pedestrian $p_i$: $[x_{p_i}, y_{p_i}]$ is the position of $p_i$ in the picture and $[w_{p_i}, h_{p_i}]$ are the width and height of the detection box.
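For illustration only, the detection interface assumed in this step can be sketched as follows; detect_boxes is a hypothetical placeholder for any off-the-shelf face or pedestrian detector, and only its output format matters here:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # one detection box (x, y, w, h)

def detect_boxes(image) -> List[Box]:
    """Hypothetical wrapper around an off-the-shelf face or pedestrian
    detector; returns one [x, y, w, h] box per person in the picture."""
    raise NotImplementedError("plug in a real detector here")

# boxes = detect_boxes(image)   # the set P = {p_1, ..., p_n}
```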
As a preferred technical solution, in step S1, the personal information of each person is characterized by the following method:
based on the pedestrian detection boxes, the personal feature pfeat is automatically extracted bottom-up with a convolutional neural network; meanwhile, to study the relations between people and the scene, the global feature gfeat of the whole picture is also extracted:

$$f^{g},\ \{f^{p}_{i}\}_{i=1}^{n} = f_o\big(I,\ \{p_i\}_{i=1}^{n};\ \theta_o\big)$$

where $f^{g}$ denotes the global feature gfeat, $f^{p}_{i}$ denotes the personal feature pfeat, $f_o$ denotes the feature extraction module, $I$ denotes the whole picture, $p_i$ denotes the personal information, and $\theta_o$ are the parameters of the feature extraction module.
As a preferred technical solution, in step S1, the method for fusing into a personal feature representing high-level semantics is as follows:
at the feature-space level, the multiple features are concatenated and then convolved together, thereby generating the high-level semantic personal feature.
As a preferred technical solution, in step S2,
the relations among people and between people and the scene are modeled as follows:
S21, compute the relation between every two people: project the two personal features with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the two, and finally apply a truncation operation that forces values smaller than 0 to 0;
S22, compute the relation between each person and the scene: project the person feature and the scene feature with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the person and the scene, and finally apply a truncation operation that forces values smaller than 0 to 0;
S23, fuse the relations obtained in steps S21 and S22 into a pairwise importance relation by multiplying the two values, so that if either value is small the result is small;
S24, collect the values from step S23 into an n×n matrix whose i-th row represents the relations of all people to the i-th person and is used to integrate the importance relations of all people to the i-th person;
S25, compute the relation feature corresponding to each person.
As a preferred technical solution, the relation feature rfeat corresponding to each person is computed by the following formulas:
a) Compute the relation between people:

$$\varepsilon^{p}_{ij} = \max\!\big(0,\ W_p(W_a f^{p}_{i} + W_b f^{p}_{j})\big)$$

b) Compute the relation between a person and the scene:

$$\varepsilon^{e}_{j} = \max\!\big(0,\ W_e(W_c f^{p}_{j} + W_d f^{g})\big)$$

c) Fuse the two relations:

$$\varepsilon_{ij} = \varepsilon^{p}_{ij}\,\varepsilon^{e}_{j}$$

d) Compute the importance relation:

$$w_{ij} = \frac{\varepsilon_{ij}}{\sum_{k=1}^{n}\varepsilon_{ik}}$$

e) Compute the relation feature rfeat:

$$f^{r}_{i} = \sum_{j=1}^{n} w_{ij}\, W_r f^{p}_{j}$$

f) Construct the importance feature ifeat used for importance judgment:

$$f^{ifeat}_{i} = \mathrm{Concat}\big(f^{p}_{i},\ f^{r,1}_{i},\ \ldots,\ f^{r,r}_{i}\big)$$
All W in the above formulas are matrices, f denotes a feature vector, and the relation ε is a scalar; the superscripts 1, …, r in f) index the r relation computation modules, which can be stacked, and Concat denotes the concatenation operation. The whole relation computation module is modeled as:

$$\{f^{ifeat}_{i}\}_{i=1}^{n} = f_r\big(f^{g},\ \{f^{p}_{i}\}_{i=1}^{n};\ \theta_r\big)$$
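A minimal PyTorch sketch of one relation computation module, following steps a) to f) above, is given below. It is an illustrative reading of the formulas rather than the patented implementation itself: the feature dimension is an assumed value, and the projections W_a, W_b, W_p, W_c, W_d, W_e, W_r are realized as linear layers.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """One sub-relation module: pairwise person-person relations,
    person-scene relations, their fusion, row normalisation, and
    the resulting per-person relation feature rfeat."""
    def __init__(self, dim=512):
        super().__init__()
        self.Wa = nn.Linear(dim, dim)  # projects person i          (step a)
        self.Wb = nn.Linear(dim, dim)  # projects person j          (step a)
        self.Wp = nn.Linear(dim, 1)    # person-person strength     (step a)
        self.Wc = nn.Linear(dim, dim)  # projects person j          (step b)
        self.Wd = nn.Linear(dim, dim)  # projects the scene feature (step b)
        self.We = nn.Linear(dim, 1)    # person-scene strength      (step b)
        self.Wr = nn.Linear(dim, dim)  # projects features mixed in (step e)

    def forward(self, pfeat, gfeat):
        # pfeat: (n, dim) personal features; gfeat: (dim,) global feature
        n = pfeat.size(0)
        # a) pairwise relation, truncated at zero
        pair = self.Wa(pfeat).unsqueeze(1) + self.Wb(pfeat).unsqueeze(0)
        eps_p = torch.relu(self.Wp(pair)).squeeze(-1)                 # (n, n)
        # b) person-scene relation, truncated at zero
        eps_e = torch.relu(self.We(self.Wc(pfeat) + self.Wd(gfeat)))  # (n, 1)
        # c) fusion by multiplication: small if either factor is small
        eps = eps_p * eps_e.view(1, n)                                # (n, n)
        # d) importance relation: normalise each row i over all j
        w = eps / eps.sum(dim=1, keepdim=True).clamp_min(1e-8)
        # e) relation feature rfeat as a weighted sum of projected features
        return w @ self.Wr(pfeat)                                     # (n, dim)

# f) stack r such modules and concatenate their outputs with pfeat:
# ifeat = torch.cat([pfeat] + [m(pfeat, gfeat) for m in modules], dim=1)
```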
as a preferred technical solution, in step S3, the relationship module specifically includes:
based on
Figure BDA0002253112630000057
And different features and relation functions, constructing a person-scene relation diagram +.>
Figure BDA0002253112630000058
And person-to-person relationship diagram->
Figure BDA0002253112630000059
Through the relation diagrams, the relation characteristic rfeats between people and scenes are calculated, and then the relation characteristic rfeats are fused with the original personal characteristic pfeats to judge the importance characteristic if eats of the importance score.
As a preferred technical solution, in step S3,
the obtained importance features ifeat are input into a neural network composed of fully connected layers for classification, and the score of the important-person class is taken as the importance score; the score computation can be written as:

$$s_i = f_s\big(f^{ifeat}_{i};\ \theta_s\big)$$

where $f_s$ is the importance classification module and $\theta_s$ its corresponding parameters. The whole network framework can be formulated as:

$$\{s_i\}_{i=1}^{n} = f_s\Big(f_r\big(f_o(I,\ \{p_i\}_{i=1}^{n};\ \theta_o);\ \theta_r\big);\ \theta_s\Big)$$
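As a minimal sketch of this classification head (the layer widths ifeat_dim and hidden are illustrative assumptions), a small fully connected network produces two-way logits, and the softmax probability of the important class serves as the score $s_i$:

```python
import torch
import torch.nn as nn

class ImportanceClassifier(nn.Module):
    """f_s: map each importance feature ifeat to an important/unimportant
    decision; the important-class probability is the importance score."""
    def __init__(self, ifeat_dim=1024, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ifeat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),  # two classes: unimportant / important
        )

    def forward(self, ifeat):                      # ifeat: (n, ifeat_dim)
        logits = self.mlp(ifeat)
        return torch.softmax(logits, dim=1)[:, 1]  # P(important) per person

# The person with the highest score is selected:
# scores = classifier(ifeat); most_important = scores.argmax()
```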
compared with the prior art, the invention has the following advantages and beneficial effects:
1. Hand-tuning of framework parameters is eliminated: the deep learning algorithm learns the parameters autonomously, so a better set of parameters can be selected.
2. The relation computation module can autonomously learn the relation graphs among people in the scene and between people and the scene, adaptively encode the relation features, and understand the relations in the scene from a higher level.
3. The invention requires little additional manual annotation: it needs neither annotations of pedestrian poses nor computation of how sharply each person appears in the picture. Given pedestrians detected by a detector, annotating only the important pedestrians suffices for fast training, which previous studies could not offer.
4. The relation computation module is embeddable and stackable.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 illustrates the feature representation of the present invention;
FIG. 3 illustrates the construction of the relation graphs in the present invention;
FIG. 4 (a) and FIG. 4 (b) show important pedestrian detection results of the present invention;
FIG. 5 is a schematic diagram of the relation module of the model in the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
As shown in FIG. 1, the method of the present invention is POINT (deep importance relatIons NeTworks). First, all pedestrians in the picture are detected by a face detector; then (a) the feature representation module extracts personal features and global features, which are input into the relation computation module (b). The whole relation module consists of r sub-relation modules; in each sub-relation module, a person-to-person (p2p) relation graph and a person-to-event (p2e) relation graph are constructed, the importance relations are estimated from the two graphs, the relation features are then encoded, and they are concatenated with the original personal features pfeat to obtain the importance features. Finally, the importance features are input into the importance classification module (c), which scores the importance of each person.
Specifically, the method for detecting important figures in pictures/videos by combining deep learning and relational modeling in this embodiment includes the following steps:
(1) Pedestrian detection and important pedestrian feature extraction in the picture;
given a picture containing multiple pedestrians, the invention firstly inputs the picture into a face detector or a detection frame of the pedestrian detector for extracting the face or the pedestrian in the picture
Figure BDA0002253112630000071
wherein [xpi ,y pi ,w pi ,h pi ]Is p i Detection frame, wherein [ x ] pi ,y pi ]Is a pedestrian p i Position in picture, [ w ] pi ,h pi ]Is p i The width and height of the detected box in the picture. In particular, face or personal information alone is not sufficient to characterize the overall information of a person, e.g., location geometry information. The present application refers to contextual information and location of a persona in order to better characterize the persona's personal information. Based on the pedestrian detection frame, the embodiment utilizes the convolutional neural network to automatically extract the personal information pfeat from bottom to top, and simultaneously, in order to study the relationship between people and scenes, the global feature gfeat of the whole picture is also extracted:
$$f^{g},\ \{f^{p}_{i}\}_{i=1}^{n} = f_o\big(I,\ \{p_i\}_{i=1}^{n};\ \theta_o\big)$$

where $f^{g}$ denotes the global feature gfeat, $f^{p}_{i}$ denotes the personal feature pfeat, $f_o$ denotes the feature extraction module, $I$ denotes the whole picture, $p_i$ denotes the personal information, and $\theta_o$ are the parameters of the feature extraction module. The specific feature extraction flow is shown in FIG. 2: the appearance information is divided into an interior part and an exterior part, where the interior region extracts the person's intrinsic appearance information and the exterior region extracts the person's appearance together with the context information of the surrounding environment, ensuring diverse person information. Meanwhile, a location map expressed with 0/1 values represents the position information of each person, and the global scene information of the whole photo is likewise extracted with a convolutional neural network.
(2) Relation computation:
Relying on the pfeat and gfeat obtained in the previous step, an embeddable and stackable relation computation module is designed to model the relations among people and between people and the scene and to compute the relation feature rfeat corresponding to each person:
S21, compute the relation between every two people: project the two personal features with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the two, and finally apply a truncation operation that forces values smaller than 0 to 0;
S22, compute the relation between each person and the scene: project the person feature and the scene feature with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the person and the scene, and finally apply a truncation operation that forces values smaller than 0 to 0;
S23, fuse the relations obtained in steps S21 and S22 into a pairwise importance relation by multiplying the two values, so that if either value is small the result is small;
S24, collect the values from step S23 into an n×n matrix whose i-th row represents the relations of all people to the i-th person and is used to integrate the importance relations of all people to the i-th person;
S25, compute the relation feature corresponding to each person.
The calculation process is as follows:
a) First, compute the relation between people:

$$\varepsilon^{p}_{ij} = \max\!\big(0,\ W_p(W_a f^{p}_{i} + W_b f^{p}_{j})\big)$$

b) Compute the relation between a person and the scene:

$$\varepsilon^{e}_{j} = \max\!\big(0,\ W_e(W_c f^{p}_{j} + W_d f^{g})\big)$$

c) Fuse the two relations:

$$\varepsilon_{ij} = \varepsilon^{p}_{ij}\,\varepsilon^{e}_{j}$$

d) Compute the importance relation:

$$w_{ij} = \frac{\varepsilon_{ij}}{\sum_{k=1}^{n}\varepsilon_{ik}}$$

e) Compute the relation feature rfeat:

$$f^{r}_{i} = \sum_{j=1}^{n} w_{ij}\, W_r f^{p}_{j}$$

f) Construct the importance feature ifeat used for importance judgment:

$$f^{ifeat}_{i} = \mathrm{Concat}\big(f^{p}_{i},\ f^{r,1}_{i},\ \ldots,\ f^{r,r}_{i}\big)$$
The whole process of constructing the importance feature ifeat is shown in FIG. 5, where Eq. 3 corresponds to process c) and Eq. 4 to process d). The constructed relation graphs are shown in FIG. 3: in each graph, the relations among people and between people and the scene are presented as numerical values, and it can be observed that non-important people point clearly toward the important person, with pointing values notably higher than those toward other people. The edge values of the different relation graphs are computed from the features produced by the feature representation module and fused to form the importance relation. All W in the above formulas are matrices, f denotes a feature vector, and the relation ε is a scalar; the superscripts 1, …, r in f) index the r relation computation modules, and Concat denotes the concatenation operation. The whole relation computation module can be modeled as:

$$\{f^{ifeat}_{i}\}_{i=1}^{n} = f_r\big(f^{g},\ \{f^{p}_{i}\}_{i=1}^{n};\ \theta_r\big)$$
(3) Importance classification:
based on
Figure BDA0002253112630000101
And different features and relation functions, constructing a person-scene relation diagram +.>
Figure BDA0002253112630000102
And person-to-person relationship diagram->
Figure BDA0002253112630000103
Through the relation diagrams, the relation characteristic rfeats between people and scenes are calculated, and then the relation characteristic rfeats are fused with the original personal characteristic pfeats to judge the importance characteristic if eats of the importance score. The importance feature ifeat obtained in the previous step is input into a neural network composed of all connected layers for classification, and the score divided into important characters is regarded as an importance score. The score calculation may be written as:
$$s_i = f_s\big(f^{ifeat}_{i};\ \theta_s\big)$$

where $f_s$ is the importance classification module and $\theta_s$ its corresponding parameters. With this, the whole network framework can be formulated as:

$$\{s_i\}_{i=1}^{n} = f_s\Big(f_r\big(f_o(I,\ \{p_i\}_{i=1}^{n};\ \theta_o);\ \theta_r\big);\ \theta_s\Big)$$
As shown in FIG. 4 (a) and FIG. 4 (b), which present important pedestrian detection results based on the technique of the present invention, FIG. 4 (a) shows results on the NCAA Basketball Image Dataset and FIG. 4 (b) shows results on the Multi-scene Important People Image Dataset. The detection accuracy of the invention exceeds that of the best existing algorithm (PersonRank) by more than 23.2% (NCAA) and 7% (MS).
All parameters of the invention are deep network parameters and are optimized autonomously by stochastic gradient descent.
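A minimal training sketch under illustrative assumptions (per-person two-class cross-entropy loss, PyTorch SGD); the model, data loader, and hyper-parameters are placeholders, not values specified by the invention:

```python
import torch
import torch.nn as nn

# model: feature extraction + relation modules + classifier, end to end
# loader: yields (image, boxes, labels), labels in {0, 1} for each person
def train(model, loader, epochs=30, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image, boxes, labels in loader:
            logits = model(image, boxes)   # (n, 2) per-person class logits
            loss = ce(logits, labels)      # important vs unimportant
            opt.zero_grad()
            loss.backward()
            opt.step()                     # stochastic gradient descent step
```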
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and falls within the protection scope of the present invention.

Claims (6)

1. A picture/video important person detection method combining deep learning and relational modeling, characterized by comprising the following steps:
S1, extract features from the appearance information and geometric information of each person in the picture/video, fuse them into a personal feature representing high-level semantics, and extract information of the whole picture/video as a global feature;
S2, compute relation features, i.e. information that individual features cannot express or cannot express well, by mining the relations among people in the scene and between people and the scene, and fuse the relation feature rfeat with the personal feature pfeat to generate an importance feature that strongly expresses the importance of each individual in the scene, where the relation feature rfeat contains information on the relations among people and between people and the scene;
in step S2,
the relations among people and between people and the scene are modeled as follows:
S21, compute the relation between every two people: project the two personal features with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the two, and finally apply a truncation operation that forces values smaller than 0 to 0;
S22, compute the relation between each person and the scene: project the person feature and the scene feature with matrices and add them, project the sum with another matrix to obtain a value representing the connection strength between the person and the scene, and finally apply a truncation operation that forces values smaller than 0 to 0;
S23, fuse the relations obtained in steps S21 and S22 into a pairwise importance relation by multiplying the two values, so that if either value is small the result is small;
S24, collect the values from step S23 into an n×n matrix whose i-th row represents the relations of all people to the i-th person and is used to integrate the importance relations of all people to the i-th person;
S25, compute the relation feature corresponding to each person;
the relation feature rfeat corresponding to each person is computed by the following formulas:
a) Compute the relation between people:

$$\varepsilon^{p}_{ij} = \max\!\big(0,\ W_p(W_a f^{p}_{i} + W_b f^{p}_{j})\big)$$

b) Compute the relation between a person and the scene:

$$\varepsilon^{e}_{j} = \max\!\big(0,\ W_e(W_c f^{p}_{j} + W_d f^{g})\big)$$

c) Fuse the two relations:

$$\varepsilon_{ij} = \varepsilon^{p}_{ij}\,\varepsilon^{e}_{j}$$

d) Compute the importance relation:

$$w_{ij} = \frac{\varepsilon_{ij}}{\sum_{k=1}^{n}\varepsilon_{ik}}$$

e) Compute the relation feature rfeat:

$$f^{r}_{i} = \sum_{j=1}^{n} w_{ij}\, W_r f^{p}_{j}$$

f) Construct the importance feature ifeat used for importance judgment:

$$f^{ifeat}_{i} = \mathrm{Concat}\big(f^{p}_{i},\ f^{r,1}_{i},\ \ldots,\ f^{r,r}_{i}\big)$$
all W in the above formulas are matrices, f denotes a feature vector, and the relation ε is a scalar; the superscripts 1, …, r in f) index the r relation computation modules, which can be stacked, and Concat denotes the concatenation operation; the whole relation computation module is modeled as:

$$\{f^{ifeat}_{i}\}_{i=1}^{n} = f_r\big(f^{g},\ \{f^{p}_{i}\}_{i=1}^{n};\ \theta_r\big)$$

S3, perform importance classification: the final feature representation of each person extracted by the relation computation model is classified as important or unimportant, the probability of the important class is taken as the importance score, and the person with the highest score is the important person identified by the relation computation model.
2. The picture/video important person detection method combining deep learning and relational modeling according to claim 1, wherein step S1 specifically comprises:
inputting the picture into a face detector or a pedestrian detector to extract the faces or pedestrians in the picture:

$$\mathcal{P} = \{p_i\}_{i=1}^{n}, \qquad p_i = [x_{p_i},\ y_{p_i},\ w_{p_i},\ h_{p_i}]$$

where $[x_{p_i}, y_{p_i}, w_{p_i}, h_{p_i}]$ is the detection box of pedestrian $p_i$: $[x_{p_i}, y_{p_i}]$ is the position of $p_i$ in the picture and $[w_{p_i}, h_{p_i}]$ are the width and height of the detection box.
3. The picture/video important person detection method combining deep learning and relational modeling according to claim 2, wherein, in step S1, the personal information of each person is characterized by the following method:
based on the pedestrian detection boxes, the personal feature pfeat is automatically extracted bottom-up with a convolutional neural network; meanwhile, to study the relations between people and the scene, the global feature gfeat of the whole picture is also extracted:

$$f^{g},\ \{f^{p}_{i}\}_{i=1}^{n} = f_o\big(I,\ \{p_i\}_{i=1}^{n};\ \theta_o\big)$$

where $f^{g}$ denotes the global feature gfeat, $f^{p}_{i}$ denotes the personal feature pfeat, $f_o$ denotes the feature extraction module, $I$ denotes the whole picture, $p_i$ denotes the personal information, and $\theta_o$ are the parameters of the feature extraction module.
4. The picture/video important person detection method combining deep learning and relational modeling according to claim 1, wherein, in step S1, the method for fusing into a personal feature representing high-level semantics is:
at the feature-space level, the multiple features are concatenated and then convolved together, thereby generating the high-level semantic personal feature.
5. The picture/video important person detection method combining deep learning and relational modeling according to claim 1, wherein, in step S3, the relation module specifically comprises:
based on the extracted features $f^{g}$ and $\{f^{p}_{i}\}_{i=1}^{n}$ and different relation functions, constructing a person-scene relation graph $\mathcal{G}^{p2e}$ and a person-person relation graph $\mathcal{G}^{p2p}$; through these relation graphs, the relation features rfeat among people and between people and the scene are computed and then fused with the original personal features pfeat to obtain the importance features ifeat used to judge the importance score.
6. The picture/video important person detection method combining deep learning and relational modeling according to claim 5, wherein, in step S3,
the obtained importance features ifeat are input into a neural network composed of fully connected layers for classification, and the score of the important-person class is taken as the importance score; the score computation can be written as:

$$s_i = f_s\big(f^{ifeat}_{i};\ \theta_s\big)$$

where $f_s$ is the importance classification module and $\theta_s$ its corresponding parameters; the whole network framework can be formulated as:

$$\{s_i\}_{i=1}^{n} = f_s\Big(f_r\big(f_o(I,\ \{p_i\}_{i=1}^{n};\ \theta_o);\ \theta_r\big);\ \theta_s\Big)$$
CN201911042034.6A 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling Active CN111008558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911042034.6A CN111008558B (en) 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling


Publications (2)

Publication Number Publication Date
CN111008558A CN111008558A (en) 2020-04-14
CN111008558B (en) 2023-05-30

Family

ID=70111750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911042034.6A Active CN111008558B (en) 2019-10-30 2019-10-30 Picture/video important person detection method combining deep learning and relational modeling

Country Status (1)

Country Link
CN (1) CN111008558B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967312B (en) * 2020-07-06 2023-03-24 中央民族大学 Method and system for identifying important persons in picture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551916A (en) * 2009-04-16 2009-10-07 浙江大学 Method and system of three-dimensional scene modeling based on ontology
CN101957835A (en) * 2010-08-16 2011-01-26 无锡市浏立方科技有限公司 Complicated relationship and context information-oriented semantic data model
CN108416314A (en) * 2018-03-16 2018-08-17 中山大学 Important face detection method for pictures
CN108446625A (en) * 2018-03-16 2018-08-24 中山大学 Graph-model-based important pedestrian detection method for pictures
CN110232330A (en) * 2019-05-23 2019-09-13 复钧智能科技(苏州)有限公司 Pedestrian re-identification method based on video detection


Also Published As

Publication number Publication date
CN111008558A (en) 2020-04-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant