CN115601819A - Multi-modal violence tendency recognition method, apparatus, device and medium

Multi-modal violence tendency recognition method, apparatus, device and medium

Info

Publication number
CN115601819A
CN115601819A (application CN202211503571.8A)
Authority
CN
China
Prior art keywords
face
image
feature
psychological
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211503571.8A
Other languages
Chinese (zh)
Other versions
CN115601819B (en)
Inventor
张伟
蒋静文
何得淮
何行知
姚佳
路浩
王垒
皮志兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Provincial Prison Administration
West China Hospital of Sichuan University
Original Assignee
Sichuan Provincial Prison Administration
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Provincial Prison Administration and West China Hospital of Sichuan University
Priority claimed from CN202211503571.8A
Publication of CN115601819A
Application granted
Publication of CN115601819B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/174 Facial expression recognition
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a multi-modal violence tendency recognition method, apparatus, device and medium, belonging to the technical field of artificial intelligence. The method includes: acquiring the whole face image, M face part regions and L small face regions of each image to be detected for each category of people; acquiring psychological assessment data of each category of people, and obtaining psychological features of each category of people from that data; acquiring background information of each category of people, and performing format adjustment on it to obtain background features of each category of people; and processing the whole face images, M face part regions, L small face regions, psychological features and background features through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result. The method is applicable to various categories of people, improves the objectivity and accuracy of violence tendency recognition, and improves robustness and generalization ability.

Description

Multi-modal violence tendency recognition method, apparatus, device and medium
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a multi-modal violence tendency recognition method, apparatus, device and medium.
Background
Currently, most violence-related characteristics must be evaluated by a professional psychiatrist or psychologist. In practice, violence tendency is mostly identified through questionnaires or scales, whose results can be disguised by the respondent to some extent; the evaluation is time-consuming and labor-intensive, and the predictive accuracy of most such tools is low. The evolutionary psychology literature shows that human perception is adaptively tuned: people judge others' honesty, personality, intelligence, sexual orientation, political orientation and violence tendency from facial features. Existing emotion recognition technology based on face pictures stops at face recognition and expression recognition, leaving a large gap to a real understanding of human emotion and personality, and offering no scheme for recognizing violence tendency. The prior art also lacks a scheme that identifies violence tendency from multiple modalities of data, so recognition accuracy is poor.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a multi-modal violence tendency recognition method, apparatus, device and medium.
In a first aspect, an embodiment of the present application provides a multi-modal violence tendency recognition method, the method including:
acquiring N images to be detected for each category of people, and acquiring the whole face image, M face part regions and L small face regions of each image to be detected;
acquiring psychological assessment data of each category of people, and obtaining psychological features of each category of people from the psychological assessment data;
acquiring background information of each category of people, and performing format adjustment on the background information to obtain background features of each category of people;
and processing the whole face image, M face part regions and L small face regions of each image to be detected, together with the psychological features and background features of each category of people, through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result.
In an embodiment, acquiring the whole face image, M face part regions and L small face regions of each image to be detected includes:
performing face detection on each image to be detected to obtain a face frame and face key points of each image to be detected;
performing face alignment on each image to be detected according to its face key points to obtain an aligned image of each image to be detected;
cropping each aligned image according to its face frame and/or face key points to obtain the whole face image of each image to be detected;
dividing the face region of each aligned image into face parts according to its face key points to obtain the M face part regions corresponding to each image to be detected;
and dividing each whole face image into small face regions according to facial muscle prior information and the face key points of each whole face image to obtain the L small face regions corresponding to each image to be detected.
In one embodiment, obtaining the psychological features of each category of people from the psychological assessment data includes:
calculating multi-dimensional psychological scores from the psychological assessment data according to the scoring rules of each psychological assessment tool;
standardizing the multi-dimensional psychological scores to obtain psychological multi-dimensional features, and concatenating the psychological multi-dimensional features to obtain the psychological features of each category of people;
the format adjustment of the background information includes:
extracting general demographic data and population-specific information data from the background information of each category of people;
and standardizing or encoding the general demographic data and the population-specific information data according to their data types.
In one embodiment, the multi-modal violence tendency recognition model includes a face module, and the face module includes a whole face image feature extraction network, a face part region feature extraction network and a small face region feature extraction network;
processing the whole face image, M face part regions and L small face regions of each image to be detected through the multi-modal violence tendency recognition model includes:
processing the whole face image of each image to be detected through the whole face image feature extraction network to extract a first whole face feature;
processing the M face part regions of each image to be detected through the face part region feature extraction network to extract a first face part region feature;
processing the L small face regions of each image to be detected through the small face region feature extraction network to extract a first small face region feature;
obtaining a face total output feature from the first whole face feature, the first face part region feature and the first small face region feature;
obtaining a first output feature map, a second output feature map and a third output feature map from the fourth Sefuse_Net sub-module of the whole face image feature extraction network, the face part region feature extraction network and the small face region feature extraction network, respectively;
and obtaining a face fusion feature from the first output feature map, the second output feature map and the third output feature map.
In an embodiment, processing the M face part regions through the face part region feature extraction network to extract the first face part region feature includes:
processing the M face part regions of each image to be detected through the face part region feature extraction network to extract M face part region features;
averaging the extracted M face part region features to obtain the first face part region feature;
processing the L small face regions through the small face region feature extraction network to extract the first small face region feature includes:
processing the L small face regions of each image to be detected through the small face region feature extraction network to extract L small face region features;
averaging the extracted L small face region features to obtain the first small face region feature;
obtaining the face total output feature from the first whole face feature, the first face part region feature and the first small face region feature includes:
concatenating the first whole face feature, the first face part region feature and the first small face region feature to obtain the face total output feature;
obtaining the face fusion feature from the first output feature map, the second output feature map and the third output feature map includes:
passing the first output feature map, the second output feature map and the third output feature map through a global average pooling layer to obtain a second whole face feature, M second face part region features and L second small face region features, respectively;
averaging the M second face part region features and the L second small face region features to obtain a third face part region feature and a third small face region feature, respectively;
and concatenating the second whole face feature, the third face part region feature and the third small face region feature to obtain the face fusion feature.
In one embodiment, the multi-modal violence tendency recognition model further includes a psychological module and a crowd background module;
the psychological module includes a first fully connected network branch and a second fully connected network branch;
the crowd background module includes a third fully connected network branch and a fourth fully connected network branch;
processing the psychological features and background features of each category of people through the multi-modal violence tendency recognition model includes:
processing the psychological features through the first and second fully connected network branches to obtain a psychological total output feature and a psychological fusion feature, respectively;
and processing the background features through the third and fourth fully connected network branches to obtain a crowd background total output feature and a crowd background fusion feature, respectively.
In one embodiment, the multi-modal violence tendency recognition model further includes an other modality module, a fusion module and a classification module;
the other modality module includes a fifth fully connected network branch and a sixth fully connected network branch;
the method further includes:
processing other modality features through the fifth and sixth fully connected network branches to obtain an other modality fusion feature and an other modality total output feature, respectively;
processing the face fusion feature, the psychological fusion feature, the crowd background fusion feature and the other modality fusion feature through the fusion module to obtain a first output feature, a second output feature, a third output feature and a fourth output feature, respectively; taking the feature mean of the first, second, third and fourth output features as an overall fusion output feature; concatenating the face total output feature, the psychological total output feature, the crowd background total output feature, the other modality total output feature and the overall fusion output feature to obtain a final output feature, and inputting the final output feature into a fully connected layer to obtain a comprehensive feature;
and classifying the comprehensive feature for violence tendency through the classification module, the violence tendency categories including: violent tendency and non-violent tendency.
In a second aspect, the present application provides a multi-modal violence tendency recognition apparatus, the apparatus including:
a first acquisition module, configured to acquire N images to be detected for each category of people, and to acquire the whole face image, M face part regions and L small face regions of each image to be detected;
a second acquisition module, configured to acquire psychological assessment data of each category of people and to obtain psychological features of each category of people from the psychological assessment data;
a third acquisition module, configured to acquire background information of each category of people and to perform format adjustment on the background information to obtain background features of each category of people;
and a recognition module, configured to process the whole face image, M face part regions and L small face regions of each image to be detected, together with the psychological features and background features, through the multi-modal violence tendency recognition model to obtain a violence tendency recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing a computer program which, when run by the processor, performs the multi-modal violence tendency recognition method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a processor, performs the multi-modal violence tendency recognition method provided in the first aspect.
The multi-modal violence tendency recognition method, apparatus, device and medium provided by the present application acquire N images to be detected for each category of people, together with the whole face image, M face part regions and L small face regions of each image to be detected; acquire psychological assessment data of each category of people and obtain psychological features from it; acquire background information of each category of people and perform format adjustment to obtain background features; and process the whole face images, face part regions, small face regions, psychological features and background features through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result. The scheme is applicable to various categories of people; by recognizing violence tendency on the basis of multi-modal data, it improves the objectivity and accuracy of violence tendency recognition as well as robustness and generalization ability.
Drawings
In order to more clearly explain the technical solutions of the present application, the drawings needed to be used in the embodiments are briefly introduced below, and it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope of protection of the present application. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart of a multi-modal violence tendency recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an image to be detected provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a whole face image according to an embodiment of the present application;
FIG. 4 is the first schematic diagram of face part regions provided in an embodiment of the present application;
FIG. 5 is the second schematic diagram of face part regions provided in an embodiment of the present application;
FIG. 6 is the first schematic diagram of small face regions provided in an embodiment of the present application;
FIG. 7 is the second schematic diagram of small face regions according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a face module provided in an embodiment of the present application;
FIG. 9 is the first schematic structural diagram of a multi-modal violence tendency recognition model provided by an embodiment of the present application;
FIG. 10 is the second schematic structural diagram of the multi-modal violence tendency recognition model provided in an embodiment of the present application;
FIG. 11 is the third schematic structural diagram of the multi-modal violence tendency recognition model provided in an embodiment of the present application;
FIG. 12 is the fourth schematic structural diagram of the multi-modal violence tendency recognition model provided by an embodiment of the present application;
FIG. 13 is the fifth schematic structural diagram of the multi-modal violence tendency recognition model provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a multi-modal violence tendency recognition apparatus provided in an embodiment of the present application.
Reference numerals: 1400 - multi-modal violence tendency recognition apparatus, 1401 - first acquisition module, 1402 - second acquisition module, 1403 - third acquisition module, 1404 - recognition module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Hereinafter, the terms "including", "having", and their derivatives, as used in various embodiments of the present application, are intended to indicate only specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the existence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another, and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present application belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments.
Example 1
The embodiment of the present disclosure provides a multi-modal violence tendency recognition method.
Referring to fig. 1, the multi-modal violence tendency recognition method includes steps S101 to S104, which are explained below.
Step S101, obtaining N images to be detected for each category of people, and obtaining the whole face image, M face part regions and L small face regions of each image to be detected.
In this embodiment, each image to be detected may be any image containing a face, and N may be any value greater than or equal to 2, which is not limited herein. Referring to fig. 2, fig. 2 shows an image to be detected containing a face; the face shown in fig. 2 is a virtual face generated by an image generation website.
In this embodiment, the whole face image contains the whole of the facial features. Referring to fig. 3, fig. 3 shows the whole face image cropped from the image to be detected shown in fig. 2. A face part region is a facial feature part obtained by segmenting the whole face image according to the face key points. There may be K face key points; exemplarily, K may be 68, and M may be 9. A small face region is a local face region obtained by segmenting the whole face image according to facial muscle prior information and the face key points; exemplarily, L may be 40. In other cases, M, L and K may take other values, which are not limited herein.
In an embodiment, acquiring the whole face image, M face part regions and L small face regions of each image to be detected includes:
performing face detection on each image to be detected to obtain a face frame and face key points of each image to be detected;
performing face alignment on each image to be detected according to its face key points to obtain an aligned image of each image to be detected;
cropping each aligned image according to its face frame and/or face key points to obtain the whole face image of each image to be detected;
dividing the face region of each aligned image into face parts according to its face key points to obtain the M face part regions corresponding to each image to be detected;
and dividing each whole face image into small face regions according to facial muscle prior information and the face key points of each whole face image to obtain the L small face regions corresponding to each image to be detected.
Exemplarily, the process of obtaining the whole face image of each image to be detected includes:
performing face detection on each image to be detected through the Dlib software package to obtain a face detection frame and 68 face key points of each image to be detected; applying an affine transformation to the face key points to align each image to be detected on the basis of the eyes; and cropping each image to be detected according to the 9th, 20th, 1st and 17th key points and the face detection frame to obtain the whole face image. Specifically, the maximum-range rectangular frame covering both the face detection frame and the frame formed by the 9th, 20th, 1st and 17th points can be obtained, and the image to be detected is cropped by this rectangle, which eliminates as much non-face background as possible and reduces interference with face feature extraction. The 68 generated face key points are denoted [x1:68, y1:68] for the subsequent steps. For example, in this embodiment, the coordinates of the 1st point are [4,114], of the 9th point [202,384], of the 20th point [100,49], and of the 68th point [176,284].
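For illustration, the following Python sketch shows how this preprocessing step could look with Dlib's 68-point landmark model. The model file name, the eye-based rotation and the boundary handling are assumptions made for the sketch, not the patent's exact implementation.

```python
# A minimal sketch of the face preprocessing step described above, using
# Dlib's 68-point landmark model (an assumed, commonly available file).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def preprocess_face(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)                      # face detection frame(s)
    if not rects:
        return None
    rect = rects[0]
    shape = predictor(gray, rect)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])

    # Align on the eyes: rotate so the line between eye centers is horizontal.
    left_eye = pts[36:42].mean(axis=0)             # Dlib uses 0-based indices
    right_eye = pts[42:48].mean(axis=0)
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))
    center = tuple(((left_eye + right_eye) / 2).astype(np.float32))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(image_bgr, M, image_bgr.shape[1::-1])
    pts = (M[:, :2] @ pts.T + M[:, 2:]).T.astype(int)  # rotate landmarks too

    # Crop the maximum rectangle covering the detection frame and key points
    # 1, 9, 17, 20 (1-based in the text, hence indices 0, 8, 16, 19 here).
    anchors = pts[[0, 8, 16, 19]]
    x0 = min(rect.left(), anchors[:, 0].min())
    y0 = min(rect.top(), anchors[:, 1].min())
    x1 = max(rect.right(), anchors[:, 0].max())
    y1 = max(rect.bottom(), anchors[:, 1].max())
    whole_face = aligned[max(y0, 0):y1, max(x0, 0):x1]
    return whole_face, pts
```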
Exemplarily, obtaining the M face part regions of each image to be detected includes:
dividing the whole face image into 9 face parts using the 68 recorded face key points, the rectangular frame of each of the 9 face parts being recorded as ([upper-left corner coordinates], [lower-right corner coordinates]).
Referring to fig. 4, fig. 4 is the first schematic diagram of the face part regions, where the coordinates of each face part region are as follows: left eye part: ([x18-10, y20-20], [x28, y30]); right eye part: ([x28, y25-20], [x27+10, y30]); eyebrow center part: ([x21, y21-20], [x24, y30]); left cheek part: ([x4, y29], [x32, y51]); right cheek part: ([x36, y29], [x14, y53]); nose part: ([x20, y29], [x25, y53]); left mouth corner part: ([x6, y34], [x35, y6]); right mouth corner part: ([x34, y34], [x12, y12]); mouth part: ([x49-20, y34], [x55+20, y58+40]).
Referring to fig. 5, fig. 5 is the second schematic diagram of the face part regions; it shows the whole face image of fig. 3 divided into the 9 face part regions: the left eye region, the eyebrow center region, the right eye region, the left cheek region, the nose region, the right cheek region, the left mouth corner region, the lip region and the right mouth corner region, as sketched in code below.
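A minimal sketch of this part-region cropping, assuming the landmark array produced by the preprocessing sketch above. The two mouth-corner boxes are omitted because their printed coordinates appear garbled in the text; the remaining boxes follow the listed rules verbatim.

```python
# A sketch of cropping 7 of the 9 face part regions from the aligned whole
# face image, using the rectangle rules listed above (1-based key point
# indices, as in the text). Margins are taken verbatim from the description.
import numpy as np

def crop_part_regions(whole_face, pts):
    """pts: (68, 2) landmark array; x(i)/y(i) follow the text's 1-based indexing."""
    x = lambda i: pts[i - 1, 0]
    y = lambda i: pts[i - 1, 1]
    boxes = {
        "left_eye":       (x(18) - 10, y(20) - 20, x(28), y(30)),
        "right_eye":      (x(28), y(25) - 20, x(27) + 10, y(30)),
        "eyebrow_center": (x(21), y(21) - 20, x(24), y(30)),
        "left_cheek":     (x(4), y(29), x(32), y(51)),
        "right_cheek":    (x(36), y(29), x(14), y(53)),
        "nose":           (x(20), y(29), x(25), y(53)),
        "mouth":          (x(49) - 20, y(34), x(55) + 20, y(58) + 40),
    }
    regions = {}
    for name, (x0, y0, x1, y1) in boxes.items():
        x0, y0 = max(int(x0), 0), max(int(y0), 0)
        regions[name] = whole_face[y0:int(y1), x0:int(x1)]
    return regions
```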
Illustratively, obtaining the L small face regions of each image to be detected includes:
locating, based on research findings, muscle positions related to violence, and dividing the whole face image into 40 small face regions through the 68 face key points, where the 40 small face regions cover the following muscles: the eyebrow-lowering muscles (depressor supercilii), the frown muscles (corrugator supercilii), the orbicularis oculi, the levator palpebrae superioris, the nasolabial muscles, the upper labial muscles, the orbicularis oris, and the lower labial muscles.
Referring to fig. 6, fig. 6 is the first schematic diagram of the small face regions; as shown by the small squares in fig. 6, 40 center coordinates [sx1:40, sy1:40] are determined from the face key points, and the 40 small face regions are cropped from the whole face image according to the determined center coordinates [sx1:40, sy1:40].
Referring to fig. 7, fig. 7 is the second schematic diagram of the small face regions; it shows the whole face image of fig. 3 divided into the 40 small face regions, covering the muscle areas listed above: the eyebrow-lowering muscles, the frown muscles, the orbicularis oculi, the levator palpebrae superioris, the nasolabial muscles, the upper labial muscles, the orbicularis oris, the lower labial muscles, and the like.
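A sketch of the small-region cropping, under the assumption that the 40 muscle-prior centers have already been derived from the 68 key points; the fixed patch size is an illustrative choice, since this passage does not specify one.

```python
# A sketch of cropping the L = 40 small face regions: cut a fixed-size
# square patch around each muscle-prior center coordinate [sx, sy].
# The side length is an illustrative assumption.
import numpy as np

def crop_small_regions(whole_face, centers, side=32):
    """centers: (40, 2) array of [sx, sy] muscle-region center coordinates."""
    half = side // 2
    h, w = whole_face.shape[:2]
    patches = []
    for sx, sy in centers.astype(int):
        x0, y0 = max(sx - half, 0), max(sy - half, 0)
        x1, y1 = min(sx + half, w), min(sy + half, h)
        patches.append(whole_face[y0:y1, x0:x1])
    return patches  # the 40 small face regions
```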
Step S102, acquiring psychological assessment data of each category of people, and obtaining psychological features of each category of people from the psychological assessment data.
In an embodiment, obtaining the psychological features of each category of people from the psychological assessment data includes:
calculating multi-dimensional psychological scores from the psychological assessment data according to the scoring rules of each psychological assessment tool;
standardizing the multi-dimensional psychological scores to obtain psychological multi-dimensional features, and concatenating the psychological multi-dimensional features to obtain the psychological features of each category of people.
Exemplarily, take the Big Five personality questionnaire (60-item version) as an example: items are answered on a 5-point Likert scale, i.e. each item is scored from 1 to 5, with 60 items in total. The scoring rules yield five dimension scores: neuroticism, extraversion, openness, agreeableness and conscientiousness. The neuroticism dimension corresponds to items 1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51 and 56, of which items 1, 6, 21 and 31 are reverse-scored; letting Xi denote the score of item i, the neuroticism dimension score is (6-X1) + (6-X6) + X11 + X16 + (6-X21) + X26 + (6-X31) + X36 + X41 + X46 + X51 + X56. Similarly, the extraversion, openness, agreeableness and conscientiousness dimension scores are calculated according to the corresponding rules specified for the questionnaire, yielding 5 dimension scores in total; in the same manner, the dimension scores of all remaining psychological assessments are calculated. All dimension scores obtained from the psychological assessments are standardized with the Z-score algorithm; all standardized dimension scores are the psychological multi-dimensional features, and all psychological multi-dimensional features are concatenated to obtain the psychological features of each category of people.
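The scoring rule above translates directly into code. A minimal sketch, assuming answers are keyed by 1-based item number:

```python
# A sketch of the neuroticism scoring rule described above: reverse-score
# items 1, 6, 21, 31 as (6 - Xi), sum the twelve items, then z-normalize
# each dimension's scores across the assessed sample.
import numpy as np

NEURO_ITEMS = [1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56]  # 1-based
NEURO_REVERSED = {1, 6, 21, 31}

def neuroticism_score(answers):
    """answers: dict mapping item number -> raw score in 1..5."""
    return sum((6 - answers[i]) if i in NEURO_REVERSED else answers[i]
               for i in NEURO_ITEMS)

def zscore(scores):
    """Standardize one dimension's scores across all assessed people."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

# The five standardized dimension scores (plus those of any other assessment
# tools) are then concatenated into the psychological feature vector.
```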
Step S103, acquiring background information of each category of people, and performing format adjustment on the background information to obtain background features of each category of people.
In one embodiment, the format adjustment of the background information includes:
extracting general demographic data and population-specific information data from the background information of each category of people;
and standardizing or encoding the general demographic data and the population-specific information data according to their data types.
Exemplarily, two kinds of data are extracted from the background information of different categories of people recorded in an information system. The first kind is general demographic data, mainly the general information of each person, such as gender, age, ethnicity, education, height and weight. The second kind is population-specific information data; for example, student-specific information: school, grade, class, academic scores, literature, teacher-student relationships, student-parent relationships, teachers' evaluations of students, and the like; for a criminal population: crime name, sentence length, penalty, attitude of confession, parole, family relationships, a brief description of the crime, and the like; for a patient population: disease type, biochemical test results, drug dosage, drug type, medical advice, medical records, and the like. The extracted data are standardized or encoded by data type.
Exemplarily, numerical data such as age, height, weight, subject performance and penalties are standardized with the Z-score algorithm. Categorical data such as gender, education, crime name, grade and disease category are one-hot encoded. Text data such as teachers' evaluations of students, brief crime descriptions and medical records are encoded with a BERT pre-trained model in the manner of natural language processing.
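A sketch of this encoding step using scikit-learn and Hugging Face Transformers. A recent scikit-learn (for `sparse_output`) and the `bert-base-chinese` checkpoint are assumptions; the patent does not name a specific BERT model or column layout.

```python
# A sketch of the format adjustment: z-score numeric fields, one-hot encode
# categorical fields, and embed free-text fields with a pretrained BERT.
import numpy as np
import torch
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from transformers import BertModel, BertTokenizer

scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def encode_background(numeric, categorical, texts):
    """numeric: (N, Dn) floats; categorical: (N, Dc) strings; texts: N strings."""
    num_feat = scaler.fit_transform(numeric)          # Z-score standardization
    cat_feat = encoder.fit_transform(categorical)     # one-hot encoding
    with torch.no_grad():
        toks = tokenizer(list(texts), padding=True, truncation=True,
                         max_length=128, return_tensors="pt")
        txt_feat = bert(**toks).pooler_output.numpy() # (N, 768) text embedding
    return np.concatenate([num_feat, cat_feat, txt_feat], axis=1)
```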
Step S104, processing the whole face image, M face part regions and L small face regions of each image to be detected, together with the psychological features and background features of each category of people, through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result.
In one embodiment, the multi-modal violence tendency recognition model includes a face module, and the face module includes a whole face image feature extraction network, a face part region feature extraction network and a small face region feature extraction network;
processing the whole face image, M face part regions and L small face regions of each image to be detected through the multi-modal violence tendency recognition model includes:
processing the whole face image of each image to be detected through the whole face image feature extraction network to extract a first whole face feature;
processing the M face part regions of each image to be detected through the face part region feature extraction network to extract a first face part region feature;
processing the L small face regions of each image to be detected through the small face region feature extraction network to extract a first small face region feature;
obtaining a face total output feature from the first whole face feature, the first face part region feature and the first small face region feature;
obtaining a first output feature map, a second output feature map and a third output feature map from the fourth Sefuse_Net sub-module of the whole face image feature extraction network, the face part region feature extraction network and the small face region feature extraction network, respectively;
and obtaining a face fusion feature from the first output feature map, the second output feature map and the third output feature map.
Referring to fig. 8, fig. 8 is a schematic structural diagram of the face module, which includes a whole face image feature extraction network, a face part region feature extraction network and a small face region feature extraction network. The inputs of the face module are the whole face image, the 9 face part regions and the 40 small face regions; the whole face image is input into the whole face image feature extraction network, the 9 face part regions into the face part region feature extraction network, and the 40 small face regions into the small face region feature extraction network. The face module has two outputs: the face total output feature and the face fusion feature.
In fig. 8, the whole face image feature extraction network contains a backbone network, and the whole face image is input into the backbone network to extract the whole face feature. The backbone network mainly consists of 4 downsampling sub-modules named Sefuse_Net, 5 convolution sub-modules named SSE_layer, and a global average pooling layer (GAP). The downsampling sub-module may also be called the Sefuse_Net sub-module, and the convolution sub-module the SSE_layer sub-module.
Referring to fig. 9, the Sefuse_Net sub-module mainly consists of 3 branches and an Se_layer. The 3 branches are: the 1st branch, an average pooling layer followed by a 1 x 1 convolutional layer; the 2nd branch, a 3 x 3 convolutional layer followed by a 3 x 3 convolutional layer; and the 3rd branch, a 3 x 3 convolutional layer followed by a 1 x 1 convolutional layer, where the 2nd and 3rd branches share one 3 x 3 convolutional layer. The Se_layer adopts an SENet-style structure to add channel attention to the Sefuse_Net sub-module; it consists of a global average pooling layer, a 1 x 1 convolutional layer and a Sigmoid function layer. By adopting a three-branch architecture, the Sefuse_Net sub-module increases the number of channels and the variety of convolution scales, helping the model learn and converge better and faster.
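A minimal PyTorch sketch of a Sefuse_Net sub-module as described (three branches with a shared 3 x 3 convolution, concatenation, then SE-style channel attention). The channel split, stride and reduction ratio are illustrative assumptions.

```python
# A sketch of the Sefuse_Net downsampling sub-module and its Se_layer.
import torch
import torch.nn as nn

class SeLayer(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.Mish(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))               # channel attention

class SefuseNet(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 3
        self.branch1 = nn.Sequential(nn.AvgPool2d(2), nn.Conv2d(c_in, c, 1))
        self.shared3x3 = nn.Conv2d(c_in, c, 3, stride=2, padding=1)  # shared by branches 2 and 3
        self.branch2_tail = nn.Conv2d(c, c, 3, padding=1)
        self.branch3_tail = nn.Conv2d(c, c_out - 2 * c, 1)
        self.act = nn.Mish()
        self.se = SeLayer(c_out)

    def forward(self, x):
        shared = self.act(self.shared3x3(x))
        out = torch.cat([self.branch1(x),
                         self.branch2_tail(shared),
                         self.branch3_tail(shared)], dim=1)
        return self.se(self.act(out))
```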
The SSE_layer sub-module also mainly consists of three branches: the 1st branch is a 1 x 1 convolutional layer; the 2nd branch is a 3 x 3 convolutional layer; the 3rd branch is an Se_layer. The SSE_layer sub-module increases the model depth to extract deeper image features. Both the Sefuse_Net and SSE_layer sub-modules use the Mish function as the activation function, which increases the nonlinearity of the model and helps extract nonlinear features.
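A matching sketch of the SSE_layer sub-module; the text names the three branches but not how they are merged, so the summation below is an assumption (it reuses `SeLayer` from the Sefuse_Net sketch above).

```python
# A sketch of the SSE_layer convolution sub-module: 1x1 branch, 3x3 branch,
# and an Se_layer branch, merged by summation (assumed merge operation).
import torch.nn as nn

class SSELayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 1)
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch3 = SeLayer(channels)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.branch1(x) + self.branch2(x) + self.branch3(x))
```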
Referring to fig. 8 again, the face part region feature extraction network and the small face region feature extraction network both use the same backbone network as the whole face image feature extraction network; the difference is that in these two networks, 2 fusion attention modules named Fuse_Net are added between the 4th Sefuse_Net sub-module and the 1st SSE_layer sub-module. The fusion attention module may also be called the Fuse_Net module.
Referring to fig. 10, taking the Fuse_Net module in the face part region feature extraction network as an example, the input of the Fuse_Net module has two sources: (1) the feature maps output by the fourth Sefuse_Net sub-module of the face part region feature extraction network, named "face part feature maps", 9 in number and denoted x_BP (H x W x C); (2) the feature map output by the fourth Sefuse_Net sub-module of the whole face image feature extraction network, cut into 9 corresponding face part feature maps, named "part feature maps cut from the whole face feature map" and denoted x_GBP (H x W x C). The "face part feature maps" and the "part feature maps cut from the whole face feature map" are input into the Fuse_Net module to obtain updated face part feature maps, denoted x_NBP (H x W x C). The meanings of the other symbols in fig. 10 are as follows: F_M denotes the attention feature extraction equation; X_M (H x W x 1) denotes the attention feature map; X_C (1 x 1 x 2C) denotes the extracted channel features; X_S (H x W x 2C) denotes the spatial-attention-weighted feature map; X_S1 (H x W x C) denotes compressed feature map 1; X_V (1 x 1 x 2C) denotes the channel attention vector; X'_S (H x W x 2C) denotes the channel-attention-weighted feature map; X_S2 (H x W x C) denotes compressed feature map 2; F_sq denotes the channel feature extraction equation; F_s denotes the channel compression equation; and F_ex denotes the channel attention extraction equation.
Similarly, the Fuse_Net module in the small face region feature extraction network has two input sources: (1) the feature maps output by the fourth Sefuse_Net sub-module of the small face region feature extraction network, named "small face region feature maps", 40 in number; (2) the feature map output by the fourth Sefuse_Net sub-module of the whole face image feature extraction network, cut into 40 corresponding small face region feature maps, named "small region feature maps cut from the whole face feature map". The two sets of feature maps are input into the Fuse_Net module to obtain updated small face region feature maps. The Fuse_Net module adopts a spatial attention mechanism and a channel attention mechanism, further focusing on the important features of the face parts and small face regions, and effectively supplements and corrects feature extraction deviations caused by face alignment and face key point positioning errors.
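The following loose sketch approximates the roles of the Fuse_Net operations in fig. 10 (spatial attention, channel attention and two compressions). It reuses `SeLayer` from the Sefuse_Net sketch above; the kernel sizes and the way the two compressed maps are combined into x_NBP are assumptions, since the exact equations F_M, F_sq, F_s and F_ex are only shown in the figure.

```python
# An assumption-laden sketch of the Fuse_Net fusion attention module.
import torch
import torch.nn as nn

class FuseNet(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(2 * channels, 1, 7, padding=3),
                                     nn.Sigmoid())           # F_M -> X_M (H x W x 1)
        self.compress1 = nn.Conv2d(2 * channels, channels, 1)  # F_s -> X_S1
        self.channel = SeLayer(2 * channels)                   # F_sq/F_ex -> X_V
        self.compress2 = nn.Conv2d(2 * channels, channels, 1)  # -> X_S2

    def forward(self, x_bp, x_gbp):
        x = torch.cat([x_bp, x_gbp], dim=1)      # concatenate to 2C channels
        x_s = x * self.spatial(x)                # spatial-attention-weighted X_S
        x_s_prime = self.channel(x_s)            # channel-attention-weighted X'_S
        # Combining the two compressed maps by summation is an assumption.
        return self.compress1(x_s) + self.compress2(x_s_prime)  # updated x_NBP
```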
In an embodiment, processing the M face part regions through the face part region feature extraction network to extract the first face part region feature includes:
processing the M face part regions of each image to be detected through the face part region feature extraction network to extract M face part region features;
averaging the extracted M face part region features to obtain the first face part region feature;
processing the L small face regions through the small face region feature extraction network to extract the first small face region feature includes:
processing the L small face regions of each image to be detected through the small face region feature extraction network to extract L small face region features;
averaging the extracted L small face region features to obtain the first small face region feature;
obtaining the face total output feature from the first whole face feature, the first face part region feature and the first small face region feature includes:
concatenating the first whole face feature, the first face part region feature and the first small face region feature to obtain the face total output feature;
obtaining the face fusion feature from the first output feature map, the second output feature map and the third output feature map includes:
passing the first output feature map, the second output feature map and the third output feature map through a global average pooling layer to obtain a second whole face feature, M second face part region features and L second small face region features, respectively;
averaging the M second face part region features and the L second small face region features to obtain a third face part region feature and a third small face region feature, respectively;
and concatenating the second whole face feature, the third face part region feature and the third small face region feature to obtain the face fusion feature.
Referring to fig. 8 again, after the face part region feature extraction network and the small face region feature extraction network have extracted features through the backbone network, the 9 face part region features and the 40 small face region features are each averaged (Avg) to obtain the face part region feature and the small face region feature, which are then concatenated with the whole face feature to obtain the face total output feature.
In fig. 8, the first, second and third output feature maps are obtained from the 4th Sefuse_Net sub-module of the whole face image feature extraction network, the face part region feature extraction network and the small face region feature extraction network, respectively. First, the three feature maps are passed through a global average pooling layer to obtain the whole face feature, 9 face part region features and 40 small face region features; then the 9 face part region features and the 40 small face region features are each averaged (Avg) to obtain the face part region feature and the small face region feature, which are concatenated with the whole face feature to obtain the face fusion feature.
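Both output paths reduce to the same assembly pattern, sketched below with illustrative tensor shapes:

```python
# A sketch of assembling a face-module output: average the 9 part-region
# features and the 40 small-region features, then concatenate them with
# the whole-face feature. Shapes are illustrative assumptions.
import torch

def face_output(whole_feat, part_feats, small_feats):
    """whole_feat: (B, D); part_feats: (B, 9, D); small_feats: (B, 40, D)."""
    part_avg = part_feats.mean(dim=1)    # Avg over the 9 face part regions
    small_avg = small_feats.mean(dim=1)  # Avg over the 40 small face regions
    return torch.cat([whole_feat, part_avg, small_avg], dim=1)  # (B, 3D)
```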
To sum up, the face module produces two output features: the face total output feature and the face fusion feature, which serve as two input features of the fusion module. The face total output feature represents the information of the face modality itself, while the face fusion feature is fused in the fusion module with the features of other modalities, such as the psychological and background features, feeding the relationships among the modalities into the model and improving its ability to recognize violence tendency. Similarly, the psychological module, the crowd background module and the other modality module all adopt a two-branch output design: one branch represents the modality's own features, and the other is used for fusing the features of all modalities.
In one embodiment, the multi-modal violence tendency recognition model further includes a psychological module and a crowd background module;
the psychological module includes a first fully connected network branch and a second fully connected network branch;
the crowd background module includes a third fully connected network branch and a fourth fully connected network branch;
processing the psychological features and background features of each category of people through the multi-modal violence tendency recognition model includes:
processing the psychological features through the first and second fully connected network branches to obtain a psychological total output feature and a psychological fusion feature, respectively;
and processing the background features through the third and fourth fully connected network branches to obtain a crowd background total output feature and a crowd background fusion feature, respectively.
In an embodiment, each of the first to sixth fully connected network branches includes a first fully connected layer and a second fully connected layer, the first fully connected layer being connected to the second fully connected layer.
Referring to fig. 11, a two-branch parallel architecture of two-layer fully connected (FC) networks is constructed: FC1 + FC2 form the first fully connected network branch, which extracts the psychological total output feature, whose dimension equals that of the face total output feature; FC3 + FC4 form the second fully connected network branch, which extracts the psychological fusion feature, whose dimension equals that of the face fusion feature.
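A sketch of this shared two-branch design, used alike by the psychological, crowd background and other modality modules; the hidden width and activation are illustrative assumptions.

```python
# A sketch of the two-branch, two-layer fully connected module: one branch
# emits the module's "total output feature", the other its "fusion feature".
import torch.nn as nn

class TwoBranchFC(nn.Module):
    def __init__(self, d_in, d_total, d_fusion, d_hidden=256):
        super().__init__()
        self.total_branch = nn.Sequential(   # e.g. FC1 + FC2
            nn.Linear(d_in, d_hidden), nn.Mish(), nn.Linear(d_hidden, d_total))
        self.fusion_branch = nn.Sequential(  # e.g. FC3 + FC4
            nn.Linear(d_in, d_hidden), nn.Mish(), nn.Linear(d_hidden, d_fusion))

    def forward(self, x):
        return self.total_branch(x), self.fusion_branch(x)
```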
Because the face features, psychological features and related crowd background features are all input into the network of the fusion module, the feature dimensions of the modules are unified.
Referring to fig. 12, FC5 + FC6 form the third fully connected network branch and FC7 + FC8 the fourth fully connected network branch. The third branch extracts the crowd background total output feature, with the same dimension as the face total output feature; the fourth branch extracts the related crowd background fusion feature, with the same dimension as the face fusion feature.
In this embodiment, like the outputs of the face module and the psychological module, the crowd background total output feature and the crowd background fusion feature also serve as two input features of the fusion module; since the face, psychological and related crowd background features are all input into the network of the fusion module, the feature dimensions of the modules are unified.
In one embodiment, the multi-modal violence tendency recognition model further includes a fusion module and a classification module, and processing the whole face image, M face part regions and L small face regions of each image to be detected, together with the psychological features and background features of each category of people, through the multi-modal violence tendency recognition model includes:
processing the face fusion feature, the psychological fusion feature and the crowd background fusion feature through the fusion module to obtain a first output feature, a second output feature and a third output feature, respectively; taking the feature mean of the first, second and third output features as an overall fusion output feature; concatenating the face total output feature, the psychological total output feature, the crowd background total output feature and the overall fusion output feature to obtain a final output feature, and inputting the final output feature into a fully connected layer to obtain a comprehensive feature;
classifying the comprehensive feature for violence tendency through the classification module, the violence tendency categories including: violent tendency and non-violent tendency.
In this way, the whole face image, M face part regions and L small face regions of each image to be detected, together with the psychological features and background features of each category of people, can all be considered in the violence tendency classification, improving its accuracy.
In addition, if other modality data exist, an other modality module can be added; the other modality fusion feature and other modality total output feature obtained through it are then combined with the whole face image, M face part regions, L small face regions, psychological features and background features for violence tendency classification, further improving accuracy. Other modality data may include audio data, behavior data and the like of each category of people, which are not limited herein.
In one embodiment, the multi-modal violence tendency recognition model further comprises other modality modules, a fusion module and a classification module;
the other modality module includes a fifth fully connected network branch and a sixth fully connected network branch;
the method further comprises the following steps:
calculating the other modality features through the fifth and sixth fully-connected network branches to obtain the other modality fusion feature and the other modality total output feature, respectively;

calculating the face fusion feature, the psychological fusion feature, the crowd background fusion feature and the other modality fusion feature through the fusion module to obtain a first output feature, a second output feature, a third output feature and a fourth output feature respectively; taking the feature mean of the first, second, third and fourth output features as the overall fusion output feature; connecting the face total output feature, the psychological total output feature, the crowd background total output feature, the other modality total output feature and the overall fusion output feature in series to obtain a final output feature, and inputting the final output feature into a fully connected layer to obtain a comprehensive feature;

carrying out violence tendency classification on the comprehensive feature through the classification module, wherein the violence tendency types comprise: violent tendency and non-violent tendency.
In this embodiment, the fusion module includes a graph convolutional neural network sub-model, and the sub-model includes J graph convolutional neural networks; the processing procedure of the fusion module is realized by the J graph convolutional neural networks.
It should be added that the other modality modules are extensible modules, which may be added or removed according to the actual situation; their number may likewise be determined according to the actual situation, for example 2 or 5, and is not limited herein. If the studied population has other modality data, other modality modules can be constructed with reference to the face module, the psychological module and the crowd background module; for example, an audio modality module and a behavior modality module can be constructed for crowd audio data and crowd behavior data, respectively. The other modality total output feature and other modality fusion feature corresponding to the other modality data are obtained through the other modality module, while ensuring that their form and dimensions are the same as those of the features generated by the face module, the psychological module and the other modules, so that they can be fused into the overall model. If no other modality data exist, the other modality module is not considered.
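As an illustration of this extensibility, the sketch below (reusing the hypothetical ModalityBranchModule above; all module names and dimensions are assumptions, not the patent's specification) registers optional modality modules in a dictionary so modalities can be added or removed per deployment:

```python
import torch.nn as nn

# A shared output dimension keeps all modality features fusable (256 assumed).
modality_modules = nn.ModuleDict({
    "psychological": ModalityBranchModule(in_dim=64, hidden_dim=128, out_dim=256),
    "background":    ModalityBranchModule(in_dim=32, hidden_dim=128, out_dim=256),
})

# If audio or behavior data exist for the studied population, add modules:
modality_modules["audio"] = ModalityBranchModule(in_dim=40, hidden_dim=128, out_dim=256)
modality_modules["behavior"] = ModalityBranchModule(in_dim=20, hidden_dim=128, out_dim=256)
```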
Referring to fig. 13, the face module, the psychological module, the crowd background module and the other modality module process the face picture, the psychological data, the crowd background data and the other modality data, respectively, yielding fusion features and total output features with consistent dimensions. The four fusion features are input as the four nodes of a graph convolutional neural network; the features in the 4 nodes are updated through 5 to 8 graph convolutional neural networks, so that the features of the face module, the psychological module, the crowd background module and the other modality module are fused to enhance feature expression.
The fusion module in fig. 13 includes J graph convolutional neural networks, and the first, second, third and fourth output features output by the J graph convolutional neural networks are averaged (Avg) to obtain the overall fusion output feature. Finally, the face total output feature, the psychological total output feature, the crowd background total output feature, the other modality total output feature and the overall fusion output feature are connected in series and sent into a fully connected layer (FC) for classification, identifying the violence tendency as violent or non-violent.
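For illustration only, the following is a minimal sketch of such a fusion-and-classification stage, assuming a simple dense graph convolution over the four modality nodes; the adjacency, activation and layer sizes are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphConv(nn.Module):
    # One graph-convolution layer over the modality nodes; a fixed,
    # uniformly weighted dense adjacency is an assumption — the patent
    # does not specify how the four nodes are connected.
    def __init__(self, dim: int, num_nodes: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim)
        self.register_buffer("adj", torch.ones(num_nodes, num_nodes) / num_nodes)

    def forward(self, x):                        # x: (batch, num_nodes, dim)
        return F.relu(self.weight(self.adj @ x))

class FusionClassifier(nn.Module):
    # J stacked graph convolutions update the per-modality fusion features;
    # the updated node features are averaged (Avg) into the overall fusion
    # output feature, connected in series with the per-modality total
    # output features, and sent to a fully connected layer (FC).
    def __init__(self, dim: int, num_nodes: int = 4, J: int = 6, num_classes: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(SimpleGraphConv(dim, num_nodes) for _ in range(J))
        self.fc = nn.Linear(dim * (num_nodes + 1), num_classes)

    def forward(self, fusion_feats, total_feats):
        x = fusion_feats                         # (batch, 4, dim): face/psych/background/other
        for layer in self.layers:
            x = layer(x)
        overall = x.mean(dim=1)                  # overall fusion output feature
        final = torch.cat([*total_feats, overall], dim=1)
        return self.fc(final)                    # violent vs non-violent logits

# Usage with assumed 256-dimensional features and batch size 2:
fusion = torch.randn(2, 4, 256)
totals = [torch.randn(2, 256) for _ in range(4)]
logits = FusionClassifier(dim=256)(fusion, totals)   # shape (2, 2)
```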
This embodiment provides a novel multi-modal general framework based on deep learning, which uses face images, psychological assessment data and crowd background information to evaluate violence tendency objectively and efficiently, providing an auxiliary violence recognition tool for different populations. The method combines overall face features with local features, uses prior knowledge as the basis for dividing the local face small regions, guides the extraction of the local face regions of interest, and thereby effectively integrates prior knowledge into the multi-modal violence tendency recognition model. Meanwhile, the relationships between different face part regions and local face small regions are integrated into the model as supplementary features, jointly improving the evaluation performance of the model. The method simultaneously considers psychological state measurement data and background information of different populations, analyzes violence tendency in different dimensions, and integrates multiple modality data into one multi-modal violence tendency recognition model; it can flexibly incorporate additional modality information, has better performance, robustness and generalization capability, and is applicable to a wide range of populations.
The multi-modal violence tendency recognition method obtains N images to be detected of various crowds, and obtains the whole face image, the M face part regions and the L face small regions of each image to be detected; obtains psychological assessment data of the various crowds, and obtains crowd psychological characteristics from the psychological assessment data; obtains crowd background information of the various crowds, and performs format adjustment processing on it to obtain crowd background features; and calculates the whole face image, the M face part regions, the L face small regions, the crowd psychological characteristics and the crowd background features of each image to be detected through the multi-modal violence tendency recognition model to obtain a violence tendency recognition result. The method is suitable for various crowds; by recognizing violence tendency on the basis of multi-modal data, it improves the objectivity and accuracy of the recognition as well as its robustness and generalization capability.
Example 2
In addition, an embodiment of the disclosure provides a multi-modal violence tendency recognition apparatus.

Referring to fig. 14, the multi-modal violence tendency recognition apparatus 1400 includes:
a first obtaining module 1401, configured to obtain N images to be detected of various types of people, and obtain the whole face image, the M face part regions and the L face small regions of each image to be detected of the various types of people;

a second obtaining module 1402, configured to obtain psychological assessment data of the various types of people, and obtain crowd psychological characteristics according to the psychological assessment data;

a third obtaining module 1403, configured to obtain crowd background information of the various types of people, and perform format adjustment processing on the crowd background information to obtain crowd background features;

and a recognition module 1404, configured to calculate, through the multi-modal violence tendency recognition model, the whole face image, the M face part regions, the L face small regions, the crowd psychological characteristics and the crowd background features of each image to be detected of the various types of people, so as to obtain a violence tendency recognition result.
In an embodiment, the first obtaining module 1401 is further configured to perform face detection on each image to be detected, so as to obtain a face frame and face key points of each image to be detected;

perform face alignment processing on each image to be detected according to its face key points, so as to obtain an aligned image of each image to be detected;

crop each aligned image according to the face frame and/or the face key points of the aligned image, so as to obtain the whole face image of each image to be detected;

divide the face region of each aligned image into face part regions according to the face key points of the aligned image, so as to obtain the M face part regions corresponding to the image to be detected;

and divide each whole face image into face small regions according to prior information on facial muscle regions and the face key points of the whole face image, so as to obtain the L face small regions corresponding to the image to be detected (a minimal sketch of this region division follows).
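The sketch below illustrates the region-division step in numpy, assuming landmarks from any standard 68-point detector are already available; the landmark grouping, margin ratio and helper names are illustrative assumptions, not the patent's exact rules:

```python
import numpy as np

def crop_region(image: np.ndarray, points: np.ndarray, margin: float = 0.1):
    # Crop an axis-aligned box around a group of landmarks; the margin
    # ratio is an assumption (the patent does not give exact crop rules).
    h, w = image.shape[:2]
    (x0, y0), (x1, y1) = points.min(axis=0), points.max(axis=0)
    mx, my = (x1 - x0) * margin, (y1 - y0) * margin
    x0, y0 = max(int(x0 - mx), 0), max(int(y0 - my), 0)
    x1, y1 = min(int(x1 + mx), w), min(int(y1 + my), h)
    return image[y0:y1, x0:x1]

# Hypothetical grouping of a standard 68-point landmark layout into face
# part regions (for illustration only; M = 4 here).
PART_REGIONS = {
    "left_eye":  range(36, 42),
    "right_eye": range(42, 48),
    "nose":      range(27, 36),
    "mouth":     range(48, 68),
}

def split_face_parts(aligned_image: np.ndarray, keypoints: np.ndarray):
    # keypoints: (68, 2) array produced by any face landmark detector.
    return {name: crop_region(aligned_image, keypoints[list(idx)])
            for name, idx in PART_REGIONS.items()}
```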
In an embodiment, the second obtaining module 1402 is further configured to calculate multi-dimensional psychological scores of the psychological assessment data of each type of people according to the scoring rules of the corresponding psychological assessment tools;

and standardize the multi-dimensional psychological scores to obtain psychological multi-dimensional features of the various types of people, and connect these psychological multi-dimensional features in series to obtain the crowd psychological characteristics.

The third obtaining module 1403 is further configured to extract general demographic data and population-specific information data of the various types of people from the crowd background information;

and standardize or encode the general demographic data and the population-specific information data according to their data types (a minimal sketch of this step follows).
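For illustration, the following numpy sketch shows one way to standardize continuous fields and encode categorical fields before concatenation; the field names, category lists and the z-score/one-hot choice are assumptions:

```python
import numpy as np

def standardize_scores(scores):
    # Z-score standardization of multi-dimensional psychological scores;
    # columns are scale dimensions, rows are subjects.
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)

def one_hot(values, categories):
    # Simple one-hot encoding for categorical background fields
    # (e.g. education level); the category lists are assumptions.
    idx = [categories.index(v) for v in values]
    out = np.zeros((len(values), len(categories)))
    out[np.arange(len(values)), idx] = 1.0
    return out

# Example: continuous fields are standardized, categorical fields encoded,
# then everything is concatenated into one background feature vector.
ages = standardize_scores([[25], [31], [47]])
edu = one_hot(["primary", "college", "primary"],
              ["primary", "secondary", "college"])
background_features = np.concatenate([ages, edu], axis=1)
```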
In one embodiment, the multi-modal violence tendency recognition model comprises a face module, wherein the face module comprises a whole face image feature extraction network, a face part region feature extraction network and a face small region feature extraction network;

the recognition module 1404 is further configured to calculate the whole face image of each image to be detected of each crowd through the whole face image feature extraction network, so as to extract a first overall face feature;

calculate the M face part regions of each image to be detected of each crowd through the face part region feature extraction network, so as to extract a first face part region feature;

calculate the L face small regions of each image to be detected of each crowd through the face small region feature extraction network, so as to extract a first face small region feature;

obtain the face total output feature according to the first overall face feature, the first face part region feature and the first face small region feature;

obtain a first output feature map, a second output feature map and a third output feature map from the fourth Sefuse_Net submodule of the whole face image feature extraction network, the face part region feature extraction network and the face small region feature extraction network, respectively;

and obtain the face fusion feature according to the first output feature map, the second output feature map and the third output feature map.
In an embodiment, the recognition module 1404 is further configured to calculate the M face part regions of each image to be detected of each type of people through the face part region feature extraction network, so as to extract the M face part region features of each image to be detected;

average the extracted face part region features to obtain the first face part region feature;

calculate the L face small regions of each image to be detected of each type of people through the face small region feature extraction network, so as to extract the L face small region features of each image to be detected;

average the extracted face small region features to obtain the first face small region feature;

connect the first overall face feature, the first face part region feature and the first face small region feature in series to obtain the face total output feature;

pass the first output feature map, the second output feature map and the third output feature map through a global average pooling layer to obtain a second overall face feature, M second face part region features and L second face small region features, respectively;

average the M second face part region features and the L second face small region features to obtain a third face part region feature and a third face small region feature, respectively;

and connect the second overall face feature, the third face part region feature and the third face small region feature in series to obtain the face fusion feature (a minimal sketch of this aggregation follows).
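The following PyTorch sketch illustrates the two aggregations, assuming all pooled features share one dimension D and the feature maps from the fourth Sefuse_Net submodule have shape (batch, channels, height, width); the function names are illustrative assumptions:

```python
import torch

def face_total_output(overall_feat, part_feats, small_feats):
    # overall_feat: (B, D); part_feats: (B, M, D); small_feats: (B, L, D).
    # Face total output feature = series connection of the overall feature
    # with the means of the part-region and small-region features.
    return torch.cat([overall_feat,
                      part_feats.mean(dim=1),
                      small_feats.mean(dim=1)], dim=1)

def face_fusion_feature(fmap_overall, fmaps_parts, fmaps_smalls):
    # Each feature map (B, C, H, W) from the fourth Sefuse_Net submodule
    # passes through global average pooling (GAP); the M part-region and
    # L small-region GAP features are then averaged, and the three results
    # are connected in series.
    gap = lambda t: t.mean(dim=(-2, -1))
    second_overall = gap(fmap_overall)
    third_parts = torch.stack([gap(f) for f in fmaps_parts]).mean(dim=0)
    third_smalls = torch.stack([gap(f) for f in fmaps_smalls]).mean(dim=0)
    return torch.cat([second_overall, third_parts, third_smalls], dim=1)
```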
In one embodiment, the multi-modal violence propensity recognition model further comprises: a psychological module and a crowd background module;
the psychology module comprises a first fully connected network branch and a second fully connected network branch;
the crowd background module comprises a third fully connected network branch and a fourth fully connected network branch;
the recognition module 1404 is further configured to calculate the crowd psychological characteristics through the first and second fully-connected network branches, so as to obtain the psychological total output feature and the psychological fusion feature, respectively;

and calculate the crowd background features through the third and fourth fully-connected network branches, so as to obtain the crowd background total output feature and the crowd background fusion feature, respectively.
In one embodiment, the multi-modal violence tendency recognition model further comprises other modality modules, a fusion module and a classification module;
the other modality module comprises a fifth fully connected network branch and a sixth fully connected network branch;
the recognition module 1404 is further configured to calculate the other modality features through the fifth and sixth fully-connected network branches, so as to obtain the other modality fusion feature and the other modality total output feature, respectively;

calculate the face fusion feature, the psychological fusion feature, the crowd background fusion feature and the other modality fusion feature through the fusion module to obtain a first output feature, a second output feature, a third output feature and a fourth output feature respectively; take the feature mean of the first, second, third and fourth output features as the overall fusion output feature; connect the face total output feature, the psychological total output feature, the crowd background total output feature, the other modality total output feature and the overall fusion output feature in series to obtain a final output feature, and input the final output feature into a fully connected layer to obtain a comprehensive feature;

and carry out violence tendency classification on the comprehensive feature through the classification module, wherein the violence tendency types comprise: violent tendency and non-violent tendency.
In an embodiment, each of the first, second, third, fourth, fifth and sixth fully-connected network branches comprises a first fully connected layer and a second fully connected layer, the first fully connected layer being connected with the second fully connected layer.
The multi-modal violence tendency recognition apparatus 1400 provided in this embodiment can implement the multi-modal violence tendency recognition method provided in embodiment 1, which is not repeated here to avoid redundancy.

The multi-modal violence tendency recognition apparatus obtains N images to be detected of various crowds, and obtains the whole face image, the M face part regions and the L face small regions of each image to be detected; obtains psychological assessment data of the various crowds, and obtains crowd psychological characteristics from the psychological assessment data; obtains crowd background information of the various crowds, and performs format adjustment processing on it to obtain crowd background features; and calculates the whole face image, the M face part regions, the L face small regions, the crowd psychological characteristics and the crowd background features of each image to be detected through the multi-modal violence tendency recognition model to obtain a violence tendency recognition result. The apparatus is suitable for various crowds; by recognizing violence tendency on the basis of multi-modal data, it improves the objectivity and accuracy of the recognition as well as its robustness and generalization capability.
Example 3
Furthermore, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, wherein the memory stores a computer program which, when run on the processor, executes the multi-modal violence tendency recognition method provided in embodiment 1.

The electronic device provided in this embodiment can implement the multi-modal violence tendency recognition method provided in embodiment 1, which is not repeated here to avoid redundancy.
Example 4
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal violence tendency recognition method provided in embodiment 1.
In this embodiment, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The computer-readable storage medium provided in this embodiment can implement the multi-modal violence tendency recognition method provided in embodiment 1, which is not repeated here to avoid redundancy.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A multi-modal violence tendency recognition method, the method comprising:

acquiring N images to be detected of various crowds, and acquiring a whole face image, M face part regions and L face small regions of each image to be detected of the various crowds;

acquiring psychological assessment data of the various crowds, and acquiring crowd psychological characteristics according to the psychological assessment data;

acquiring crowd background information of the various crowds, and performing format adjustment processing on the crowd background information to obtain crowd background features;

and calculating the whole face image, the M face part regions and the L face small regions of each image to be detected of the various crowds, together with the crowd psychological characteristics and the crowd background features, through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result.
2. The method according to claim 1, wherein the acquiring of the whole face image, the M face part regions and the L face small regions of each image to be detected of the various crowds comprises:

performing face detection on each image to be detected to obtain a face frame and face key points of each image to be detected;

performing face alignment processing on each image to be detected according to its face key points to obtain an aligned image of each image to be detected;

cropping each aligned image according to the face frame and/or the face key points of the aligned image to obtain the whole face image of each image to be detected;

dividing the face region of each aligned image into face part regions according to the face key points of the aligned image to obtain the M face part regions corresponding to the image to be detected;

and dividing each whole face image into face small regions according to prior information on facial muscle regions and the face key points of the whole face image to obtain the L face small regions corresponding to the image to be detected.
3. The method according to claim 1, wherein the acquiring of the crowd psychological characteristics according to the psychological assessment data comprises:

calculating multi-dimensional psychological scores of the psychological assessment data of the various crowds according to the scoring rules of the corresponding psychological assessment tools;

standardizing the multi-dimensional psychological scores to obtain psychological multi-dimensional features of the various crowds, and connecting the psychological multi-dimensional features in series to obtain the crowd psychological characteristics;

the format adjustment processing of the crowd background information comprises:

extracting general demographic data and population-specific information data of the various crowds from the crowd background information;

and standardizing or encoding the general demographic data and the population-specific information data according to their data types.
4. The method of claim 1, wherein the multi-modal violence tendency recognition model comprises a face module, the face module comprising a whole face image feature extraction network, a face part region feature extraction network and a face small region feature extraction network;

the calculating of the whole face image, the M face part regions and the L face small regions of each image to be detected of the various crowds through the multi-modal violence tendency recognition model comprises:

calculating the whole face image of each image to be detected of each crowd through the whole face image feature extraction network to extract a first overall face feature;

calculating the M face part regions of each image to be detected of the various crowds through the face part region feature extraction network to extract a first face part region feature;

calculating the L face small regions of each image to be detected of the various crowds through the face small region feature extraction network to extract a first face small region feature;

acquiring a face total output feature according to the first overall face feature, the first face part region feature and the first face small region feature;

acquiring a first output feature map, a second output feature map and a third output feature map from the fourth Sefuse_Net submodule of the whole face image feature extraction network, the face part region feature extraction network and the face small region feature extraction network, respectively;

and acquiring a face fusion feature according to the first output feature map, the second output feature map and the third output feature map.
5. The method according to claim 4, wherein the calculating of the M face part regions of each image to be detected of the various crowds through the face part region feature extraction network to extract the first face part region feature comprises:

calculating the M face part regions of each image to be detected of the various crowds through the face part region feature extraction network, so as to extract the M face part region features of each image to be detected;

averaging the extracted face part region features to obtain the first face part region feature;

the calculating of the L face small regions of each image to be detected of the various crowds through the face small region feature extraction network to extract the first face small region feature comprises:

calculating the L face small regions of each image to be detected of the various crowds through the face small region feature extraction network, so as to extract the L face small region features of each image to be detected;

averaging the extracted face small region features to obtain the first face small region feature;

the acquiring of the face total output feature according to the first overall face feature, the first face part region feature and the first face small region feature comprises:

connecting the first overall face feature, the first face part region feature and the first face small region feature in series to obtain the face total output feature;

the acquiring of the face fusion feature according to the first output feature map, the second output feature map and the third output feature map comprises:

passing the first output feature map, the second output feature map and the third output feature map through a global average pooling layer to obtain a second overall face feature, M second face part region features and L second face small region features, respectively;

averaging the M second face part region features and the L second face small region features to obtain a third face part region feature and a third face small region feature, respectively;

and connecting the second overall face feature, the third face part region feature and the third face small region feature in series to obtain the face fusion feature.
6. The method of claim 4, wherein the multi-modal violence tendency recognition model further comprises: a psychological module and a crowd background module;

the psychological module comprises a first fully-connected network branch and a second fully-connected network branch;

the crowd background module comprises a third fully-connected network branch and a fourth fully-connected network branch;

the calculating of the crowd psychological characteristics and the crowd background features through the multi-modal violence tendency recognition model comprises:

calculating the crowd psychological characteristics through the first and second fully-connected network branches to obtain a psychological total output feature and a psychological fusion feature, respectively;

and calculating the crowd background features through the third and fourth fully-connected network branches to obtain a crowd background total output feature and a crowd background fusion feature, respectively.
7. The method of claim 6, wherein the multi-modal violence tendency recognition model further comprises an other modality module, a fusion module and a classification module;

the other modality module comprises a fifth fully-connected network branch and a sixth fully-connected network branch;

the method further comprises:

calculating other modality features through the fifth and sixth fully-connected network branches to obtain an other modality fusion feature and an other modality total output feature, respectively;

calculating the face fusion feature, the psychological fusion feature, the crowd background fusion feature and the other modality fusion feature through the fusion module to obtain a first output feature, a second output feature, a third output feature and a fourth output feature respectively; taking the feature mean of the first, second, third and fourth output features as an overall fusion output feature; connecting the face total output feature, the psychological total output feature, the crowd background total output feature, the other modality total output feature and the overall fusion output feature in series to obtain a final output feature, and inputting the final output feature into a fully connected layer to obtain a comprehensive feature;

and carrying out violence tendency classification on the comprehensive feature through the classification module, wherein the violence tendency types comprise: violent tendency and non-violent tendency.
8. A multi-modal violence tendency recognition apparatus, the apparatus comprising:

a first obtaining module, configured to obtain N images to be detected of various crowds, and obtain a whole face image, M face part regions and L face small regions of each image to be detected of the various crowds;

a second obtaining module, configured to obtain psychological assessment data of the various crowds, and obtain crowd psychological characteristics according to the psychological assessment data;

a third obtaining module, configured to obtain crowd background information of the various crowds, and perform format adjustment processing on the crowd background information to obtain crowd background features;

and a recognition module, configured to calculate the whole face image, the M face part regions and the L face small regions of each image to be detected of the various crowds, together with the crowd psychological characteristics and the crowd background features, through a multi-modal violence tendency recognition model to obtain a violence tendency recognition result.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, performs the multi-modal violence tendency recognition method of any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program which, when run on a processor, executes the multi-modal violence tendency recognition method of any one of claims 1 to 7.
CN202211503571.8A 2022-11-29 2022-11-29 Multimode violence tendency recognition method, device, equipment and medium Active CN115601819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211503571.8A CN115601819B (en) 2022-11-29 2022-11-29 Multimode violence tendency recognition method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN115601819A true CN115601819A (en) 2023-01-13
CN115601819B CN115601819B (en) 2023-04-07

Family

ID=84852193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211503571.8A Active CN115601819B (en) 2022-11-29 2022-11-29 Multimode violence tendency recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115601819B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824049A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded neural network-based face key point detection method
CN106407935A (en) * 2016-09-21 2017-02-15 俞大海 Psychological test method based on face images and eye movement fixation information
CN110507335A (en) * 2019-08-23 2019-11-29 山东大学 Inmate's psychological health states appraisal procedure and system based on multi-modal information
WO2021098799A1 (en) * 2019-11-20 2021-05-27 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Face detection device, method and face unlock system
CN112185493A (en) * 2020-08-26 2021-01-05 山东大学 Personality preference diagnosis device and project recommendation system based on same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO Ningning: "Research on Emotion Recognition Based on Multi-modal Information Fusion", Northeastern University *

Also Published As

Publication number Publication date
CN115601819B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant