CN115240127A - Smart television-oriented child monitoring method - Google Patents

Smart television-oriented child monitoring method

Info

Publication number
CN115240127A
Authority
CN
China
Prior art keywords: feature, expression, features, network, intra
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210623199.8A
Other languages
Chinese (zh)
Inventor
林盛鑫
刘华珠
陈雪芳
赵晓芳
欧超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Priority to CN202210623199.8A priority Critical patent/CN115240127A/en
Publication of CN115240127A publication Critical patent/CN115240127A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/174: Facial expression recognition
    • H04N 21/4415: Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N 21/44218: Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H04N 21/4788: Supplemental services communicating with other users, e.g. chatting

Abstract

The invention discloses a child monitoring method oriented to a smart television, comprising an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child. While the smart television is playing, if the child jumps around on a sofa or table and may be in danger, the system sends a safety reminder to the parents through the companion app. If the child sees frightening content while watching a program, the algorithm automatically recognizes the child's expression, switches the program or turns off the television, and at the same time the system notifies the parents through the app. If the child falls down in the living room for some reason, the system recognizes the fall and makes a further judgement: if the child's expression is pained or the child is crying, the app immediately sends an alarm to the parents; if the child does not cry after the fall, or the expression is normal, only a record is made in the system, and the parents can review the child's fall records in the fall monitoring module of the app.

Description

Smart television-oriented child monitoring method
Technical Field
The invention relates to an image recognition method, in particular to a child monitoring method oriented to a smart television, and belongs to the field of image processing.
Background
Smart furniture builds on modern furniture and integrates combination intelligence, electronic intelligence, mechanical intelligence and Internet-of-Things intelligence into furniture products, making furniture intelligent, international and fashionable and home life more convenient and comfortable. Smart furniture is an important component of a modern lifestyle and a development trend of the international home of the future.
The smart television is a television product based on Internet technology: it carries an open operating system and chip and an open application platform, can realize bidirectional human-machine interaction, and integrates audio and video, entertainment, data and other functions to meet the diversified and personalized needs of users. Its purpose is to bring a more convenient experience to the user.
The smart television is a typical representative of the smart home, a new generation of television that keeps growing and advancing with the times. What matters most for the smart television is that it must carry a fully open platform; only through a fully open platform can consumers actively participate in defining the functions of the television, so that "demand customization" and "entertainment" of the television can be realized, and this is the only effective way to develop intelligent televisions.
Children are naturally active; if a child sees inappropriate content on the television, their physical and psychological health can be affected. Meanwhile, the living room is the child's main activity space: if the child falls down in the living room and the parents do not notice in time, treatment can be delayed and the child may suffer secondary injury.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a child monitoring method oriented to a smart television, in which the smart television automatically runs an expression recognition algorithm and a fall detection algorithm on the child, so that, according to the recognition results, the system can remind the parents through the app or make a record.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A child monitoring method oriented to a smart television is characterized in that: the method comprises an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child;
the flow of the expression recognition algorithm is as follows:
step one: a ResNet-18 backbone network extracts basic CNN (convolutional neural network) features;
step two: a feature decomposition network (FDN) decomposes the basic features into a set of facial-action-aware latent features;
step three: a feature reconstruction network (FRN) learns an intra-feature relation weight and an inter-feature relation weight for each latent feature and then constructs the overall expression feature;
step four: an expression prediction network (EPN) predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows (a sketch of how the two algorithms could be chained is given after this list):
step one: take the whole image as the input of a CNN (convolutional neural network) for joint prediction;
step two: compute confidence maps for body-part detection;
step three: compute part affinity fields (PAFs) for body-part association;
step four: the parsing step performs a set of bipartite matchings to associate body-part candidates;
step five: assemble them into full-body poses for all people in the image.
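To make the behaviour described in the two step lists concrete, the following minimal Python sketch shows how the outputs of the two algorithms could be combined into the monitoring flow of the invention. All class and function names (FrameResult, ParentApp, monitor_step) and the label strings are illustrative assumptions, not part of the claimed method.

```python
"""Hypothetical glue logic for the monitoring flow; names and labels are assumptions."""
from dataclasses import dataclass


@dataclass
class FrameResult:
    expression: str      # e.g. "happy", "fear", "pain", "cry", "neutral"
    fall_detected: bool  # output of the pose-based fall detection algorithm


class ParentApp:
    """Stand-in for the companion app; a real system would push notifications."""

    def notify(self, msg: str) -> None:
        print("[notify]", msg)

    def alarm(self, msg: str) -> None:
        print("[ALARM]", msg)

    def log_fall(self) -> None:
        print("[log] fall recorded for later review in the fall monitoring module")


def monitor_step(result: FrameResult, app: ParentApp) -> None:
    """Apply the decision rules described in the abstract to one analysed frame."""
    if result.expression == "fear":
        # Frightening content: the TV switches the program or turns off, parents are told.
        app.notify("Child appears frightened; program switched.")
    if result.fall_detected:
        if result.expression in ("pain", "cry"):
            app.alarm("Child fell and appears to be hurt.")
        else:
            app.log_fall()


if __name__ == "__main__":
    monitor_step(FrameResult(expression="cry", fall_detected=True), ParentApp())
```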
Further, step two of the expression recognition algorithm flow is specifically as follows: for a given i-th face image, the basic feature extracted by the ResNet-18 backbone network is defined as x_i ∈ R^P, where P denotes the dimension of the basic feature. The feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features. Let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} is the j-th latent feature of the i-th face image, D is the dimension of each latent feature, and M is the number of latent features;
the method extracts each latent feature with a fully connected layer followed by an activation function:
l_{i,j} = \sigma_1(W_{1,j}^T x_i) for j = 1, 2, …, M,    (1)
where W_{1,j} is the weight parameter of the fully connected layer and \sigma_1 is the ReLU activation function;
the compact loss learns a center for each type of latent feature and computes the distance between these latent features and their corresponding centers:
\mathcal{L}_C = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{M} \| l_{i,j} - c_j \|_2^2,    (2)
where N is the number of pictures in one mini-batch and c_j is the center of the j-th type of latent feature.
Further, step three of the expression recognition algorithm is specifically as follows: the feature reconstruction network (FRN), which obtains the different expression features, comprises an intra-feature relation modeling module (Intra-RM). The Intra-RM is composed of several intra-feature relation modeling blocks that model the relations among the elements of each latent feature; each block consists of a fully connected layer and a sigmoid activation function:
\tilde{\alpha}_{i,j} = \sigma_2(W_{2,j}^T l_{i,j}),    (3)
where \tilde{\alpha}_{i,j} is the relation response of the j-th latent feature of the i-th face image, W_{2,j} is the parameter of the fully connected layer, and \sigma_2 is the sigmoid activation function;
from expression (3), the intra-feature relation weight \alpha_{i,j}, which determines the importance of the j-th latent feature, is obtained as the L1 norm of the response:
\alpha_{i,j} = \| \tilde{\alpha}_{i,j} \|_1 for j = 1, 2, …, M;    (4)
the distribution loss module learns a center for each expression category and computes the distance between the intra-feature relation weight vectors (Intra-Ws) of the same category and the corresponding center; assuming the i-th face image belongs to the k_i-th expression category, the distribution loss is:
\mathcal{L}_D = \frac{1}{2N} \sum_{i=1}^{N} \| w_i - u_{k_i} \|_2^2,    (5)
where w_i = [\alpha_{i,1}, \alpha_{i,2}, …, \alpha_{i,M}]^T is the Intra-W vector of the i-th face picture and u_{k_i} is the category center of the k_i-th expression category;
the balance loss module balances the distribution of the Intra-W vectors:
\mathcal{L}_B = \frac{1}{2} \| \bar{w} - \hat{w} \|_2^2,    (6)
where \bar{w} is the vector formed by the means of the Intra-Ws over the mini-batch samples and \hat{w} is a uniformly distributed weight vector;
after the Intra-W of each latent feature is calculated, the weight is multiplied by the corresponding latent feature to obtain the intra-aware feature of the i-th picture:
f_{i,j} = \alpha_{i,j} l_{i,j} for j = 1, 2, …, M,    (7)
where f_{i,j} is the j-th intra-aware feature of the i-th picture.
Further, step three of the expression recognition algorithm also comprises the following: the feature reconstruction network (FRN) contains an inter-feature relation modeling module (Inter-RM), which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among them. First, f_{i,j} is fed into an information network for feature encoding; the information network consists of a fully connected layer and a ReLU activation function:
g_{i,j} = \sigma_1(W_{3,j}^T f_{i,j}),    (8)
where W_{3,j} is the weight parameter of the fully connected layer, \sigma_1 is the ReLU activation function, and g_{i,j} is the j-th relation information of the i-th picture;
the relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns are taken as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between the different pieces of relation information; the Inter-W \omega_i(j, m) represents the importance of the relation between node g_{i,j} and node g_{i,m} and is calculated as:
\omega_i(j, m) = \sigma_3\big( S(g_{i,j}, g_{i,m}) \big),    (9)
where g_{i,j} and g_{i,m} are the j-th and m-th relation information of the i-th picture, S is a function evaluating the similarity of g_{i,j} and g_{i,m} (here the Euclidean distance function), and, since the results of S are non-negative, the tanh function \sigma_3 is used to normalize the distance value to the interval [0, 1);
according to equation (9), the Inter-W matrix W_i = \{ \omega_i(j, m) \} is obtained.
The j-th inter-aware feature of the i-th picture can then be expressed as:
\hat{f}_{i,j} = \sum_{m=1}^{M} \omega_i(j, m) \, g_{i,m};    (10)
the j-th inter-aware feature is combined with the j-th intra-aware feature to obtain the j-th combined feature of the i-th face image:
\tilde{f}_{i,j} = f_{i,j} + \delta \hat{f}_{i,j},    (11)
where \delta is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the combined features are accumulated to obtain the final expression feature:
y_i = \sum_{j=1}^{M} \tilde{f}_{i,j},    (12)
where y_i is the expression feature of the i-th face picture.
Further, the ResNet-18 backbone network, the feature decomposition network (FDN), the feature reconstruction network (FRN) and the expression prediction network (EPN) are trained uniformly in an end-to-end manner, and the whole network minimizes the following joint loss function:
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_C + \lambda_2 \mathcal{L}_B + \lambda_3 \mathcal{L}_D,    (13)
where \mathcal{L}_{cls}, \mathcal{L}_C, \mathcal{L}_B and \mathcal{L}_D denote the classification loss, the compact loss, the balance loss and the distribution loss respectively; the cross-entropy loss is used as the classification loss, and \lambda_1, \lambda_2, \lambda_3 are regularization coefficients.
Further, step one of the fall detection algorithm is the convolutional neural network: the network architecture iteratively predicts the part affinity fields, which encode keypoint-to-keypoint associations, and detects the keypoint confidence maps.
Further, step two of the fall detection algorithm is as follows: the image is first analyzed by a CNN (convolutional neural network) to generate a set of feature maps F that is input to the first stage. At this stage the network produces a set of part affinity fields (PAFs), L^1 = \phi^1(F), where \phi^1 denotes the CNN used for inference at stage 1. In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to refine the predictions:
L^t = \phi^t(F, L^{t-1}), for 2 \le t \le T_P,
where \phi^t is the CNN used for inference at stage t and T_P is the total number of PAF stages. After T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction:
S^{T_P} = \rho^t(F, L^{T_P}), for t = T_P,
S^t = \rho^t(F, L^{T_P}, S^{t-1}), for T_P < t \le T_P + T_C,
where \rho^t denotes the CNN used for inference at stage t and T_C is the total number of confidence map stages;
a loss function is applied at the end of each stage, and the loss functions are weighted spatially; specifically, the loss function of the PAF branch at stage t_i and the loss function of the confidence map branch at stage t_k are:
f_L^{t_i} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^{t_i}(p) - L_c^*(p) \|_2^2,    (5)
f_S^{t_k} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^{t_k}(p) - S_j^*(p) \|_2^2,    (6)
where L_c^* is the ground-truth PAF, S_j^* is the ground-truth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask avoids penalizing true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by periodically replenishing the gradient, and the overall objective is:
f = \sum_{t=1}^{T_P} f_L^t + \sum_{t=T_P+1}^{T_P+T_C} f_S^t.
Further, the third specific step of the fall detection algorithm is the keypoint detection confidence map. In order to evaluate f_S in equation (6) during training, a ground-truth confidence map S^* is generated from the annotated two-dimensional keypoints. Each confidence map is a 2D representation of the belief that a particular body part is located at any given pixel. A confidence map S_{j,k}^* is generated for each person k; let x_{j,k} be the ground-truth position of body keypoint j of person k in the image, then the value at position p in S_{j,k}^* is defined as:
S_{j,k}^*(p) = \exp\left( - \frac{\| p - x_{j,k} \|_2^2}{\sigma^2} \right),    (7)
where \sigma controls the spread of the peak. The ground-truth confidence map predicted by the network is the aggregation of the individual confidence maps via a max operator:
S_j^*(p) = \max_k S_{j,k}^*(p).    (8)
further, the specific step of the fall detection algorithm is PAFs (part affinity fields) for keypoint association, given a set of detected body parts, each pair of body part detections has associated confidence, i.e. they belong to the same person, and the PAFs (part affinity fields) stores position and direction information of the entire limb support area.
Furthermore, specific step five of the fall detection algorithm selects a minimum number of edges to obtain a spanning-tree skeleton of the human pose, and a greedy relaxation method is provided that consistently produces high-quality matches. By supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, two classes, "fall" or "no fall", are predicted, thereby enhancing the human pose estimation; five temporal and spatial features are extracted from the pose and processed by an LSTM classifier.
Compared with the prior art, the invention has the following advantages and effects:
1. The invention provides a child monitoring method oriented to a smart television in which the expression recognition algorithm first decomposes the face picture into different facial actions, takes the facial actions as basic units, and then fuses them according to the relations between the different facial actions to obtain the emotional expression of the whole face picture.
2. The invention provides a child monitoring method oriented to a smart television in which the fall detection algorithm is based on OpenPose human action recognition and predicts two classes, "fall" or "no fall", by supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, thereby enhancing the human pose estimation. From the pose, five temporal and spatial features are extracted and processed by an LSTM classifier.
3. Through the expression recognition algorithm and the fall detection algorithm, when the child jumps around on a sofa or table and danger may occur, the system sends a safety reminder to the parents through the app.
4. Through the expression recognition algorithm and the fall detection algorithm, if the child sees frightening content while watching a program, the algorithm automatically recognizes the child's expression, switches the program or turns off the television, and at the same time the system notifies the parents through the app.
5. Through the expression recognition algorithm and the fall detection algorithm, when the child falls down for some reason within the living room (for example, from running and jumping around, or a wet and slippery floor), the system recognizes the fall and makes a further judgement: if the child's expression is pained or the child is crying, the app immediately sends an alarm to the parents; if the child does not cry after the fall or the expression is normal, only a record is made in the system, and the parents can review the child's fall records through the fall monitoring module in the app.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a frame diagram of an expression recognition algorithm of the present invention;
FIG. 3 is a first facial action expression feature result graph of the present invention;
FIG. 4 is a second facial motion expression feature diagram of the present invention;
fig. 5 is a general flow chart of a fall detection method of the invention;
fig. 6 is a network architecture diagram of a fall detection algorithm of the invention;
fig. 7 is a body part map of fall detection of the invention;
fig. 8 is an arm limb diagram of fall detection of the present invention;
fig. 9 is a body posture diagram of the fall detection algorithm of the invention.
Detailed Description
The present invention is further illustrated by the following examples, which are illustrative and are not to be construed as limiting the invention.
The child monitoring method oriented to the smart television, as shown in fig. 1, comprises an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child;
the flow of the expression recognition algorithm is as follows:
step one: a ResNet-18 backbone network extracts basic CNN (convolutional neural network) features;
step two: a feature decomposition network (FDN) decomposes the basic features into a set of facial-action-aware latent features;
step three: a feature reconstruction network (FRN) learns an intra-feature relation weight and an inter-feature relation weight for each latent feature and then constructs the overall expression feature;
step four: an expression prediction network (EPN) predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows:
step one: take the whole image as the input of a CNN (convolutional neural network) for joint prediction;
step two: compute confidence maps for body-part detection;
step three: compute part affinity fields (PAFs) for body-part association;
step four: the parsing step performs a set of bipartite matchings to associate body-part candidates;
step five: assemble them into full-body poses for all people in the image.
In the expression recognition algorithm, different emotions contain the same facial actions while the combinations of facial actions differ between emotions, for example between surprise, fear, anger, happiness and sadness. In order to better mine the relation between facial actions and facial expressions, the expression recognition algorithm first decomposes the face picture into different facial actions, takes the facial actions as basic units, and then fuses them according to the relations between the different facial actions to obtain the emotional expression of the whole face picture. As shown in fig. 2-4, the expression recognition algorithm uses a facial expression recognition method based on feature decomposition and reconstruction learning, which learns the shared information and the specific information in the expression information respectively, so as to extract more salient expression features.
As shown in fig. 5-9, the OpenPose human pose recognition project is an open-source library developed by Carnegie Mellon University (CMU) in the United States, based on convolutional neural networks and supervised learning and built on the Caffe framework. It can estimate human body actions, facial expressions, finger motions and the like, works for both single-person and multi-person scenes, and has excellent robustness; it was the world's first real-time multi-person two-dimensional pose estimation application based on deep learning, and many applications have since sprung up on top of it. Human pose estimation has broad application prospects in fields such as sports and fitness, motion capture, 3D fitting and public-opinion monitoring, a familiar consumer example being Douyin's "dance machine" feature. Based on OpenPose human action recognition, we propose fall monitoring.
In an optional implementation manner, step two of the expression recognition algorithm flow is specifically as follows: for a given i-th face image, the basic feature extracted by the ResNet-18 backbone network is defined as x_i ∈ R^P, where P denotes the dimension of the basic feature. The feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features. Let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} is the j-th latent feature of the i-th face image, D is the dimension of each latent feature, and M is the number of latent features.
The method extracts each latent feature with a fully connected layer followed by an activation function:
l_{i,j} = \sigma_1(W_{1,j}^T x_i) for j = 1, 2, …, M,    (1)
where W_{1,j} is the weight parameter of the fully connected layer and \sigma_1 is the ReLU activation function;
the compact loss learns a center for each type of latent feature and computes the distance between these latent features and their corresponding centers:
\mathcal{L}_C = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{M} \| l_{i,j} - c_j \|_2^2,    (2)
where N denotes the number of pictures in a mini-batch and c_j is the center of the j-th type of latent feature, updated once per mini-batch; a series of compact latent features can be learned efficiently by minimizing the compact loss.
Since different facial expressions share a series of identical latent features, the j-th latent feature extracted from one basic feature should be similar to the j-th latent feature extracted from another basic feature; the compact loss is designed for this reason. The latent features obtained by the FDN are more refined, facial-action-aware features, and they are used in the subsequent expression feature extraction.
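As a rough illustration of step two, the following PyTorch sketch implements the feature decomposition of equation (1) and the compact loss of equation (2) under stated assumptions: the layer sizes (P = 512, D = 64, M = 9) are placeholders, and the centers c_j are kept as ordinary learnable parameters rather than being updated once per mini-batch as the text describes.

```python
import torch
import torch.nn as nn


class FDN(nn.Module):
    """Sketch of the feature decomposition network: x_i (dim P) -> M latent features of dim D."""

    def __init__(self, p: int = 512, d: int = 64, m: int = 9):
        super().__init__()
        # One fully connected layer + ReLU per latent feature, as in Eq. (1).
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(p, d), nn.ReLU()) for _ in range(m)]
        )
        # Centers c_j used by the compact loss (kept as simple learnable parameters here).
        self.centers = nn.Parameter(torch.zeros(m, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (N, P)
        return torch.stack([branch(x) for branch in self.branches], dim=1)  # (N, M, D)

    def compact_loss(self, latents: torch.Tensor) -> torch.Tensor:  # latents: (N, M, D)
        # Eq. (2): half the squared distance between each latent feature and its center,
        # averaged over the N pictures of the mini-batch.
        return 0.5 * ((latents - self.centers) ** 2).sum(dim=(1, 2)).mean()
```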
In an optional embodiment, step three of the expression recognition algorithm is specifically as follows: the feature reconstruction network (FRN), which obtains the different expression features, comprises an intra-feature relation modeling module (Intra-RM). The Intra-RM is composed of several intra-feature relation modeling blocks that model the relations among the elements of each latent feature;
each block consists of a fully connected layer and a sigmoid activation function:
\tilde{\alpha}_{i,j} = \sigma_2(W_{2,j}^T l_{i,j}),    (3)
where \tilde{\alpha}_{i,j} is the relation response of the j-th latent feature of the i-th face image, W_{2,j} is the parameter of the fully connected layer, and \sigma_2 is the sigmoid activation function;
from expression (3), the intra-feature relation weight \alpha_{i,j}, which determines the importance of the j-th latent feature, is obtained as the L1 norm of the response:
\alpha_{i,j} = \| \tilde{\alpha}_{i,j} \|_1 for j = 1, 2, …, M;    (4)
the distribution loss module learns a center for each expression category and computes the distance between the intra-feature relation weight vectors (Intra-Ws) of the same category and the corresponding center; assuming the i-th face image belongs to the k_i-th expression category, the distribution loss is:
\mathcal{L}_D = \frac{1}{2N} \sum_{i=1}^{N} \| w_i - u_{k_i} \|_2^2,    (5)
where w_i = [\alpha_{i,1}, \alpha_{i,2}, …, \alpha_{i,M}]^T is the Intra-W vector of the i-th face picture and u_{k_i} is the category center of the k_i-th expression category;
the balance loss module balances the distribution of the Intra-W vectors:
\mathcal{L}_B = \frac{1}{2} \| \bar{w} - \hat{w} \|_2^2,    (6)
where \bar{w} is the vector formed by the means of the Intra-Ws over the mini-batch samples and \hat{w} is a uniformly distributed weight vector;
after the Intra-W of each latent feature is calculated, the weight is multiplied by the corresponding latent feature to obtain the intra-aware feature of the i-th picture:
f_{i,j} = \alpha_{i,j} l_{i,j} for j = 1, 2, …, M,    (7)
where f_{i,j} is the j-th intra-aware feature of the i-th picture.
The Intra-Ws of different images in the same expression category should be distributed as closely as possible. The distribution loss module is therefore similar to the compact loss module: by optimizing the distribution loss, the Intra-Ws of different images of the same expression category become closer, so that the network can focus on expression-specific information.
In practice, we find that the Intra-Ws of some latent features are generally much higher than the other Intra-Ws of each picture, which leads to an imbalanced overall distribution; the balance loss module is designed to counter this.
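A possible PyTorch sketch of the Intra-RM described above follows. The normalisation of the weight vector before the balance loss, the uniform initialisation of the class centers and all layer sizes are assumptions added for the sketch; only equations (3)-(7) are taken from the text.

```python
import torch
import torch.nn as nn


class IntraRM(nn.Module):
    """Sketch of intra-feature relation modelling: one FC + sigmoid block per latent feature."""

    def __init__(self, d: int = 64, m: int = 9, num_classes: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.Sigmoid()) for _ in range(m)]
        )
        # One center u_k of the Intra-W vector per expression class (Eq. 5), uniform init.
        self.class_centers = nn.Parameter(torch.full((num_classes, m), 1.0 / m))

    def forward(self, latents: torch.Tensor):                  # latents: (N, M, D)
        # Eqs. (3)-(4): relation weight = L1 norm of the sigmoid response.
        weights = torch.stack(
            [blk(latents[:, j]).sum(-1) for j, blk in enumerate(self.blocks)], dim=1
        )                                                      # (N, M)
        # Assumed normalisation so the weights are comparable to a uniform 1/M vector.
        weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
        intra_aware = weights.unsqueeze(-1) * latents          # Eq. (7): f_ij = a_ij * l_ij
        return intra_aware, weights

    def distribution_loss(self, weights, labels):              # weights: (N, M), labels: (N,)
        return 0.5 * ((weights - self.class_centers[labels]) ** 2).sum(-1).mean()

    def balance_loss(self, weights):
        uniform = torch.full_like(weights[0], 1.0 / weights.size(1))
        return 0.5 * ((weights.mean(dim=0) - uniform) ** 2).sum()
```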
In an optional implementation manner, step three of the expression recognition algorithm also comprises the following: the feature reconstruction network (FRN) contains an inter-feature relation modeling module (Inter-RM), which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among them. First, f_{i,j} is fed into an information network for feature encoding; the information network consists of a fully connected layer and a ReLU activation function:
g_{i,j} = \sigma_1(W_{3,j}^T f_{i,j}),    (8)
where W_{3,j} is the weight parameter of the fully connected layer, \sigma_1 is the ReLU activation function, and g_{i,j} is the j-th relation information of the i-th picture;
the relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns are taken as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between the different pieces of relation information; the Inter-W \omega_i(j, m) represents the importance of the relation between node g_{i,j} and node g_{i,m} and is calculated as:
\omega_i(j, m) = \sigma_3\big( S(g_{i,j}, g_{i,m}) \big),    (9)
where g_{i,j} and g_{i,m} are the j-th and m-th relation information of the i-th picture, S is a function evaluating the similarity of g_{i,j} and g_{i,m} (the Euclidean distance function is used in the invention), and, since the results of S are non-negative, the tanh function \sigma_3 is used to normalize the distance value to the interval [0, 1);
according to equation (9), the Inter-W matrix W_i = \{ \omega_i(j, m) \} is obtained.
The j-th inter-aware feature of the i-th picture can then be expressed as:
\hat{f}_{i,j} = \sum_{m=1}^{M} \omega_i(j, m) \, g_{i,m};    (10)
the j-th inter-aware feature is combined with the j-th intra-aware feature to obtain the j-th combined feature of the i-th face image:
\tilde{f}_{i,j} = f_{i,j} + \delta \hat{f}_{i,j},    (11)
where \delta is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the combined features are accumulated to obtain the final expression feature:
y_i = \sum_{j=1}^{M} \tilde{f}_{i,j},    (12)
where y_i is the expression feature of the i-th face picture.
The Intra-W of each latent feature can be learned in the Intra-RM, but these weights are obtained independently. Although the distribution loss module regularizes the Intra-Ws to some extent, it still cannot fully account for the relations between the latent features. In fact, some facial actions are shared by different facial expressions, so exploiting the related facial-action features of different expressions is also important for expression recognition; the Inter-RM (inter-feature relation modeling module) is designed for this purpose.
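The Inter-RM can be sketched as follows, assuming a single shared encoder in place of the per-feature information network and using the tanh-normalised Euclidean distance of equation (9) directly as the inter-feature weight; whether the weight should instead be the complement of that distance is not fully specified in the text, so this choice is an assumption of the sketch.

```python
import torch
import torch.nn as nn


class InterRM(nn.Module):
    """Sketch of inter-feature relation modelling and feature reconstruction (Eqs. 8-12)."""

    def __init__(self, d: int = 64, delta: float = 0.5):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # Eq. (8), shared for brevity
        self.delta = delta                                       # balances Eq. (11)

    def forward(self, intra_aware: torch.Tensor) -> torch.Tensor:  # (N, M, D)
        g = self.encode(intra_aware)          # relation information g_ij
        dist = torch.cdist(g, g)              # pairwise Euclidean distances, (N, M, M)
        inter_w = torch.tanh(dist)            # Eq. (9): distances normalised to [0, 1)
        inter_aware = inter_w @ g             # Eq. (10): weighted aggregation over m
        combined = intra_aware + self.delta * inter_aware   # Eq. (11)
        return combined.sum(dim=1)            # Eq. (12): expression feature y_i, (N, D)
```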
In an alternative embodiment, the ResNet-18 backbone network, the feature decomposition network (FDN), the feature reconstruction network (FRN) and the expression prediction network (EPN) are trained uniformly in an end-to-end manner, and the whole network minimizes the following joint loss function:
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_C + \lambda_2 \mathcal{L}_B + \lambda_3 \mathcal{L}_D,    (13)
where \mathcal{L}_{cls}, \mathcal{L}_C, \mathcal{L}_B and \mathcal{L}_D denote the classification loss, the compact loss, the balance loss and the distribution loss respectively; the cross-entropy loss is used as the classification loss, and \lambda_1, \lambda_2, \lambda_3 are regularization coefficients.
By optimizing the joint loss, more discriminative, fine-grained facial expression features can be extracted for the facial expression recognition task.
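Assuming the module sketches above, the joint objective of equation (13) can be assembled as in the following snippet; the lambda values are illustrative, since the patent only states that they are regularization coefficients.

```python
import torch.nn.functional as F


def joint_loss(logits, labels, l_compact, l_balance, l_distribution,
               lambdas=(1.0, 1.0, 1.0)):
    """Eq. (13): cross-entropy classification loss plus the three auxiliary losses."""
    l_cls = F.cross_entropy(logits, labels)
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_compact + lam2 * l_balance + lam3 * l_distribution
```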
As shown in fig. 3-4, in the visualization results different facial action unit branches focus on different actions of the face, and the final experimental results show that taking facial actions as basic expression units and adopting a fine-grained modeling method of first decoupling and then fusing yields better results.
In an alternative embodiment, step one of the fall detection algorithm is the convolutional neural network: the network architecture iteratively predicts the part affinity fields, which encode keypoint-to-keypoint associations, and detects the keypoint confidence maps.
The network architecture is shown in fig. 6. The blue blocks iteratively predict the affinity fields that encode the keypoint-to-keypoint associations, and the green blocks detect the confidence maps. The predictions are refined over successive stages, t ∈ {1, …, T}, with intermediate supervision at each stage. In the original approach the network architecture included several 7 × 7 convolutional layers. In the model of the invention, three consecutive 3 × 3 convolution kernels are used instead of a single 7 × 7 convolution kernel: a single 7 × 7 kernel needs 2 × 7² − 1 = 97 operations per output value, whereas the three cascaded 3 × 3 kernels need only 3 × (2 × 3² − 1) = 51. In addition, because the three 3 × 3 kernels are cascaded, each followed by its own nonlinearity, the number of nonlinear layers increases and the network retains both lower-level and higher-level features.
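The replacement of a 7 × 7 kernel by three cascaded 3 × 3 kernels can be sketched as below; the channel count and the choice of ReLU as the nonlinearity are assumptions of the sketch.

```python
import torch.nn as nn


def cascaded_3x3_block(channels: int = 128) -> nn.Sequential:
    """Three cascaded 3x3 convolutions with the same receptive field as one 7x7 convolution:
    per output value, 2*7*7 - 1 = 97 operations drop to 3*(2*3*3 - 1) = 51, and each 3x3
    layer contributes its own nonlinearity."""
    layers = []
    for _ in range(3):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)
```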
In an alternative embodiment, step two of the fall detection algorithm is as follows: the image is first analyzed by a CNN (convolutional neural network) to generate a set of feature maps F that is input to the first stage. At this stage the network produces a set of part affinity fields (PAFs), L^1 = \phi^1(F), where \phi^1 denotes the CNN used for inference at stage 1. In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to refine the predictions:
L^t = \phi^t(F, L^{t-1}), for 2 \le t \le T_P,
where \phi^t is the CNN used for inference at stage t and T_P is the total number of PAF stages. After T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction:
S^{T_P} = \rho^t(F, L^{T_P}), for t = T_P,
S^t = \rho^t(F, L^{T_P}, S^{t-1}), for T_P < t \le T_P + T_C,
where \rho^t denotes the CNN used for inference at stage t and T_C is the total number of confidence map stages.
The method of the invention refines the PAF branch and the confidence map branch at each stage, which halves the computation per stage. In subsequent experiments it was found that refined affinity field predictions improve the confidence map results, while the opposite is not true. Intuitively, by looking at the PAF channel output one can guess the positions of the body parts, but a pile of body keypoints without any other information cannot be parsed into different people.
To guide the network to iteratively predict the PAFs of the body parts in the first branch and the confidence maps in the second branch, a loss function is applied at the end of each stage; an L2 loss between the estimated predictions and the ground-truth maps and fields is used, and the loss functions are weighted spatially. Specifically, the loss function of the PAF branch at stage t_i and the loss function of the confidence map branch at stage t_k are:
f_L^{t_i} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^{t_i}(p) - L_c^*(p) \|_2^2,    (5)
f_S^{t_k} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^{t_k}(p) - S_j^*(p) \|_2^2,    (6)
where L_c^* is the ground-truth PAF, S_j^* is the ground-truth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask avoids penalizing true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by periodically replenishing the gradient, and the overall objective is:
f = \sum_{t=1}^{T_P} f_L^t + \sum_{t=T_P+1}^{T_P+T_C} f_S^t.
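A minimal sketch of the spatially weighted L2 losses and the stage-summed objective is shown below; the tensor layouts (channel-first maps, one binary mask per image) are assumptions of the sketch.

```python
import torch


def masked_l2(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eqs. (5)/(6): L2 loss weighted by the binary mask W (W(p) = 0 where labels are missing).
    pred/target: (N, C, H, W); mask: (N, H, W), broadcast over the channel dimension."""
    return ((pred - target) ** 2 * mask.unsqueeze(1)).sum()


def multi_stage_loss(paf_preds, cmap_preds, paf_gt, cmap_gt, mask):
    """Overall objective: intermediate supervision summed over all PAF and confidence stages."""
    loss = sum(masked_l2(p, paf_gt, mask) for p in paf_preds)
    loss = loss + sum(masked_l2(c, cmap_gt, mask) for c in cmap_preds)
    return loss
```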
in an alternative embodiment, the third specific step of the fall detection algorithm is a confidence map of key point detection, in order to find f in formula (6) during the training process S Generating a group truth confidence map S by using the labeled two-dimensional key points * Each confidence map is a 2D representation of the beliefs that a particular body part can be located in any given pixel, generating a personal confidence map for each person k
Figure BDA0003675412110000166
Figure BDA0003675412110000167
Is the position of the group route (true value) of the body key point j of each person k in the image, with the position p at
Figure BDA0003675412110000168
The value of (1) can be defined as:
Figure BDA0003675412110000169
where σ controls the spread of the peak. The network predicted grouped truth confidence map is the aggregation of the single confidence map by the largest operator,
Figure BDA00036754121100001610
Regarding expression (7): ideally, if a person appears in the image, there should be a peak in each confidence map whenever the corresponding keypoint is visible; if multiple people are present in the image, there should be a peak for every visible keypoint j of every person k.
Regarding expression (8): the maximum of the individual confidence maps is taken rather than their average, so that the precision of nearby peaks is preserved. At test time, the confidence maps are predicted and body keypoint candidates are obtained by non-maximum suppression.
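The ground-truth confidence maps of expressions (7) and (8) can be generated as in the following sketch; the array layout for the annotated keypoints and the value of sigma are assumptions.

```python
import numpy as np


def gt_confidence_maps(keypoints: np.ndarray, height: int, width: int, sigma: float = 7.0):
    """Eqs. (7)-(8): one Gaussian peak per visible keypoint per person, aggregated with max.

    keypoints: (K, J, 3) array of (x, y, visible) for person k and joint j.
    Returns an array of shape (J, height, width)."""
    n_people, n_joints, _ = keypoints.shape
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((n_joints, height, width), dtype=np.float32)
    for j in range(n_joints):
        for k in range(n_people):
            x, y, visible = keypoints[k, j]
            if not visible:
                continue
            peak = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)
            maps[j] = np.maximum(maps[j], peak)   # max (not mean) keeps nearby peaks sharp
    return maps
```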
In an alternative embodiment, a specific step of the fall detection algorithm is the part affinity fields (PAFs) for keypoint association: given a set of detected body parts, each pair of body-part detections needs an associated confidence that they belong to the same person, and the PAFs store position and orientation information over the whole support region of each limb.
As shown in fig. 7, given a set of detected body parts (the red and blue dots in fig. 7), how can they be assembled into the full-body poses of an unknown number of people? A confidence measure is needed for each pair of body-part detections, i.e. that they belong to the same person. One possible measure is to detect an additional midpoint between each pair of parts on a limb and check its incidence between candidate part detections, as shown in fig. 7(b). However, when people crowd together, these midpoints are likely to support false associations (the green lines in fig. 7(b)). Such false associations arise mainly for two reasons:
(1) the representation encodes only the position of each limb, not its orientation;
(2) it reduces the support region of a limb to a single point.
The PAFs (part affinity fields) solve these problems: they preserve position and orientation information over the whole limb support region (as shown in fig. 7(c)). Each PAF is a 2D vector field for one limb; for each pixel in the region belonging to a particular limb, the 2D vector encodes the direction pointing from one part of the limb to the other, and each limb has a corresponding PAF connecting its two associated body parts.
Consider the simple limb shown in fig. 8 (an arm). Let x_{j1,k} and x_{j2,k} be the ground-truth positions of the body keypoints j1 and j2 of limb c of person k in the image. If a point p lies on the limb, the PAF value at p is the unit vector pointing from j1 to j2; for all other points the vector is zero.
To evaluate f_L in equation (5) during training, the ground-truth PAF at point p is defined as:
L_{c,k}^*(p) = v if p is on limb (c, k), and 0 otherwise,    (9)
where v = (x_{j2,k} - x_{j1,k}) / \| x_{j2,k} - x_{j1,k} \|_2 is the unit vector in the direction of the limb. The set of points on the limb is defined as the set of points within a distance threshold of the line segment, i.e. the points p for which
0 \le v \cdot (p - x_{j1,k}) \le l_{c,k} and | v_{\perp} \cdot (p - x_{j1,k}) | \le \sigma_l,    (10)
where the limb width \sigma_l is a distance in pixels, the limb length is l_{c,k} = \| x_{j2,k} - x_{j1,k} \|_2, and v_{\perp} is a vector perpendicular to v.
The ground-truth PAF averages the affinity fields of all people in the image:
L_c^*(p) = \frac{1}{n_c(p)} \sum_{k} L_{c,k}^*(p),
where n_c(p) is the number of non-zero vectors at point p over all k persons.
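A sketch of the ground-truth PAF of equations (9) and (10) for a single limb of a single person is given below; for several people the per-person fields would additionally be averaged over the non-zero vectors as stated above. The limb width value is an assumption.

```python
import numpy as np


def gt_paf_single_limb(p1, p2, height: int, width: int, limb_width: float = 8.0):
    """Eqs. (9)-(10): store the unit vector v from keypoint p1 to keypoint p2 at every pixel
    lying within `limb_width` of the segment; zero elsewhere. Returns (2, height, width)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    v = p2 - p1
    length = np.linalg.norm(v) + 1e-8
    v = v / length                                   # unit vector along the limb
    v_perp = np.array([-v[1], v[0]])                 # unit vector perpendicular to the limb
    ys, xs = np.mgrid[0:height, 0:width]
    offset = np.stack([xs - p1[0], ys - p1[1]], axis=-1)   # offset of every pixel from p1
    along = offset @ v                               # projection onto the limb axis
    across = np.abs(offset @ v_perp)                 # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf = np.zeros((2, height, width), dtype=np.float32)
    paf[0][on_limb], paf[1][on_limb] = v[0], v[1]
    return paf
```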
During testing, the association between candidate part detections is measured by computing a line integral over the corresponding PAF along the line segment connecting the candidate keypoint locations. In other words, the alignment of the predicted PAF with the candidate limb formed by connecting the detected body parts is measured. Specifically, for two candidate keypoint locations d_{j1} and d_{j2}, the predicted PAF L_c is sampled along the line segment to measure the confidence of the association between them:
E = \int_{u=0}^{1} L_c\big(p(u)\big) \cdot \frac{d_{j2} - d_{j1}}{\| d_{j2} - d_{j1} \|_2} \, du,    (11)
where p(u) interpolates the positions of the two keypoints d_{j1} and d_{j2}:
p(u) = (1 - u) \, d_{j1} + u \, d_{j2}.
In practice, the integral is approximated by sampling and summing uniformly spaced values of u.
In an alternative embodiment, specific step five of the fall detection algorithm selects a minimum number of edges to obtain a spanning-tree skeleton of the human pose, and a greedy relaxation method is provided that consistently produces high-quality matches. By supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, two classes, "fall" or "no fall", are predicted, thereby enhancing the human pose estimation; five temporal and spatial features are extracted from the pose and processed by an LSTM classifier.
Non-maximum suppression is applied to the detection confidence maps to obtain a set of discrete candidate keypoint locations. For each part there may be several candidate keypoints, due to multiple people in the image or false positives. These candidate keypoints define a large set of possible limbs, and each candidate limb is scored with the line integral over the PAF defined in equation (11). Finding the optimal parse corresponds to a K-dimensional NP-hard matching problem; the invention adopts a greedy relaxation that consistently produces high-quality matches. The presumed reason is that, since the PAF network has a large receptive field, the pairwise association scores implicitly encode the global context.
Formally, the set of body-part detection candidates D_J for multiple people is first obtained:
D_J = \{ d_j^m : j \in \{1, \dots, J\}, \, m \in \{1, \dots, N_j\} \},
where N_j is the number of candidates for keypoint j and d_j^m ∈ R^2 is the position of the m-th detection candidate of keypoint j.
These candidate detection keypoints still need to be associated with the other body parts of the same person, i.e. the part detection pairs that form actually connected limbs must be found. A variable z_{j_1 j_2}^{mn} ∈ {0, 1} is defined to indicate whether the two detection candidates d_{j_1}^m and d_{j_2}^n are connected, and the goal is to find the optimal assignment over the set of all possible connections:
Z = \{ z_{j_1 j_2}^{mn} : j_1, j_2 \in \{1, \dots, J\}, \, m \in \{1, \dots, N_{j_1}\}, \, n \in \{1, \dots, N_{j_2}\} \}.    (12)
if a single keypoint j of the c-th limb is considered 1 And j 2 Finding the optimal correlation problem is simplified to the maximum weight bipartite graph matching problem. In such a graph matching problem, the nodes of the graph are body detection key points
Figure BDA0003675412110000197
And
Figure BDA0003675412110000198
edges are all possible connections between detection candidate pairs. Further, each edge is weighted by the formula (11) part affinity aggregate. The objective in selecting a subset of edges in a bipartite graph, in such a way that no two edges share a node, is to find the most weighted match for the selected edges,
Figure BDA0003675412110000199
Figure BDA0003675412110000201
Figure BDA0003675412110000202
wherein E is c Represents the total weight of matches from limb type c,
Figure BDA0003675412110000203
is of limb type c
Figure BDA0003675412110000204
A subset of (1), E mn Is the key point defined in equation (11)
Figure BDA0003675412110000205
And
Figure BDA0003675412110000206
part affinity in between. Equations (14) and (15) force that no two edges share a node, i.e. that no two limbs of the same type share the same keypoint. The hungarian algorithm can be used to obtain the optimal match. The matching problem is then further decomposed into a set of two matching sub-problems and matches in adjacent tree nodes are independently determined.
When it comes to finding the whole-body poses of multiple people, determining Z is a K-dimensional matching problem. This problem is NP-hard, and relaxations are needed. In this work two relaxations are added to the optimization. First, instead of using the complete graph, a minimum number of edges is selected to obtain a spanning-tree skeleton of the human pose; later experimental results show that greedy minimal inference approximates the global solution well at a fraction of the computational cost. The reason is that the relations between adjacent tree nodes are modeled explicitly by the PAFs, while the relations between non-adjacent tree nodes are modeled implicitly by the CNN: because the CNN is trained with a large receptive field, the PAFs of non-adjacent tree nodes also influence the predicted PAF.
With these two relaxations, the optimization decomposes simply into:
\max_{Z} E = \sum_{c=1}^{C} \max_{Z_c} E_c.    (16)
Therefore, the limb connection candidates are obtained independently for each limb type using equations (13), (14) and (15). With all limb connection candidates, connections sharing the same keypoint detection candidate are assembled into multi-person whole-body poses. This optimization scheme over the tree structure is orders of magnitude faster than optimization over the fully connected graph. The current model also incorporates redundant PAF connections (e.g. between ears and shoulders), which improve the accuracy of pose detection in crowded images. To handle the redundant connections, the multi-person parsing algorithm is slightly adjusted: whereas the original method started from a root component, the algorithm sorts all possible pair connections by their PAF scores; when a connection tries to connect two body parts that have already been assigned to different people, the algorithm recognizes that this would contradict a PAF connection with higher confidence and ignores the current connection.
As shown in fig. 9, fall monitoring is proposed based on OpenPose human action recognition: human pose estimation is enhanced by supporting multi-camera, multi-person tracking, and an LSTM (long short-term memory) neural network predicts the two classes "fall" or "no fall". From the pose, five temporal and spatial features are extracted and processed by the LSTM classifier.
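The fall/no-fall classifier can be sketched as a small LSTM over a window of per-frame pose features, as below; the window length, hidden size and the way the features are obtained are assumptions, the text only stating that five temporal and spatial features are extracted from the pose.

```python
import torch
import torch.nn as nn


class FallClassifier(nn.Module):
    """Sketch: a window of per-frame pose features -> LSTM -> logits for 'fall' / 'no fall'."""

    def __init__(self, n_features: int = 5, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, n_features)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])                          # logits over the two classes


if __name__ == "__main__":
    clf = FallClassifier()
    window = torch.randn(1, 30, 5)                         # 30 frames of 5 pose features
    print(clf(window).softmax(dim=-1))
```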
The above description of the present invention is intended to be illustrative. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A child monitoring method oriented to a smart television, characterized by comprising the following steps:
the smart television automatically monitors the child through an expression recognition algorithm and a fall detection algorithm;
the flow of the expression recognition algorithm is as follows:
step one: a backbone network extracts basic features with a convolutional neural network;
step two: a feature decomposition network decomposes the basic features into a set of facial latent features;
step three: a feature reconstruction network learns the relation weights within each latent feature and the relation weights between features, and then constructs the overall expression feature;
step four: an expression prediction network predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows:
step one: the whole image is taken as the input of a convolutional neural network for joint prediction;
step two: confidence maps for human body part detection are predicted;
step three: human body parts are associated with PAFs;
step four: the parsing step performs a set of bipartite matchings to associate body part candidates;
step five: the candidates are assembled into full-body poses for all people in the image.
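Taken together, claim 1 describes two pipelines whose outputs drive one monitoring decision. The sketch below is a hypothetical illustration of how a smart-TV application might combine them; the function names `recognize_expression` and `detect_fall` and the alert policy are placeholders, not part of the claim.

```python
# Illustrative glue code: combine the two algorithm outputs into one decision.
# `recognize_expression` and `detect_fall` stand in for the expression
# recognition and fall detection pipelines of claim 1; they are hypothetical.

NEGATIVE_EXPRESSIONS = {"fear", "sadness", "anger"}   # assumed label set

def monitor_frame(frame, recognize_expression, detect_fall):
    expression = recognize_expression(frame)   # steps 1-4 of the first pipeline
    fell = detect_fall(frame)                  # steps 1-5 of the second pipeline
    if fell:
        return "alert: fall detected"
    if expression in NEGATIVE_EXPRESSIONS:
        return f"warn: negative expression ({expression})"
    return "ok"

# Hypothetical usage with stub models.
print(monitor_frame(None, lambda f: "happiness", lambda f: False))  # ok
print(monitor_frame(None, lambda f: "fear", lambda f: True))        # alert
```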
2. The smart-television-oriented child monitoring method according to claim 1, characterized in that step two of the expression recognition algorithm specifically comprises: for a given ith face image, the basic feature extracted by the backbone network is defined as x_i ∈ R^P, where P denotes the dimensionality of the basic feature; the feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features; let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} denotes the jth latent feature of the ith face image, D denotes the dimension of each feature, and M denotes the number of features;
the method adopts a mode of adding an activation function to a full connection layer to extract corresponding potential features, and the specific expression is as follows:
Figure FDA0003675412100000021
wherein
Figure FDA0003675412100000022
Representative is the weight parameter, σ, of the fully-connected layer 1 Representative is the ReLU activation function;
the compactless Loss learns the centers of the same potential features and calculates the distance between these potential features and their corresponding centers, as follows:
Figure FDA0003675412100000023
where N denotes the number of pictures in a small batch, c j Representing the center of the category j potential feature.
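A minimal PyTorch sketch of the decomposition step and Compactness Loss of claim 2 follows; the dimensions P, D, M, the per-feature FC branches and the mean normalization are assumptions made for this sketch rather than values fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureDecompositionNet(nn.Module):
    """Decompose a basic feature x_i (dim P) into M latent features (dim D)."""
    def __init__(self, P=512, D=64, M=9):
        super().__init__()
        # One FC + ReLU branch per latent feature, as in expression (1).
        self.branches = nn.ModuleList([nn.Linear(P, D) for _ in range(M)])
        # Learnable centers c_j for the Compactness Loss.
        self.centers = nn.Parameter(torch.zeros(M, D))

    def forward(self, x):                       # x: (N, P)
        latents = torch.stack([torch.relu(b(x)) for b in self.branches], dim=1)
        return latents                          # (N, M, D)

    def compactness_loss(self, latents):
        # Half mean squared distance of each latent feature to its center c_j.
        diff = latents - self.centers.unsqueeze(0)      # (N, M, D)
        return 0.5 * diff.pow(2).sum(dim=-1).mean()

fdn = FeatureDecompositionNet()
l = fdn(torch.randn(8, 512))
print(l.shape, fdn.compactness_loss(l).item())
```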
3. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the expression recognition algorithm specifically comprises: the feature reconstruction network for different expression features comprises an intra-feature relation modeling module, which is composed of a plurality of blocks that model the relations between the latent features and their feature elements within a face image;
each block is composed of a fully-connected layer and a nonlinear activation function, with the specific expression:
α_{i,j} = σ_2(W_{r,j}^T l_{i,j}), (3)
where α_{i,j} denotes the relation weight of the jth class of latent features of the ith face image, W_{r,j} denotes the parameter of the fully-connected layer, and σ_2 denotes the nonlinear activation function;
from expression (3), the L1 norm of α_{i,j} is taken as the intra-feature relation weight (Intra-W) that determines the importance of the jth class of latent features, with the specific expression:
α_{i,j} = ||α_{i,j}||_1  for j = 1, 2, …, M, (4)
the Distribution Loss learns the centers of the various expression classes and computes the distances between Intra-W vectors from the same class and the corresponding centers; assuming the ith face image belongs to the k_i-th expression class, the Distribution Loss is expressed as follows:
L_D = (1/(2N)) Σ_{i=1}^{N} ||w_i − u_{k_i}||_2^2, (5)
where w_i = [α_{i,1}, α_{i,2}, …, α_{i,M}]^T denotes the Intra-W vector of the ith face picture and u_{k_i} denotes the class center of the k_i-th expression;
balance Loss balances the distribution of Intra-W vectors, with the expression:
Figure FDA0003675412100000032
Figure FDA0003675412100000033
represented is a vector of the mean of Intra-W in a block sample,
Figure FDA0003675412100000034
representing a uniformly distributed weight vector;
after computing the Intra-W of each latent feature, the weights are multiplied by the corresponding features to obtain the intra-aware features of the ith picture, with the specific expression:
f_{i,j} = α_{i,j} l_{i,j}  for j = 1, 2, …, M, (7)
where f_{i,j} denotes the jth intra-aware feature of the ith picture.
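The intra-feature relation modeling of claim 3 can be sketched as follows; the choice of a sigmoid for σ_2, the simple quadratic forms used for the Distribution and Balance losses, and the dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class IntraRelationModule(nn.Module):
    """Intra-feature relation modeling: one relation weight per latent feature."""
    def __init__(self, D=64, M=9, num_classes=7):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(D, 1) for _ in range(M)])
        self.sigma2 = nn.Sigmoid()                     # assumed nonlinearity
        # Per-expression-class centers u_k of the Intra-W vectors.
        self.class_centers = nn.Parameter(torch.full((num_classes, M), 1.0 / M))

    def forward(self, latents):                        # latents: (N, M, D)
        # alpha_{i,j} from expressions (3)-(4); sigmoid output is non-negative,
        # so it already equals its L1 norm.
        alphas = torch.cat(
            [self.sigma2(blk(latents[:, j])) for j, blk in enumerate(self.blocks)],
            dim=1)                                     # (N, M) Intra-W vectors
        intra_aware = alphas.unsqueeze(-1) * latents   # expression (7)
        return alphas, intra_aware

    def distribution_loss(self, alphas, labels):
        return 0.5 * (alphas - self.class_centers[labels]).pow(2).sum(1).mean()

    def balance_loss(self, alphas):
        uniform = torch.full_like(alphas[0], 1.0 / alphas.shape[1])
        return 0.5 * (alphas.mean(0) - uniform).pow(2).sum()

m = IntraRelationModule()
a, f = m(torch.randn(8, 9, 64))
labels = torch.randint(0, 7, (8,))
print(a.shape, f.shape, m.balance_loss(a).item(), m.distribution_loss(a, labels).item())
```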
4. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the expression recognition algorithm further comprises: the feature reconstruction network for different expression features comprises an inter-feature relation modeling module, which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among the information; first, f_{i,j} is fed into an information network for feature encoding, where the information network consists of a fully-connected layer and a linear activation function, with the specific expression:
g_{i,j} = σ_1(W_{g,j}^T f_{i,j}), (8)
where W_{g,j} denotes the weight parameter of the fully-connected layer, σ_1 denotes the linear activation function, and g_{i,j} denotes the jth relation information of the ith picture;
the corresponding relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns serve as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between different pieces of relation information; ω_i(j, m) is the Inter-W, which represents the importance of the relation between node g_{i,j} and node g_{i,m}; the specific calculation formula is:
ω_i(j, m) = σ_3(S(g_{i,j}, g_{i,m})), (9)
where g_{i,j} and g_{i,m} respectively denote the jth and mth relation information of the ith picture; S is a similarity function used to evaluate the similarity between g_{i,j} and g_{i,m}, for which the Euclidean distance is adopted; since the results of S are all non-negative, the hyperbolic tangent function σ_3 is further adopted to normalize the distance values to the interval [0, 1);
the Inter-W matrix W_i = {ω_i(j, m)} can then be obtained according to equation (9).
The jth inter-aware feature of the ith picture can be expressed as:
f̂_{i,j} = Σ_{m=1}^{M} ω_i(j, m) g_{i,m}, (10)
the jth inter-aware feature is combined with the jth intra-aware feature to obtain the jth fused aware feature of the ith face image, with the specific expression:
f̃_{i,j} = f_{i,j} + δ f̂_{i,j}, (11)
where δ is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the series of fused aware features is accumulated to obtain the final expression feature, expressed as:
y_i = Σ_{j=1}^{M} f̃_{i,j}, (12)
where y_i denotes the expression feature of the ith face picture.
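A sketch of the inter-feature relation modeling of claim 4, under the assumption that expression (10) aggregates the relation information with the tanh-normalized pairwise distances; the dimensions and the value of δ are illustrative.

```python
import torch
import torch.nn as nn

class InterRelationModule(nn.Module):
    """Inter-feature relation modeling over the M intra-aware features."""
    def __init__(self, D=64, M=9, delta=0.1):
        super().__init__()
        self.info_net = nn.Linear(D, D)          # FC + linear activation, expression (8)
        self.delta = delta                        # regularization coefficient

    def forward(self, intra_aware):               # (N, M, D)
        g = self.info_net(intra_aware)            # relation information g_{i,j}
        # omega_i(j, m) = tanh(Euclidean distance), expression (9).
        dist = torch.cdist(g, g)                  # (N, M, M) pairwise distances
        omega = torch.tanh(dist)                  # normalized to [0, 1)
        inter_aware = torch.bmm(omega, g)         # aggregation, expression (10)
        fused = intra_aware + self.delta * inter_aware     # expression (11)
        return fused.sum(dim=1)                   # expression feature y_i, (12)

m = InterRelationModule()
y = m(torch.randn(8, 9, 64))
print(y.shape)                                    # torch.Size([8, 64])
```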
5. The smart-television-oriented child monitoring method according to claim 1, characterized in that the backbone network, the feature decomposition network, the feature reconstruction network and the expression prediction network are trained jointly in an end-to-end manner; the following joint loss function is minimized over the whole network, expressed as:
L = L_cls + λ_1 L_C + λ_2 L_B + λ_3 L_D,
where L_cls, L_C, L_B and L_D respectively denote the classification loss, the Compactness Loss, the Balance Loss and the Distribution Loss; the cross-entropy loss is used as the classification loss, and λ_1, λ_2, λ_3 are regularization coefficients.
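A short sketch of the joint objective of claim 5; the λ values are placeholders and the auxiliary losses are assumed to be scalars produced elsewhere.

```python
import torch

def joint_loss(logits, labels, l_compact, l_balance, l_distribution,
               lambda1=1.0, lambda2=1.0, lambda3=1.0):
    """Cross-entropy classification loss plus the three auxiliary losses,
    weighted by regularization coefficients (placeholder values)."""
    l_cls = torch.nn.functional.cross_entropy(logits, labels)
    return l_cls + lambda1 * l_compact + lambda2 * l_balance + lambda3 * l_distribution

# Hypothetical usage with dummy tensors for a 7-class expression problem.
logits, labels = torch.randn(8, 7), torch.randint(0, 7, (8,))
aux = torch.tensor(0.1)
print(joint_loss(logits, labels, aux, aux, aux).item())
```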
6. The smart-television-oriented child monitoring method according to claim 1, characterized in that step one of the fall detection algorithm specifically comprises: the convolutional neural network architecture iteratively predicts part affinity fields that encode part-to-part association, together with the keypoint detection confidence maps.
7. The smart-television-oriented child monitoring method according to claim 1, characterized in that step two of the fall detection algorithm specifically comprises: the image is first analyzed by a convolutional neural network to produce a set of feature maps F that is input to the first stage; at this stage, the network produces a set of PAFs, L^1 = φ^1(F), where φ^1 refers to the CNN used for inference at stage 1; in each subsequent stage, the prediction from the previous stage and the original image features F are concatenated and used to refine the prediction,
L^t = φ^t(F, L^{t−1}),  ∀ 2 ≤ t ≤ T_P,
where φ^t refers to the CNN used for inference at stage t and T_P is the total number of PAF stages; after T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction,
S^{T_P} = ρ^{T_P}(F, L^{T_P}),
S^t = ρ^t(F, L^{T_P}, S^{t−1}),  ∀ T_P < t ≤ T_P + T_C,
where ρ^t refers to the CNN used for inference at stage t and T_C is the total number of confidence map stages;
the loss functions are spatially weighted by applying the loss functions at the end of each phase, in particular phase t i And stage t of the PAF branch of k The penalty function of the confidence mapping branch of (1) is as follows:
Figure FDA0003675412100000064
Figure FDA0003675412100000065
where L_c^* is the ground-truth part affinity field, S_j^* is the ground-truth keypoint confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask is used to avoid penalizing true-positive predictions during training; the intermediate supervision at each stage addresses the vanishing-gradient problem by periodically replenishing the gradient, with the overall objective as follows:
f = Σ_{t=1}^{T_P} f_L^t + Σ_{t=T_P+1}^{T_P+T_C} f_S^t.
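The staged prediction and spatially weighted losses of claim 7 can be mirrored by the toy PyTorch sketch below; each stage is reduced to a single convolution and the channel counts are illustrative, so this reflects only the data flow, not the full architecture.

```python
import torch
import torch.nn as nn

class StagedPoseNet(nn.Module):
    """T_P PAF stages followed by T_C confidence-map stages (toy version)."""
    def __init__(self, feat_ch=32, paf_ch=38, cmap_ch=19, T_P=2, T_C=1):
        super().__init__()
        self.paf_stages = nn.ModuleList(
            [nn.Conv2d(feat_ch if t == 0 else feat_ch + paf_ch, paf_ch, 3, padding=1)
             for t in range(T_P)])
        self.cmap_stages = nn.ModuleList(
            [nn.Conv2d(feat_ch + paf_ch if t == 0 else feat_ch + paf_ch + cmap_ch,
                       cmap_ch, 3, padding=1)
             for t in range(T_C)])

    def forward(self, F):
        L = self.paf_stages[0](F)
        for stage in self.paf_stages[1:]:
            L = stage(torch.cat([F, L], dim=1))             # refine PAFs
        S = self.cmap_stages[0](torch.cat([F, L], dim=1))    # start from last PAF
        for stage in self.cmap_stages[1:]:
            S = stage(torch.cat([F, L, S], dim=1))           # refine confidences
        return L, S

def masked_l2(pred, target, mask):
    """Spatially weighted L2 loss; mask(p) = 0 where annotations are missing."""
    return (mask * (pred - target).pow(2)).sum()

net = StagedPoseNet()
F = torch.randn(1, 32, 46, 46)
L, S = net(F)
print(L.shape, S.shape)   # torch.Size([1, 38, 46, 46]) torch.Size([1, 19, 46, 46])
```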
8. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the fall detection algorithm specifically comprises: in order to evaluate f_S in equation (6) during training, ground-truth confidence maps S* are generated from the annotated two-dimensional keypoints; each confidence map is a 2D representation of the belief that a particular body part is located at any given pixel; an individual confidence map S*_{j,k} is first generated for each person k: let x_{j,k} ∈ R^2 be the ground-truth position of body part j of person k in the image; the value at location p ∈ R^2 in S*_{j,k} is defined as:
S*_{j,k}(p) = exp(−||p − x_{j,k}||_2^2 / σ^2),
where σ controls the spread of the peak; the ground-truth confidence map to be predicted by the network is an aggregation of the individual confidence maps via a max operator,
S*_j(p) = max_k S*_{j,k}(p).
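A NumPy sketch of the ground-truth confidence map construction of claim 8: one Gaussian peak per annotated keypoint, aggregated over people with a max operator; the grid size and σ are illustrative.

```python
import numpy as np

def gt_confidence_map(keypoints, height, width, sigma=2.0):
    """keypoints: list of (x, y) ground-truth positions of one body part j,
    one entry per person k. Returns S*_j as an (height, width) array."""
    ys, xs = np.mgrid[0:height, 0:width]
    conf = np.zeros((height, width), dtype=np.float32)
    for (x, y) in keypoints:
        # Individual map S*_{j,k}(p) = exp(-||p - x_{j,k}||^2 / sigma^2).
        person_map = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)
        # Aggregate with the max operator so nearby peaks stay distinct.
        conf = np.maximum(conf, person_map)
    return conf

# Hypothetical usage: two people, one body part, on a 46x46 grid.
S = gt_confidence_map([(10, 12), (30, 25)], 46, 46)
print(S.shape, float(S.max()))    # (46, 46) 1.0
```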
9. The smart-television-oriented child monitoring method according to claim 1, characterized in that step four of the fall detection algorithm specifically comprises: PAFs are used for keypoint association; given a set of detected body parts, each pair of body part detections has an associated confidence that they belong to the same person, and the PAFs preserve both position and orientation information across the whole support region of the limb.
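The association confidence mentioned in claim 9 is typically obtained by sampling the predicted PAF along the segment between two part candidates and integrating its alignment with the limb direction; the NumPy sketch below illustrates that computation under this assumption, with the sampling count and field layout chosen arbitrarily.

```python
import numpy as np

def paf_association_score(paf, p1, p2, num_samples=10):
    """paf: (2, H, W) x/y components of one limb's part affinity field.
    p1, p2: (x, y) candidate positions of the two parts. Returns the mean
    dot product between the sampled field and the unit limb direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-8:
        return 0.0
    direction /= norm
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        # Sample along the segment from p1 to p2 and project onto the limb axis.
        x, y = (p1 + u * direction * norm).round().astype(int)
        score += paf[0, y, x] * direction[0] + paf[1, y, x] * direction[1]
    return score / num_samples

# Hypothetical usage: a field pointing along +x scores high for a horizontal limb.
paf = np.zeros((2, 20, 20)); paf[0] = 1.0
print(paf_association_score(paf, (2, 10), (15, 10)))   # ~1.0
```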
10. The smart-television-oriented child monitoring method according to claim 1, characterized in that step five of the fall detection algorithm specifically comprises: a minimal number of edges is selected to obtain a spanning tree skeleton of the human body pose while still producing high-quality matches; human pose estimation is enhanced by supporting multi-camera and multi-person tracking; an LSTM neural network predicts the two classes "fall" or "no fall"; five temporal and spatial features are extracted from the pose and processed by the LSTM classifier.
CN202210623199.8A 2022-06-01 2022-06-01 Smart television-oriented child monitoring method Pending CN115240127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210623199.8A CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210623199.8A CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Publications (1)

Publication Number Publication Date
CN115240127A true CN115240127A (en) 2022-10-25

Family

ID=83670331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210623199.8A Pending CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Country Status (1)

Country Link
CN (1) CN115240127A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453027A (en) * 2023-06-12 2023-07-18 深圳市玩瞳科技有限公司 AI identification management method for educational robot
CN116453027B (en) * 2023-06-12 2023-08-22 深圳市玩瞳科技有限公司 AI identification management method for educational robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination