CN115240127A - Smart television-oriented child monitoring method - Google Patents

Smart television-oriented child monitoring method

Info

Publication number
CN115240127A
Authority
CN
China
Prior art keywords: feature, expression, features, network, intra
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210623199.8A
Other languages
Chinese (zh)
Inventor
林盛鑫
刘华珠
陈雪芳
赵晓芳
欧超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Priority to CN202210623199.8A priority Critical patent/CN115240127A/en
Publication of CN115240127A publication Critical patent/CN115240127A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/174: Facial expression recognition
    • H04N 21/4415: Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N 21/44218: Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H04N 21/4788: Supplemental services communicating with other users, e.g. chatting

Abstract

The invention discloses a child monitoring method oriented to a smart television, comprising an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child. While the smart television is playing, if the child jumps around on a sofa or table and may be in danger, the system sends a safety reminder to the parents through the companion app. If the child sees frightening content while watching a program, the algorithm automatically recognizes the child's expression, switches the program or turns off the television, and at the same time the system notifies the parents through the app. If the child falls down in the living room for some reason, the system recognizes the fall and makes a further judgement: if the child's expression is pained or the child is crying, the app immediately sends an alarm to the parents; if the child does not cry after the fall, or the expression is normal, only a record is made in the system, and the parents can review the child's fall records in the fall monitoring module of the app.

Description

Smart television-oriented child monitoring method
Technical Field
The invention relates to an image recognition method, in particular to a child monitoring method oriented to a smart television, and belongs to the field of image processing.
Background
Smart furniture builds on modern furniture and integrates combination intelligence, electronic intelligence, mechanical intelligence and Internet-of-Things intelligence into furniture products, making furniture intelligent, international and fashionable and home life more convenient and comfortable. Smart furniture is an important component of a modern lifestyle and a development trend of the international home of the future.
The smart television is a television product based on Internet technology: it carries an open operating system and chip and an open application platform, can realize bidirectional human-machine interaction, and integrates audio and video, entertainment, data and other functions to meet the diversified and personalized needs of users. Its purpose is to bring a more convenient experience to the user.
The smart television is a typical representative of the smart home, a new generation of television that keeps growing and advancing with the times. What matters most for the smart television is that it must carry a fully open platform; only through a fully open platform can consumers actively participate in defining the functions of the television, so that "demand customization" and "entertainment" of the television can be realized, and this is the only effective way to develop intelligent televisions.
Children are naturally active; if a child sees inappropriate content on the television, their physical and psychological health can be affected. Meanwhile, the living room is the child's main activity space: if the child falls down in the living room and the parents do not notice in time, treatment can be delayed and the child may suffer secondary injury.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a child monitoring method oriented to a smart television, in which the smart television automatically runs an expression recognition algorithm and a fall detection algorithm on the child, so that, according to the recognition results, the system can remind the parents through the app or make a record.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A child monitoring method oriented to a smart television is characterized in that: the method comprises an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child;
the flow of the expression recognition algorithm is as follows:
step one: a ResNet-18 backbone network extracts basic CNN (convolutional neural network) features;
step two: a feature decomposition network (FDN) decomposes the basic features into a set of facial-action-aware latent features;
step three: a feature reconstruction network (FRN) learns an intra-feature relation weight and an inter-feature relation weight for each latent feature and then constructs the overall expression feature;
step four: an expression prediction network (EPN) predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows (a sketch of how the two algorithms could be chained is given after this list):
step one: take the whole image as the input of a CNN (convolutional neural network) for joint prediction;
step two: compute confidence maps for body-part detection;
step three: compute part affinity fields (PAFs) for body-part association;
step four: the parsing step performs a set of bipartite matchings to associate body-part candidates;
step five: assemble them into full-body poses for all people in the image.
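To make the behaviour described in the two step lists concrete, the following minimal Python sketch shows how the outputs of the two algorithms could be combined into the monitoring flow of the invention. All class and function names (FrameResult, ParentApp, monitor_step) and the label strings are illustrative assumptions, not part of the claimed method.

```python
"""Hypothetical glue logic for the monitoring flow; names and labels are assumptions."""
from dataclasses import dataclass


@dataclass
class FrameResult:
    expression: str      # e.g. "happy", "fear", "pain", "cry", "neutral"
    fall_detected: bool  # output of the pose-based fall detection algorithm


class ParentApp:
    """Stand-in for the companion app; a real system would push notifications."""

    def notify(self, msg: str) -> None:
        print("[notify]", msg)

    def alarm(self, msg: str) -> None:
        print("[ALARM]", msg)

    def log_fall(self) -> None:
        print("[log] fall recorded for later review in the fall monitoring module")


def monitor_step(result: FrameResult, app: ParentApp) -> None:
    """Apply the decision rules described in the abstract to one analysed frame."""
    if result.expression == "fear":
        # Frightening content: the TV switches the program or turns off, parents are told.
        app.notify("Child appears frightened; program switched.")
    if result.fall_detected:
        if result.expression in ("pain", "cry"):
            app.alarm("Child fell and appears to be hurt.")
        else:
            app.log_fall()


if __name__ == "__main__":
    monitor_step(FrameResult(expression="cry", fall_detected=True), ParentApp())
```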
Further, step two of the expression recognition algorithm flow is specifically as follows: for a given i-th face image, the basic feature extracted by the ResNet-18 backbone network is defined as x_i ∈ R^P, where P denotes the dimension of the basic feature. The feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features. Let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} is the j-th latent feature of the i-th face image, D is the dimension of each latent feature, and M is the number of latent features;
the method extracts each latent feature with a fully connected layer followed by an activation function:
l_{i,j} = \sigma_1(W_{1,j}^T x_i) for j = 1, 2, …, M,    (1)
where W_{1,j} is the weight parameter of the fully connected layer and \sigma_1 is the ReLU activation function;
the compact loss learns a center for each type of latent feature and computes the distance between these latent features and their corresponding centers:
\mathcal{L}_C = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{M} \| l_{i,j} - c_j \|_2^2,    (2)
where N is the number of pictures in one mini-batch and c_j is the center of the j-th type of latent feature.
Further, step three of the expression recognition algorithm is specifically as follows: the feature reconstruction network (FRN), which obtains the different expression features, comprises an intra-feature relation modeling module (Intra-RM). The Intra-RM is composed of several intra-feature relation modeling blocks that model the relations among the elements of each latent feature; each block consists of a fully connected layer and a sigmoid activation function:
\tilde{\alpha}_{i,j} = \sigma_2(W_{2,j}^T l_{i,j}),    (3)
where \tilde{\alpha}_{i,j} is the relation response of the j-th latent feature of the i-th face image, W_{2,j} is the parameter of the fully connected layer, and \sigma_2 is the sigmoid activation function;
from expression (3), the intra-feature relation weight \alpha_{i,j}, which determines the importance of the j-th latent feature, is obtained as the L1 norm of the response:
\alpha_{i,j} = \| \tilde{\alpha}_{i,j} \|_1 for j = 1, 2, …, M;    (4)
the distribution loss module learns a center for each expression category and computes the distance between the intra-feature relation weight vectors (Intra-Ws) of the same category and the corresponding center; assuming the i-th face image belongs to the k_i-th expression category, the distribution loss is:
\mathcal{L}_D = \frac{1}{2N} \sum_{i=1}^{N} \| w_i - u_{k_i} \|_2^2,    (5)
where w_i = [\alpha_{i,1}, \alpha_{i,2}, …, \alpha_{i,M}]^T is the Intra-W vector of the i-th face picture and u_{k_i} is the category center of the k_i-th expression category;
the balance loss module balances the distribution of the Intra-W vectors:
\mathcal{L}_B = \frac{1}{2} \| \bar{w} - \hat{w} \|_2^2,    (6)
where \bar{w} is the vector formed by the means of the Intra-Ws over the mini-batch samples and \hat{w} is a uniformly distributed weight vector;
after the Intra-W of each latent feature is calculated, the weight is multiplied by the corresponding latent feature to obtain the intra-aware feature of the i-th picture:
f_{i,j} = \alpha_{i,j} l_{i,j} for j = 1, 2, …, M,    (7)
where f_{i,j} is the j-th intra-aware feature of the i-th picture.
Further, step three of the expression recognition algorithm also comprises the following: the feature reconstruction network (FRN) contains an inter-feature relation modeling module (Inter-RM), which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among them. First, f_{i,j} is fed into an information network for feature encoding; the information network consists of a fully connected layer and a ReLU activation function:
g_{i,j} = \sigma_1(W_{3,j}^T f_{i,j}),    (8)
where W_{3,j} is the weight parameter of the fully connected layer, \sigma_1 is the ReLU activation function, and g_{i,j} is the j-th relation information of the i-th picture;
the relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns are taken as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between the different pieces of relation information; the Inter-W \omega_i(j, m) represents the importance of the relation between node g_{i,j} and node g_{i,m} and is calculated as:
\omega_i(j, m) = \sigma_3\big( S(g_{i,j}, g_{i,m}) \big),    (9)
where g_{i,j} and g_{i,m} are the j-th and m-th relation information of the i-th picture, S is a function evaluating the similarity of g_{i,j} and g_{i,m} (here the Euclidean distance function), and, since the results of S are non-negative, the tanh function \sigma_3 is used to normalize the distance value to the interval [0, 1);
according to equation (9), the Inter-W matrix W_i = \{ \omega_i(j, m) \} is obtained.
The j-th inter-aware feature of the i-th picture can then be expressed as:
\hat{f}_{i,j} = \sum_{m=1}^{M} \omega_i(j, m) \, g_{i,m};    (10)
the j-th inter-aware feature is combined with the j-th intra-aware feature to obtain the j-th combined feature of the i-th face image:
\tilde{f}_{i,j} = f_{i,j} + \delta \hat{f}_{i,j},    (11)
where \delta is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the combined features are accumulated to obtain the final expression feature:
y_i = \sum_{j=1}^{M} \tilde{f}_{i,j},    (12)
where y_i is the expression feature of the i-th face picture.
Further, the ResNet-18 backbone network, the feature decomposition network (FDN), the feature reconstruction network (FRN) and the expression prediction network (EPN) are trained uniformly in an end-to-end manner, and the whole network minimizes the following joint loss function:
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_C + \lambda_2 \mathcal{L}_B + \lambda_3 \mathcal{L}_D,    (13)
where \mathcal{L}_{cls}, \mathcal{L}_C, \mathcal{L}_B and \mathcal{L}_D denote the classification loss, the compact loss, the balance loss and the distribution loss respectively; the cross-entropy loss is used as the classification loss, and \lambda_1, \lambda_2, \lambda_3 are regularization coefficients.
Further, step one of the fall detection algorithm is the convolutional neural network: the network architecture iteratively predicts the part affinity fields, which encode keypoint-to-keypoint associations, and detects the keypoint confidence maps.
Further, step two of the fall detection algorithm is as follows: the image is first analyzed by a CNN (convolutional neural network) to generate a set of feature maps F that is input to the first stage. At this stage the network produces a set of part affinity fields (PAFs), L^1 = \phi^1(F), where \phi^1 denotes the CNN used for inference at stage 1. In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to refine the predictions:
L^t = \phi^t(F, L^{t-1}), for 2 \le t \le T_P,
where \phi^t is the CNN used for inference at stage t and T_P is the total number of PAF stages. After T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction:
S^{T_P} = \rho^t(F, L^{T_P}), for t = T_P,
S^t = \rho^t(F, L^{T_P}, S^{t-1}), for T_P < t \le T_P + T_C,
where \rho^t denotes the CNN used for inference at stage t and T_C is the total number of confidence map stages;
a loss function is applied at the end of each stage, and the loss functions are weighted spatially; specifically, the loss function of the PAF branch at stage t_i and the loss function of the confidence map branch at stage t_k are:
f_L^{t_i} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^{t_i}(p) - L_c^*(p) \|_2^2,    (5)
f_S^{t_k} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^{t_k}(p) - S_j^*(p) \|_2^2,    (6)
where L_c^* is the ground-truth PAF, S_j^* is the ground-truth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask avoids penalizing true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by periodically replenishing the gradient, and the overall objective is:
f = \sum_{t=1}^{T_P} f_L^t + \sum_{t=T_P+1}^{T_P+T_C} f_S^t.
Further, the third specific step of the fall detection algorithm is the keypoint detection confidence map. In order to evaluate f_S in equation (6) during training, a ground-truth confidence map S^* is generated from the annotated two-dimensional keypoints. Each confidence map is a 2D representation of the belief that a particular body part is located at any given pixel. A confidence map S_{j,k}^* is generated for each person k; let x_{j,k} be the ground-truth position of body keypoint j of person k in the image, then the value at position p in S_{j,k}^* is defined as:
S_{j,k}^*(p) = \exp\left( - \frac{\| p - x_{j,k} \|_2^2}{\sigma^2} \right),    (7)
where \sigma controls the spread of the peak. The ground-truth confidence map predicted by the network is the aggregation of the individual confidence maps via a max operator:
S_j^*(p) = \max_k S_{j,k}^*(p).    (8)
further, the specific step of the fall detection algorithm is PAFs (part affinity fields) for keypoint association, given a set of detected body parts, each pair of body part detections has associated confidence, i.e. they belong to the same person, and the PAFs (part affinity fields) stores position and direction information of the entire limb support area.
Furthermore, specific step five of the fall detection algorithm selects a minimum number of edges to obtain a spanning-tree skeleton of the human pose, and a greedy relaxation method is provided that consistently produces high-quality matches. By supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, two classes, "fall" or "no fall", are predicted, thereby enhancing the human pose estimation; five temporal and spatial features are extracted from the pose and processed by an LSTM classifier.
Compared with the prior art, the invention has the following advantages and effects:
1. The invention provides a child monitoring method oriented to a smart television in which the expression recognition algorithm first decomposes the face picture into different facial actions, takes the facial actions as basic units, and then fuses them according to the relations between the different facial actions to obtain the emotional expression of the whole face picture.
2. The invention provides a child monitoring method oriented to a smart television in which the fall detection algorithm is based on OpenPose human action recognition and predicts two classes, "fall" or "no fall", by supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, thereby enhancing the human pose estimation. From the pose, five temporal and spatial features are extracted and processed by an LSTM classifier.
3. Through the expression recognition algorithm and the fall detection algorithm, when the child jumps around on a sofa or table and danger may occur, the system sends a safety reminder to the parents through the app.
4. Through the expression recognition algorithm and the fall detection algorithm, if the child sees frightening content while watching a program, the algorithm automatically recognizes the child's expression, switches the program or turns off the television, and at the same time the system notifies the parents through the app.
5. Through the expression recognition algorithm and the fall detection algorithm, when the child falls down for some reason within the living room (for example, from running and jumping around, or a wet and slippery floor), the system recognizes the fall and makes a further judgement: if the child's expression is pained or the child is crying, the app immediately sends an alarm to the parents; if the child does not cry after the fall or the expression is normal, only a record is made in the system, and the parents can review the child's fall records through the fall monitoring module in the app.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a frame diagram of an expression recognition algorithm of the present invention;
FIG. 3 is a first facial action expression feature result graph of the present invention;
FIG. 4 is a second facial motion expression feature diagram of the present invention;
fig. 5 is a general flow chart of a fall detection method of the invention;
fig. 6 is a network architecture diagram of a fall detection algorithm of the invention;
fig. 7 is a body part map of fall detection of the invention;
fig. 8 is an arm limb diagram of fall detection of the present invention;
fig. 9 is a body posture diagram of the fall detection algorithm of the invention.
Detailed Description
The present invention is further illustrated by the following examples, which are illustrative and are not to be construed as limiting the invention.
The child monitoring method oriented to the smart television, as shown in fig. 1, comprises an expression recognition algorithm and a fall detection algorithm that the smart television runs automatically on the child;
the flow of the expression recognition algorithm is as follows:
step one: a ResNet-18 backbone network extracts basic CNN (convolutional neural network) features;
step two: a feature decomposition network (FDN) decomposes the basic features into a set of facial-action-aware latent features;
step three: a feature reconstruction network (FRN) learns an intra-feature relation weight and an inter-feature relation weight for each latent feature and then constructs the overall expression feature;
step four: an expression prediction network (EPN) predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows:
step one: take the whole image as the input of a CNN (convolutional neural network) for joint prediction;
step two: compute confidence maps for body-part detection;
step three: compute part affinity fields (PAFs) for body-part association;
step four: the parsing step performs a set of bipartite matchings to associate body-part candidates;
step five: assemble them into full-body poses for all people in the image.
In the expression recognition algorithm, different emotions contain the same facial actions while the combinations of facial actions differ between emotions, for example between surprise, fear, anger, happiness and sadness. In order to better mine the relation between facial actions and facial expressions, the expression recognition algorithm first decomposes the face picture into different facial actions, takes the facial actions as basic units, and then fuses them according to the relations between the different facial actions to obtain the emotional expression of the whole face picture. As shown in fig. 2-4, the expression recognition algorithm uses a facial expression recognition method based on feature decomposition and reconstruction learning, which learns the shared information and the specific information in the expression information respectively, so as to extract more salient expression features.
As shown in fig. 5-9, the OpenPose human pose recognition project is an open-source library developed by Carnegie Mellon University (CMU) in the United States, based on convolutional neural networks and supervised learning and built on the Caffe framework. It can estimate human body actions, facial expressions, finger motions and the like, works for both single-person and multi-person scenes, and has excellent robustness; it was the world's first real-time multi-person two-dimensional pose estimation application based on deep learning, and many applications have since sprung up on top of it. Human pose estimation has broad application prospects in fields such as sports and fitness, motion capture, 3D fitting and public-opinion monitoring, a familiar consumer example being Douyin's "dance machine" feature. Based on OpenPose human action recognition, we propose fall monitoring.
In an optional implementation manner, step two of the expression recognition algorithm flow is specifically as follows: for a given i-th face image, the basic feature extracted by the ResNet-18 backbone network is defined as x_i ∈ R^P, where P denotes the dimension of the basic feature. The feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features. Let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} is the j-th latent feature of the i-th face image, D is the dimension of each latent feature, and M is the number of latent features.
The method extracts each latent feature with a fully connected layer followed by an activation function:
l_{i,j} = \sigma_1(W_{1,j}^T x_i) for j = 1, 2, …, M,    (1)
where W_{1,j} is the weight parameter of the fully connected layer and \sigma_1 is the ReLU activation function;
the compact loss learns a center for each type of latent feature and computes the distance between these latent features and their corresponding centers:
\mathcal{L}_C = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{M} \| l_{i,j} - c_j \|_2^2,    (2)
where N denotes the number of pictures in a mini-batch and c_j is the center of the j-th type of latent feature, updated once per mini-batch; a series of compact latent features can be learned efficiently by minimizing the compact loss.
Since different facial expressions share a series of identical latent features, the j-th latent feature extracted from one basic feature should be similar to the j-th latent feature extracted from another basic feature; the compact loss is designed for this reason. The latent features obtained by the FDN are more refined, facial-action-aware features, and they are used in the subsequent expression feature extraction.
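As a rough illustration of step two, the following PyTorch sketch implements the feature decomposition of equation (1) and the compact loss of equation (2) under stated assumptions: the layer sizes (P = 512, D = 64, M = 9) are placeholders, and the centers c_j are kept as ordinary learnable parameters rather than being updated once per mini-batch as the text describes.

```python
import torch
import torch.nn as nn


class FDN(nn.Module):
    """Sketch of the feature decomposition network: x_i (dim P) -> M latent features of dim D."""

    def __init__(self, p: int = 512, d: int = 64, m: int = 9):
        super().__init__()
        # One fully connected layer + ReLU per latent feature, as in Eq. (1).
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(p, d), nn.ReLU()) for _ in range(m)]
        )
        # Centers c_j used by the compact loss (kept as simple learnable parameters here).
        self.centers = nn.Parameter(torch.zeros(m, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (N, P)
        return torch.stack([branch(x) for branch in self.branches], dim=1)  # (N, M, D)

    def compact_loss(self, latents: torch.Tensor) -> torch.Tensor:  # latents: (N, M, D)
        # Eq. (2): half the squared distance between each latent feature and its center,
        # averaged over the N pictures of the mini-batch.
        return 0.5 * ((latents - self.centers) ** 2).sum(dim=(1, 2)).mean()
```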
In an optional embodiment, step three of the expression recognition algorithm is specifically as follows: the feature reconstruction network (FRN), which obtains the different expression features, comprises an intra-feature relation modeling module (Intra-RM). The Intra-RM is composed of several intra-feature relation modeling blocks that model the relations among the elements of each latent feature;
each block consists of a fully connected layer and a sigmoid activation function:
\tilde{\alpha}_{i,j} = \sigma_2(W_{2,j}^T l_{i,j}),    (3)
where \tilde{\alpha}_{i,j} is the relation response of the j-th latent feature of the i-th face image, W_{2,j} is the parameter of the fully connected layer, and \sigma_2 is the sigmoid activation function;
from expression (3), the intra-feature relation weight \alpha_{i,j}, which determines the importance of the j-th latent feature, is obtained as the L1 norm of the response:
\alpha_{i,j} = \| \tilde{\alpha}_{i,j} \|_1 for j = 1, 2, …, M;    (4)
the distribution loss module learns a center for each expression category and computes the distance between the intra-feature relation weight vectors (Intra-Ws) of the same category and the corresponding center; assuming the i-th face image belongs to the k_i-th expression category, the distribution loss is:
\mathcal{L}_D = \frac{1}{2N} \sum_{i=1}^{N} \| w_i - u_{k_i} \|_2^2,    (5)
where w_i = [\alpha_{i,1}, \alpha_{i,2}, …, \alpha_{i,M}]^T is the Intra-W vector of the i-th face picture and u_{k_i} is the category center of the k_i-th expression category;
the balance loss module balances the distribution of the Intra-W vectors:
\mathcal{L}_B = \frac{1}{2} \| \bar{w} - \hat{w} \|_2^2,    (6)
where \bar{w} is the vector formed by the means of the Intra-Ws over the mini-batch samples and \hat{w} is a uniformly distributed weight vector;
after the Intra-W of each latent feature is calculated, the weight is multiplied by the corresponding latent feature to obtain the intra-aware feature of the i-th picture:
f_{i,j} = \alpha_{i,j} l_{i,j} for j = 1, 2, …, M,    (7)
where f_{i,j} is the j-th intra-aware feature of the i-th picture.
The Intra-Ws of different images in the same expression category should be distributed as closely as possible. The distribution loss module is therefore similar to the compact loss module: by optimizing the distribution loss, the Intra-Ws of different images of the same expression category become closer, so that the network can focus on expression-specific information.
In practice, we find that the Intra-Ws of some latent features are generally much higher than the other Intra-Ws of each picture, which leads to an imbalanced overall distribution; the balance loss module is designed to counter this.
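A possible PyTorch sketch of the Intra-RM described above follows. The normalisation of the weight vector before the balance loss, the uniform initialisation of the class centers and all layer sizes are assumptions added for the sketch; only equations (3)-(7) are taken from the text.

```python
import torch
import torch.nn as nn


class IntraRM(nn.Module):
    """Sketch of intra-feature relation modelling: one FC + sigmoid block per latent feature."""

    def __init__(self, d: int = 64, m: int = 9, num_classes: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.Sigmoid()) for _ in range(m)]
        )
        # One center u_k of the Intra-W vector per expression class (Eq. 5), uniform init.
        self.class_centers = nn.Parameter(torch.full((num_classes, m), 1.0 / m))

    def forward(self, latents: torch.Tensor):                  # latents: (N, M, D)
        # Eqs. (3)-(4): relation weight = L1 norm of the sigmoid response.
        weights = torch.stack(
            [blk(latents[:, j]).sum(-1) for j, blk in enumerate(self.blocks)], dim=1
        )                                                      # (N, M)
        # Assumed normalisation so the weights are comparable to a uniform 1/M vector.
        weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
        intra_aware = weights.unsqueeze(-1) * latents          # Eq. (7): f_ij = a_ij * l_ij
        return intra_aware, weights

    def distribution_loss(self, weights, labels):              # weights: (N, M), labels: (N,)
        return 0.5 * ((weights - self.class_centers[labels]) ** 2).sum(-1).mean()

    def balance_loss(self, weights):
        uniform = torch.full_like(weights[0], 1.0 / weights.size(1))
        return 0.5 * ((weights.mean(dim=0) - uniform) ** 2).sum()
```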
In an optional implementation manner, step three of the expression recognition algorithm also comprises the following: the feature reconstruction network (FRN) contains an inter-feature relation modeling module (Inter-RM), which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among them. First, f_{i,j} is fed into an information network for feature encoding; the information network consists of a fully connected layer and a ReLU activation function:
g_{i,j} = \sigma_1(W_{3,j}^T f_{i,j}),    (8)
where W_{3,j} is the weight parameter of the fully connected layer, \sigma_1 is the ReLU activation function, and g_{i,j} is the j-th relation information of the i-th picture;
the relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns are taken as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between the different pieces of relation information; the Inter-W \omega_i(j, m) represents the importance of the relation between node g_{i,j} and node g_{i,m} and is calculated as:
\omega_i(j, m) = \sigma_3\big( S(g_{i,j}, g_{i,m}) \big),    (9)
where g_{i,j} and g_{i,m} are the j-th and m-th relation information of the i-th picture, S is a function evaluating the similarity of g_{i,j} and g_{i,m} (the Euclidean distance function is used in the invention), and, since the results of S are non-negative, the tanh function \sigma_3 is used to normalize the distance value to the interval [0, 1);
according to equation (9), the Inter-W matrix W_i = \{ \omega_i(j, m) \} is obtained.
The j-th inter-aware feature of the i-th picture can then be expressed as:
\hat{f}_{i,j} = \sum_{m=1}^{M} \omega_i(j, m) \, g_{i,m};    (10)
the j-th inter-aware feature is combined with the j-th intra-aware feature to obtain the j-th combined feature of the i-th face image:
\tilde{f}_{i,j} = f_{i,j} + \delta \hat{f}_{i,j},    (11)
where \delta is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the combined features are accumulated to obtain the final expression feature:
y_i = \sum_{j=1}^{M} \tilde{f}_{i,j},    (12)
where y_i is the expression feature of the i-th face picture.
The Intra-W of each latent feature can be learned in the Intra-RM, but these weights are obtained independently. Although the distribution loss module regularizes the Intra-Ws to some extent, it still cannot fully account for the relations between the latent features. In fact, some facial actions are shared by different facial expressions, so exploiting the related facial-action features of different expressions is also important for expression recognition; the Inter-RM (inter-feature relation modeling module) is designed for this purpose.
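The Inter-RM can be sketched as follows, assuming a single shared encoder in place of the per-feature information network and using the tanh-normalised Euclidean distance of equation (9) directly as the inter-feature weight; whether the weight should instead be the complement of that distance is not fully specified in the text, so this choice is an assumption of the sketch.

```python
import torch
import torch.nn as nn


class InterRM(nn.Module):
    """Sketch of inter-feature relation modelling and feature reconstruction (Eqs. 8-12)."""

    def __init__(self, d: int = 64, delta: float = 0.5):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # Eq. (8), shared for brevity
        self.delta = delta                                       # balances Eq. (11)

    def forward(self, intra_aware: torch.Tensor) -> torch.Tensor:  # (N, M, D)
        g = self.encode(intra_aware)          # relation information g_ij
        dist = torch.cdist(g, g)              # pairwise Euclidean distances, (N, M, M)
        inter_w = torch.tanh(dist)            # Eq. (9): distances normalised to [0, 1)
        inter_aware = inter_w @ g             # Eq. (10): weighted aggregation over m
        combined = intra_aware + self.delta * inter_aware   # Eq. (11)
        return combined.sum(dim=1)            # Eq. (12): expression feature y_i, (N, D)
```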
In an alternative embodiment, the ResNet-18 backbone network, the feature decomposition network (FDN), the feature reconstruction network (FRN) and the expression prediction network (EPN) are trained uniformly in an end-to-end manner, and the whole network minimizes the following joint loss function:
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_C + \lambda_2 \mathcal{L}_B + \lambda_3 \mathcal{L}_D,    (13)
where \mathcal{L}_{cls}, \mathcal{L}_C, \mathcal{L}_B and \mathcal{L}_D denote the classification loss, the compact loss, the balance loss and the distribution loss respectively; the cross-entropy loss is used as the classification loss, and \lambda_1, \lambda_2, \lambda_3 are regularization coefficients.
By optimizing the joint loss, more discriminative, fine-grained facial expression features can be extracted for the facial expression recognition task.
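Assuming the module sketches above, the joint objective of equation (13) can be assembled as in the following snippet; the lambda values are illustrative, since the patent only states that they are regularization coefficients.

```python
import torch.nn.functional as F


def joint_loss(logits, labels, l_compact, l_balance, l_distribution,
               lambdas=(1.0, 1.0, 1.0)):
    """Eq. (13): cross-entropy classification loss plus the three auxiliary losses."""
    l_cls = F.cross_entropy(logits, labels)
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_compact + lam2 * l_balance + lam3 * l_distribution
```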
As shown in fig. 3-4, in the visualization results different facial action unit branches focus on different actions of the face, and the final experimental results show that taking facial actions as basic expression units and adopting a fine-grained modeling method of first decoupling and then fusing yields better results.
In an alternative embodiment, step one of the fall detection algorithm is the convolutional neural network: the network architecture iteratively predicts the part affinity fields, which encode keypoint-to-keypoint associations, and detects the keypoint confidence maps.
The network architecture is shown in fig. 6. The blue blocks iteratively predict the affinity fields that encode the keypoint-to-keypoint associations, and the green blocks detect the confidence maps. The predictions are refined over successive stages, t ∈ {1, …, T}, with intermediate supervision at each stage. In the original approach the network architecture included several 7 × 7 convolutional layers. In the model of the invention, three consecutive 3 × 3 convolution kernels are used instead of a single 7 × 7 convolution kernel: a single 7 × 7 kernel needs 2 × 7² − 1 = 97 operations per output value, whereas the three cascaded 3 × 3 kernels need only 3 × (2 × 3² − 1) = 51. In addition, because the three 3 × 3 kernels are cascaded, each followed by its own nonlinearity, the number of nonlinear layers increases and the network retains both lower-level and higher-level features.
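The replacement of a 7 × 7 kernel by three cascaded 3 × 3 kernels can be sketched as below; the channel count and the choice of ReLU as the nonlinearity are assumptions of the sketch.

```python
import torch.nn as nn


def cascaded_3x3_block(channels: int = 128) -> nn.Sequential:
    """Three cascaded 3x3 convolutions with the same receptive field as one 7x7 convolution:
    per output value, 2*7*7 - 1 = 97 operations drop to 3*(2*3*3 - 1) = 51, and each 3x3
    layer contributes its own nonlinearity."""
    layers = []
    for _ in range(3):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)
```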
In an alternative embodiment, step two of the fall detection algorithm is as follows: the image is first analyzed by a CNN (convolutional neural network) to generate a set of feature maps F that is input to the first stage. At this stage the network produces a set of part affinity fields (PAFs), L^1 = \phi^1(F), where \phi^1 denotes the CNN used for inference at stage 1. In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to refine the predictions:
L^t = \phi^t(F, L^{t-1}), for 2 \le t \le T_P,
where \phi^t is the CNN used for inference at stage t and T_P is the total number of PAF stages. After T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction:
S^{T_P} = \rho^t(F, L^{T_P}), for t = T_P,
S^t = \rho^t(F, L^{T_P}, S^{t-1}), for T_P < t \le T_P + T_C,
where \rho^t denotes the CNN used for inference at stage t and T_C is the total number of confidence map stages.
The method of the invention refines the PAF branch and the confidence map branch at each stage, which halves the computation per stage. In subsequent experiments it was found that refined affinity field predictions improve the confidence map results, while the opposite is not true. Intuitively, by looking at the PAF channel output one can guess the positions of the body parts, but a pile of body keypoints without any other information cannot be parsed into different people.
To guide the network to iteratively predict the PAFs of the body parts in the first branch and the confidence maps in the second branch, a loss function is applied at the end of each stage; an L2 loss between the estimated predictions and the ground-truth maps and fields is used, and the loss functions are weighted spatially. Specifically, the loss function of the PAF branch at stage t_i and the loss function of the confidence map branch at stage t_k are:
f_L^{t_i} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^{t_i}(p) - L_c^*(p) \|_2^2,    (5)
f_S^{t_k} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^{t_k}(p) - S_j^*(p) \|_2^2,    (6)
where L_c^* is the ground-truth PAF, S_j^* is the ground-truth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask avoids penalizing true positive predictions during training. The intermediate supervision at each stage addresses the vanishing gradient problem by periodically replenishing the gradient, and the overall objective is:
f = \sum_{t=1}^{T_P} f_L^t + \sum_{t=T_P+1}^{T_P+T_C} f_S^t.
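A minimal sketch of the spatially weighted L2 losses and the stage-summed objective is shown below; the tensor layouts (channel-first maps, one binary mask per image) are assumptions of the sketch.

```python
import torch


def masked_l2(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eqs. (5)/(6): L2 loss weighted by the binary mask W (W(p) = 0 where labels are missing).
    pred/target: (N, C, H, W); mask: (N, H, W), broadcast over the channel dimension."""
    return ((pred - target) ** 2 * mask.unsqueeze(1)).sum()


def multi_stage_loss(paf_preds, cmap_preds, paf_gt, cmap_gt, mask):
    """Overall objective: intermediate supervision summed over all PAF and confidence stages."""
    loss = sum(masked_l2(p, paf_gt, mask) for p in paf_preds)
    loss = loss + sum(masked_l2(c, cmap_gt, mask) for c in cmap_preds)
    return loss
```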
in an alternative embodiment, the third specific step of the fall detection algorithm is a confidence map of key point detection, in order to find f in formula (6) during the training process S Generating a group truth confidence map S by using the labeled two-dimensional key points * Each confidence map is a 2D representation of the beliefs that a particular body part can be located in any given pixel, generating a personal confidence map for each person k
Figure BDA0003675412110000166
Figure BDA0003675412110000167
Is the position of the group route (true value) of the body key point j of each person k in the image, with the position p at
Figure BDA0003675412110000168
The value of (1) can be defined as:
Figure BDA0003675412110000169
where σ controls the spread of the peak. The network predicted grouped truth confidence map is the aggregation of the single confidence map by the largest operator,
Figure BDA00036754121100001610
Regarding expression (7): ideally, if a person appears in the image, there should be a peak in each confidence map whenever the corresponding keypoint is visible; if multiple people are present in the image, there should be a peak for every visible keypoint j of every person k.
Regarding expression (8): the maximum of the individual confidence maps is taken rather than their average, so that the precision of nearby peaks is preserved. At test time, the confidence maps are predicted and body keypoint candidates are obtained by non-maximum suppression.
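The ground-truth confidence maps of expressions (7) and (8) can be generated as in the following sketch; the array layout for the annotated keypoints and the value of sigma are assumptions.

```python
import numpy as np


def gt_confidence_maps(keypoints: np.ndarray, height: int, width: int, sigma: float = 7.0):
    """Eqs. (7)-(8): one Gaussian peak per visible keypoint per person, aggregated with max.

    keypoints: (K, J, 3) array of (x, y, visible) for person k and joint j.
    Returns an array of shape (J, height, width)."""
    n_people, n_joints, _ = keypoints.shape
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((n_joints, height, width), dtype=np.float32)
    for j in range(n_joints):
        for k in range(n_people):
            x, y, visible = keypoints[k, j]
            if not visible:
                continue
            peak = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)
            maps[j] = np.maximum(maps[j], peak)   # max (not mean) keeps nearby peaks sharp
    return maps
```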
In an alternative embodiment, a specific step of the fall detection algorithm is the part affinity fields (PAFs) for keypoint association: given a set of detected body parts, each pair of body-part detections needs an associated confidence that they belong to the same person, and the PAFs store position and orientation information over the whole support region of each limb.
As shown in fig. 7, given a set of detected body parts (the red and blue dots in fig. 7), how can they be assembled into the full-body poses of an unknown number of people? A confidence measure is needed for each pair of body-part detections, i.e. that they belong to the same person. One possible measure is to detect an additional midpoint between each pair of parts on a limb and check its incidence between candidate part detections, as shown in fig. 7(b). However, when people crowd together, these midpoints are likely to support false associations (the green lines in fig. 7(b)). Such false associations arise mainly for two reasons:
(1) the representation encodes only the position of each limb, not its orientation;
(2) it reduces the support region of a limb to a single point.
The PAFs (part affinity fields) solve these problems: they preserve position and orientation information over the whole limb support region (as shown in fig. 7(c)). Each PAF is a 2D vector field for one limb; for each pixel in the region belonging to a particular limb, the 2D vector encodes the direction pointing from one part of the limb to the other, and each limb has a corresponding PAF connecting its two associated body parts.
Consider the simple limb shown in fig. 8 (an arm). Let x_{j1,k} and x_{j2,k} be the ground-truth positions of the body keypoints j1 and j2 of limb c of person k in the image. If a point p lies on the limb, the PAF value at p is the unit vector pointing from j1 to j2; for all other points the vector is zero.
To evaluate f_L in equation (5) during training, the ground-truth PAF at point p is defined as:
L_{c,k}^*(p) = v if p is on limb (c, k), and 0 otherwise,    (9)
where v = (x_{j2,k} - x_{j1,k}) / \| x_{j2,k} - x_{j1,k} \|_2 is the unit vector in the direction of the limb. The set of points on the limb is defined as the set of points within a distance threshold of the line segment, i.e. the points p for which
0 \le v \cdot (p - x_{j1,k}) \le l_{c,k} and | v_{\perp} \cdot (p - x_{j1,k}) | \le \sigma_l,    (10)
where the limb width \sigma_l is a distance in pixels, the limb length is l_{c,k} = \| x_{j2,k} - x_{j1,k} \|_2, and v_{\perp} is a vector perpendicular to v.
The ground-truth PAF averages the affinity fields of all people in the image:
L_c^*(p) = \frac{1}{n_c(p)} \sum_{k} L_{c,k}^*(p),
where n_c(p) is the number of non-zero vectors at point p over all k persons.
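A sketch of the ground-truth PAF of equations (9) and (10) for a single limb of a single person is given below; for several people the per-person fields would additionally be averaged over the non-zero vectors as stated above. The limb width value is an assumption.

```python
import numpy as np


def gt_paf_single_limb(p1, p2, height: int, width: int, limb_width: float = 8.0):
    """Eqs. (9)-(10): store the unit vector v from keypoint p1 to keypoint p2 at every pixel
    lying within `limb_width` of the segment; zero elsewhere. Returns (2, height, width)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    v = p2 - p1
    length = np.linalg.norm(v) + 1e-8
    v = v / length                                   # unit vector along the limb
    v_perp = np.array([-v[1], v[0]])                 # unit vector perpendicular to the limb
    ys, xs = np.mgrid[0:height, 0:width]
    offset = np.stack([xs - p1[0], ys - p1[1]], axis=-1)   # offset of every pixel from p1
    along = offset @ v                               # projection onto the limb axis
    across = np.abs(offset @ v_perp)                 # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf = np.zeros((2, height, width), dtype=np.float32)
    paf[0][on_limb], paf[1][on_limb] = v[0], v[1]
    return paf
```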
During testing, the association between candidate part detections is measured by computing a line integral over the corresponding PAF along the line segment connecting the candidate keypoint locations. In other words, the alignment of the predicted PAF with the candidate limb formed by connecting the detected body parts is measured. Specifically, for two candidate keypoint locations d_{j1} and d_{j2}, the predicted PAF L_c is sampled along the line segment to measure the confidence of the association between them:
E = \int_{u=0}^{1} L_c\big(p(u)\big) \cdot \frac{d_{j2} - d_{j1}}{\| d_{j2} - d_{j1} \|_2} \, du,    (11)
where p(u) interpolates the positions of the two keypoints d_{j1} and d_{j2}:
p(u) = (1 - u) \, d_{j1} + u \, d_{j2}.
In practice, the integral is approximated by sampling and summing uniformly spaced values of u.
In an alternative embodiment, specific step five of the fall detection algorithm selects a minimum number of edges to obtain a spanning-tree skeleton of the human pose, and a greedy relaxation method is provided that consistently produces high-quality matches. By supporting multi-camera, multi-person tracking and an LSTM (long short-term memory) neural network, two classes, "fall" or "no fall", are predicted, thereby enhancing the human pose estimation; five temporal and spatial features are extracted from the pose and processed by an LSTM classifier.
Non-maximum suppression is applied to the detection confidence maps to obtain a set of discrete candidate keypoint locations. For each part there may be several candidate keypoints, due to multiple people in the image or false positives. These candidate keypoints define a large set of possible limbs, and each candidate limb is scored with the line integral over the PAF defined in equation (11). Finding the optimal parse corresponds to a K-dimensional NP-hard matching problem; the invention adopts a greedy relaxation that consistently produces high-quality matches. The presumed reason is that, since the PAF network has a large receptive field, the pairwise association scores implicitly encode the global context.
Formally, the set of body-part detection candidates D_J for multiple people is first obtained:
D_J = \{ d_j^m : j \in \{1, \dots, J\}, \, m \in \{1, \dots, N_j\} \},
where N_j is the number of candidates for keypoint j and d_j^m ∈ R^2 is the position of the m-th detection candidate of keypoint j.
These candidate detection keypoints still need to be associated with the other body parts of the same person, i.e. the part detection pairs that form actually connected limbs must be found. A variable z_{j_1 j_2}^{mn} ∈ {0, 1} is defined to indicate whether the two detection candidates d_{j_1}^m and d_{j_2}^n are connected, and the goal is to find the optimal assignment over the set of all possible connections:
Z = \{ z_{j_1 j_2}^{mn} : j_1, j_2 \in \{1, \dots, J\}, \, m \in \{1, \dots, N_{j_1}\}, \, n \in \{1, \dots, N_{j_2}\} \}.    (12)
if a single keypoint j of the c-th limb is considered 1 And j 2 Finding the optimal correlation problem is simplified to the maximum weight bipartite graph matching problem. In such a graph matching problem, the nodes of the graph are body detection key points
Figure BDA0003675412110000197
And
Figure BDA0003675412110000198
edges are all possible connections between detection candidate pairs. Further, each edge is weighted by the formula (11) part affinity aggregate. The objective in selecting a subset of edges in a bipartite graph, in such a way that no two edges share a node, is to find the most weighted match for the selected edges,
Figure BDA0003675412110000199
Figure BDA0003675412110000201
Figure BDA0003675412110000202
wherein E is c Represents the total weight of matches from limb type c,
Figure BDA0003675412110000203
is of limb type c
Figure BDA0003675412110000204
A subset of (1), E mn Is the key point defined in equation (11)
Figure BDA0003675412110000205
And
Figure BDA0003675412110000206
part affinity in between. Equations (14) and (15) force that no two edges share a node, i.e. that no two limbs of the same type share the same keypoint. The hungarian algorithm can be used to obtain the optimal match. The matching problem is then further decomposed into a set of two matching sub-problems and matches in adjacent tree nodes are independently determined.
When it comes to finding the whole-body poses of multiple people, determining Z is a K-dimensional matching problem. This problem is NP-hard, and relaxations are needed. In this work two relaxations are added to the optimization. First, instead of using the complete graph, a minimum number of edges is selected to obtain a spanning-tree skeleton of the human pose; later experimental results show that greedy minimal inference approximates the global solution well at a fraction of the computational cost. The reason is that the relations between adjacent tree nodes are modeled explicitly by the PAFs, while the relations between non-adjacent tree nodes are modeled implicitly by the CNN: because the CNN is trained with a large receptive field, the PAFs of non-adjacent tree nodes also influence the predicted PAF.
With these two relaxations, the optimization decomposes simply into:
\max_{Z} E = \sum_{c=1}^{C} \max_{Z_c} E_c.    (16)
Therefore, the limb connection candidates are obtained independently for each limb type using equations (13), (14) and (15). With all limb connection candidates, connections sharing the same keypoint detection candidate are assembled into multi-person whole-body poses. This optimization scheme over the tree structure is orders of magnitude faster than optimization over the fully connected graph. The current model also incorporates redundant PAF connections (e.g. between ears and shoulders), which improve the accuracy of pose detection in crowded images. To handle the redundant connections, the multi-person parsing algorithm is slightly adjusted: whereas the original method started from a root component, the algorithm sorts all possible pair connections by their PAF scores; when a connection tries to connect two body parts that have already been assigned to different people, the algorithm recognizes that this would contradict a PAF connection with higher confidence and ignores the current connection.
As shown in fig. 9, fall monitoring is proposed based on OpenPose human action recognition: human pose estimation is enhanced by supporting multi-camera, multi-person tracking, and an LSTM (long short-term memory) neural network predicts the two classes "fall" or "no fall". From the pose, five temporal and spatial features are extracted and processed by the LSTM classifier.
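The fall/no-fall classifier can be sketched as a small LSTM over a window of per-frame pose features, as below; the window length, hidden size and the way the features are obtained are assumptions, the text only stating that five temporal and spatial features are extracted from the pose.

```python
import torch
import torch.nn as nn


class FallClassifier(nn.Module):
    """Sketch: a window of per-frame pose features -> LSTM -> logits for 'fall' / 'no fall'."""

    def __init__(self, n_features: int = 5, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, n_features)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])                          # logits over the two classes


if __name__ == "__main__":
    clf = FallClassifier()
    window = torch.randn(1, 30, 5)                         # 30 frames of 5 pose features
    print(clf(window).softmax(dim=-1))
```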
The above description of the present invention is intended to be illustrative. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A child monitoring method oriented to a smart television, characterized by comprising the following steps:
the smart television automatically monitors the child through an expression recognition algorithm and a fall detection algorithm;
the flow of the expression recognition algorithm is as follows:
step one: a backbone network extracts basic features with a convolutional neural network;
step two: a feature decomposition network decomposes the basic features into a set of facial latent features;
step three: a feature reconstruction network learns the relation weights within each latent feature and the relation weights between features, and then constructs the overall expression feature;
step four: an expression prediction network predicts the expression label corresponding to the input picture;
the flow of the fall detection algorithm is as follows:
step one: the whole image is taken as the input of a convolutional neural network for joint prediction;
step two: confidence maps for human body part detection are predicted;
step three: human body parts are associated with PAFs;
step four: the parsing step performs a set of bipartite matchings to associate body part candidates;
step five: the candidates are assembled into full-body poses for all people in the image.
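Taken together, claim 1 describes two pipelines whose outputs drive one monitoring decision. The sketch below is a hypothetical illustration of how a smart-TV application might combine them; the function names `recognize_expression` and `detect_fall` and the alert policy are placeholders, not part of the claim.

```python
# Illustrative glue code: combine the two algorithm outputs into one decision.
# `recognize_expression` and `detect_fall` stand in for the expression
# recognition and fall detection pipelines of claim 1; they are hypothetical.

NEGATIVE_EXPRESSIONS = {"fear", "sadness", "anger"}   # assumed label set

def monitor_frame(frame, recognize_expression, detect_fall):
    expression = recognize_expression(frame)   # steps 1-4 of the first pipeline
    fell = detect_fall(frame)                  # steps 1-5 of the second pipeline
    if fell:
        return "alert: fall detected"
    if expression in NEGATIVE_EXPRESSIONS:
        return f"warn: negative expression ({expression})"
    return "ok"

# Hypothetical usage with stub models.
print(monitor_frame(None, lambda f: "happiness", lambda f: False))  # ok
print(monitor_frame(None, lambda f: "fear", lambda f: True))        # alert
```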
2. The smart-television-oriented child monitoring method according to claim 1, characterized in that step two of the expression recognition algorithm specifically comprises: for a given ith face image, the basic feature extracted by the backbone network is defined as x_i ∈ R^P, where P denotes the dimensionality of the basic feature; the feature decomposition network (FDN) decomposes the basic feature into a series of facial-action-aware latent features; let L_i = [l_{i,1}, l_{i,2}, …, l_{i,M}] ∈ R^{D×M} denote the latent feature matrix, where l_{i,j} denotes the jth latent feature of the ith face image, D denotes the dimension of each feature, and M denotes the number of features;
the method adopts a mode of adding an activation function to a full connection layer to extract corresponding potential features, and the specific expression is as follows:
Figure FDA0003675412100000021
wherein
Figure FDA0003675412100000022
Representative is the weight parameter, σ, of the fully-connected layer 1 Representative is the ReLU activation function;
the compactless Loss learns the centers of the same potential features and calculates the distance between these potential features and their corresponding centers, as follows:
Figure FDA0003675412100000023
where N denotes the number of pictures in a small batch, c j Representing the center of the category j potential feature.
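A minimal PyTorch sketch of the decomposition step and Compactness Loss of claim 2 follows; the dimensions P, D, M, the per-feature FC branches and the mean normalization are assumptions made for this sketch rather than values fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureDecompositionNet(nn.Module):
    """Decompose a basic feature x_i (dim P) into M latent features (dim D)."""
    def __init__(self, P=512, D=64, M=9):
        super().__init__()
        # One FC + ReLU branch per latent feature, as in expression (1).
        self.branches = nn.ModuleList([nn.Linear(P, D) for _ in range(M)])
        # Learnable centers c_j for the Compactness Loss.
        self.centers = nn.Parameter(torch.zeros(M, D))

    def forward(self, x):                       # x: (N, P)
        latents = torch.stack([torch.relu(b(x)) for b in self.branches], dim=1)
        return latents                          # (N, M, D)

    def compactness_loss(self, latents):
        # Half mean squared distance of each latent feature to its center c_j.
        diff = latents - self.centers.unsqueeze(0)      # (N, M, D)
        return 0.5 * diff.pow(2).sum(dim=-1).mean()

fdn = FeatureDecompositionNet()
l = fdn(torch.randn(8, 512))
print(l.shape, fdn.compactness_loss(l).item())
```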
3. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the expression recognition algorithm specifically comprises: the feature reconstruction network for different expression features comprises an intra-feature relation modeling module, which is composed of a plurality of blocks that model the relations between the latent features and their feature elements within a face image;
each block is composed of a fully-connected layer and a nonlinear activation function, with the specific expression:
α_{i,j} = σ_2(W_{r,j}^T l_{i,j}), (3)
where α_{i,j} denotes the relation weight of the jth class of latent features of the ith face image, W_{r,j} denotes the parameter of the fully-connected layer, and σ_2 denotes the nonlinear activation function;
from expression (3), the L1 norm of α_{i,j} is taken as the intra-feature relation weight (Intra-W) that determines the importance of the jth class of latent features, with the specific expression:
α_{i,j} = ||α_{i,j}||_1  for j = 1, 2, …, M, (4)
the Distribution Loss learns the centers of the various expression classes and computes the distances between Intra-W vectors from the same class and the corresponding centers; assuming the ith face image belongs to the k_i-th expression class, the Distribution Loss is expressed as follows:
L_D = (1/(2N)) Σ_{i=1}^{N} ||w_i − u_{k_i}||_2^2, (5)
where w_i = [α_{i,1}, α_{i,2}, …, α_{i,M}]^T denotes the Intra-W vector of the ith face picture and u_{k_i} denotes the class center of the k_i-th expression;
balance Loss balances the distribution of Intra-W vectors, with the expression:
Figure FDA0003675412100000032
Figure FDA0003675412100000033
represented is a vector of the mean of Intra-W in a block sample,
Figure FDA0003675412100000034
representing a uniformly distributed weight vector;
after computing the Intra-W of each latent feature, the weights are multiplied by the corresponding features to obtain the intra-aware features of the ith picture, with the specific expression:
f_{i,j} = α_{i,j} l_{i,j}  for j = 1, 2, …, M, (7)
where f_{i,j} denotes the jth intra-aware feature of the ith picture.
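The intra-feature relation modeling of claim 3 can be sketched as follows; the choice of a sigmoid for σ_2, the simple quadratic forms used for the Distribution and Balance losses, and the dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class IntraRelationModule(nn.Module):
    """Intra-feature relation modeling: one relation weight per latent feature."""
    def __init__(self, D=64, M=9, num_classes=7):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(D, 1) for _ in range(M)])
        self.sigma2 = nn.Sigmoid()                     # assumed nonlinearity
        # Per-expression-class centers u_k of the Intra-W vectors.
        self.class_centers = nn.Parameter(torch.full((num_classes, M), 1.0 / M))

    def forward(self, latents):                        # latents: (N, M, D)
        # alpha_{i,j} from expressions (3)-(4); sigmoid output is non-negative,
        # so it already equals its L1 norm.
        alphas = torch.cat(
            [self.sigma2(blk(latents[:, j])) for j, blk in enumerate(self.blocks)],
            dim=1)                                     # (N, M) Intra-W vectors
        intra_aware = alphas.unsqueeze(-1) * latents   # expression (7)
        return alphas, intra_aware

    def distribution_loss(self, alphas, labels):
        return 0.5 * (alphas - self.class_centers[labels]).pow(2).sum(1).mean()

    def balance_loss(self, alphas):
        uniform = torch.full_like(alphas[0], 1.0 / alphas.shape[1])
        return 0.5 * (alphas.mean(0) - uniform).pow(2).sum()

m = IntraRelationModule()
a, f = m(torch.randn(8, 9, 64))
labels = torch.randint(0, 7, (8,))
print(a.shape, f.shape, m.balance_loss(a).item(), m.distribution_loss(a, labels).item())
```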
4. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the expression recognition algorithm further comprises: the feature reconstruction network for different expression features comprises an inter-feature relation modeling module, which learns a set of relation information and estimates the inter-feature relation weights (Inter-Ws) among the information; first, f_{i,j} is fed into an information network for feature encoding, where the information network consists of a fully-connected layer and a linear activation function, with the specific expression:
g_{i,j} = σ_1(W_{g,j}^T f_{i,j}), (8)
where W_{g,j} denotes the weight parameter of the fully-connected layer, σ_1 denotes the linear activation function, and g_{i,j} denotes the jth relation information of the ith picture;
the corresponding relation information is concatenated into a relation information matrix G_i = [g_{i,1}, g_{i,2}, …, g_{i,M}], whose columns serve as the nodes of a graph G(G_i, E), where G is a complete undirected graph and E represents the relations between different pieces of relation information; ω_i(j, m) is the Inter-W, which represents the importance of the relation between node g_{i,j} and node g_{i,m}; the specific calculation formula is:
ω_i(j, m) = σ_3(S(g_{i,j}, g_{i,m})), (9)
where g_{i,j} and g_{i,m} respectively denote the jth and mth relation information of the ith picture; S is a similarity function used to evaluate the similarity between g_{i,j} and g_{i,m}, for which the Euclidean distance is adopted; since the results of S are all non-negative, the hyperbolic tangent function σ_3 is further adopted to normalize the distance values to the interval [0, 1);
the Inter-W matrix W_i = {ω_i(j, m)} can then be obtained according to equation (9).
The jth inter-aware feature of the ith picture can be expressed as:
f̂_{i,j} = Σ_{m=1}^{M} ω_i(j, m) g_{i,m}, (10)
the jth inter-aware feature is combined with the jth intra-aware feature to obtain the jth fused aware feature of the ith face image, with the specific expression:
f̃_{i,j} = f_{i,j} + δ f̂_{i,j}, (11)
where δ is a regularization coefficient used to balance the inter-aware features with the intra-aware features;
the series of fused aware features is accumulated to obtain the final expression feature, expressed as:
y_i = Σ_{j=1}^{M} f̃_{i,j}, (12)
where y_i denotes the expression feature of the ith face picture.
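A sketch of the inter-feature relation modeling of claim 4, under the assumption that expression (10) aggregates the relation information with the tanh-normalized pairwise distances; the dimensions and the value of δ are illustrative.

```python
import torch
import torch.nn as nn

class InterRelationModule(nn.Module):
    """Inter-feature relation modeling over the M intra-aware features."""
    def __init__(self, D=64, M=9, delta=0.1):
        super().__init__()
        self.info_net = nn.Linear(D, D)          # FC + linear activation, expression (8)
        self.delta = delta                        # regularization coefficient

    def forward(self, intra_aware):               # (N, M, D)
        g = self.info_net(intra_aware)            # relation information g_{i,j}
        # omega_i(j, m) = tanh(Euclidean distance), expression (9).
        dist = torch.cdist(g, g)                  # (N, M, M) pairwise distances
        omega = torch.tanh(dist)                  # normalized to [0, 1)
        inter_aware = torch.bmm(omega, g)         # aggregation, expression (10)
        fused = intra_aware + self.delta * inter_aware     # expression (11)
        return fused.sum(dim=1)                   # expression feature y_i, (12)

m = InterRelationModule()
y = m(torch.randn(8, 9, 64))
print(y.shape)                                    # torch.Size([8, 64])
```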
5. The smart-television-oriented child monitoring method according to claim 1, characterized in that the backbone network, the feature decomposition network, the feature reconstruction network and the expression prediction network are trained jointly in an end-to-end manner; the following joint loss function is minimized over the whole network, expressed as:
L = L_cls + λ_1 L_C + λ_2 L_B + λ_3 L_D,
where L_cls, L_C, L_B and L_D respectively denote the classification loss, the Compactness Loss, the Balance Loss and the Distribution Loss; the cross-entropy loss is used as the classification loss, and λ_1, λ_2, λ_3 are regularization coefficients.
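A short sketch of the joint objective of claim 5; the λ values are placeholders and the auxiliary losses are assumed to be scalars produced elsewhere.

```python
import torch

def joint_loss(logits, labels, l_compact, l_balance, l_distribution,
               lambda1=1.0, lambda2=1.0, lambda3=1.0):
    """Cross-entropy classification loss plus the three auxiliary losses,
    weighted by regularization coefficients (placeholder values)."""
    l_cls = torch.nn.functional.cross_entropy(logits, labels)
    return l_cls + lambda1 * l_compact + lambda2 * l_balance + lambda3 * l_distribution

# Hypothetical usage with dummy tensors for a 7-class expression problem.
logits, labels = torch.randn(8, 7), torch.randint(0, 7, (8,))
aux = torch.tensor(0.1)
print(joint_loss(logits, labels, aux, aux, aux).item())
```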
6. The smart-television-oriented child monitoring method according to claim 1, characterized in that step one of the fall detection algorithm specifically comprises: the convolutional neural network architecture iteratively predicts part affinity fields that encode part-to-part association, together with the keypoint detection confidence maps.
7. The smart-television-oriented child monitoring method according to claim 1, characterized in that step two of the fall detection algorithm specifically comprises: the image is first analyzed by a convolutional neural network to produce a set of feature maps F that is input to the first stage; at this stage, the network produces a set of PAFs, L^1 = φ^1(F), where φ^1 refers to the CNN used for inference at stage 1; in each subsequent stage, the prediction from the previous stage and the original image features F are concatenated and used to refine the prediction,
L^t = φ^t(F, L^{t−1}),  ∀ 2 ≤ t ≤ T_P,
where φ^t refers to the CNN used for inference at stage t and T_P is the total number of PAF stages; after T_P iterations, the process is repeated for the confidence map detection, starting from the most recent PAF prediction,
S^{T_P} = ρ^{T_P}(F, L^{T_P}),
S^t = ρ^t(F, L^{T_P}, S^{t−1}),  ∀ T_P < t ≤ T_P + T_C,
where ρ^t refers to the CNN used for inference at stage t and T_C is the total number of confidence map stages;
the loss functions are spatially weighted by applying the loss functions at the end of each phase, in particular phase t i And stage t of the PAF branch of k The penalty function of the confidence mapping branch of (1) is as follows:
Figure FDA0003675412100000064
Figure FDA0003675412100000065
where L_c^* is the ground-truth part affinity field, S_j^* is the ground-truth keypoint confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p; the mask is used to avoid penalizing true-positive predictions during training; the intermediate supervision at each stage addresses the vanishing-gradient problem by periodically replenishing the gradient, with the overall objective as follows:
f = Σ_{t=1}^{T_P} f_L^t + Σ_{t=T_P+1}^{T_P+T_C} f_S^t.
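The staged prediction and spatially weighted losses of claim 7 can be mirrored by the toy PyTorch sketch below; each stage is reduced to a single convolution and the channel counts are illustrative, so this reflects only the data flow, not the full architecture.

```python
import torch
import torch.nn as nn

class StagedPoseNet(nn.Module):
    """T_P PAF stages followed by T_C confidence-map stages (toy version)."""
    def __init__(self, feat_ch=32, paf_ch=38, cmap_ch=19, T_P=2, T_C=1):
        super().__init__()
        self.paf_stages = nn.ModuleList(
            [nn.Conv2d(feat_ch if t == 0 else feat_ch + paf_ch, paf_ch, 3, padding=1)
             for t in range(T_P)])
        self.cmap_stages = nn.ModuleList(
            [nn.Conv2d(feat_ch + paf_ch if t == 0 else feat_ch + paf_ch + cmap_ch,
                       cmap_ch, 3, padding=1)
             for t in range(T_C)])

    def forward(self, F):
        L = self.paf_stages[0](F)
        for stage in self.paf_stages[1:]:
            L = stage(torch.cat([F, L], dim=1))             # refine PAFs
        S = self.cmap_stages[0](torch.cat([F, L], dim=1))    # start from last PAF
        for stage in self.cmap_stages[1:]:
            S = stage(torch.cat([F, L, S], dim=1))           # refine confidences
        return L, S

def masked_l2(pred, target, mask):
    """Spatially weighted L2 loss; mask(p) = 0 where annotations are missing."""
    return (mask * (pred - target).pow(2)).sum()

net = StagedPoseNet()
F = torch.randn(1, 32, 46, 46)
L, S = net(F)
print(L.shape, S.shape)   # torch.Size([1, 38, 46, 46]) torch.Size([1, 19, 46, 46])
```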
8. The smart-television-oriented child monitoring method according to claim 1, characterized in that step three of the fall detection algorithm specifically comprises: in order to evaluate f_S in equation (6) during training, ground-truth confidence maps S* are generated from the annotated two-dimensional keypoints; each confidence map is a 2D representation of the belief that a particular body part is located at any given pixel; an individual confidence map S*_{j,k} is first generated for each person k: let x_{j,k} ∈ R^2 be the ground-truth position of body part j of person k in the image; the value at location p ∈ R^2 in S*_{j,k} is defined as:
S*_{j,k}(p) = exp(−||p − x_{j,k}||_2^2 / σ^2),
where σ controls the spread of the peak; the ground-truth confidence map to be predicted by the network is an aggregation of the individual confidence maps via a max operator,
S*_j(p) = max_k S*_{j,k}(p).
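A NumPy sketch of the ground-truth confidence map construction of claim 8: one Gaussian peak per annotated keypoint, aggregated over people with a max operator; the grid size and σ are illustrative.

```python
import numpy as np

def gt_confidence_map(keypoints, height, width, sigma=2.0):
    """keypoints: list of (x, y) ground-truth positions of one body part j,
    one entry per person k. Returns S*_j as an (height, width) array."""
    ys, xs = np.mgrid[0:height, 0:width]
    conf = np.zeros((height, width), dtype=np.float32)
    for (x, y) in keypoints:
        # Individual map S*_{j,k}(p) = exp(-||p - x_{j,k}||^2 / sigma^2).
        person_map = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)
        # Aggregate with the max operator so nearby peaks stay distinct.
        conf = np.maximum(conf, person_map)
    return conf

# Hypothetical usage: two people, one body part, on a 46x46 grid.
S = gt_confidence_map([(10, 12), (30, 25)], 46, 46)
print(S.shape, float(S.max()))    # (46, 46) 1.0
```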
9. The smart-television-oriented child monitoring method according to claim 1, characterized in that step four of the fall detection algorithm specifically comprises: PAFs are used for keypoint association; given a set of detected body parts, each pair of body part detections has an associated confidence that they belong to the same person, and the PAFs preserve both position and orientation information across the whole support region of the limb.
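The association confidence mentioned in claim 9 is typically obtained by sampling the predicted PAF along the segment between two part candidates and integrating its alignment with the limb direction; the NumPy sketch below illustrates that computation under this assumption, with the sampling count and field layout chosen arbitrarily.

```python
import numpy as np

def paf_association_score(paf, p1, p2, num_samples=10):
    """paf: (2, H, W) x/y components of one limb's part affinity field.
    p1, p2: (x, y) candidate positions of the two parts. Returns the mean
    dot product between the sampled field and the unit limb direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-8:
        return 0.0
    direction /= norm
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        # Sample along the segment from p1 to p2 and project onto the limb axis.
        x, y = (p1 + u * direction * norm).round().astype(int)
        score += paf[0, y, x] * direction[0] + paf[1, y, x] * direction[1]
    return score / num_samples

# Hypothetical usage: a field pointing along +x scores high for a horizontal limb.
paf = np.zeros((2, 20, 20)); paf[0] = 1.0
print(paf_association_score(paf, (2, 10), (15, 10)))   # ~1.0
```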
10. The smart-television-oriented child monitoring method according to claim 1, characterized in that step five of the fall detection algorithm specifically comprises: a minimal number of edges is selected to obtain a spanning tree skeleton of the human body pose while still producing high-quality matches; human pose estimation is enhanced by supporting multi-camera and multi-person tracking; an LSTM neural network predicts the two classes "fall" or "no fall"; five temporal and spatial features are extracted from the pose and processed by the LSTM classifier.
CN202210623199.8A 2022-06-01 2022-06-01 Smart television-oriented child monitoring method Pending CN115240127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210623199.8A CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210623199.8A CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Publications (1)

Publication Number Publication Date
CN115240127A true CN115240127A (en) 2022-10-25

Family

ID=83670331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210623199.8A Pending CN115240127A (en) 2022-06-01 2022-06-01 Smart television-oriented child monitoring method

Country Status (1)

Country Link
CN (1) CN115240127A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453027A (en) * 2023-06-12 2023-07-18 深圳市玩瞳科技有限公司 AI identification management method for educational robot
CN116453027B (en) * 2023-06-12 2023-08-22 深圳市玩瞳科技有限公司 AI identification management method for educational robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination