CN113420681A - Behavior recognition and model training method, apparatus, storage medium, and program product

Behavior recognition and model training method, apparatus, storage medium, and program product

Info

Publication number
CN113420681A
Authority
CN
China
Prior art keywords
behavior
image
image features
data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110721126.8A
Other languages
Chinese (zh)
Inventor
胡韬
苏翔博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110721126.8A priority Critical patent/CN113420681A/en
Publication of CN113420681A publication Critical patent/CN113420681A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present disclosure provides a behavior recognition and model training method, device, storage medium, and program product, which relate to the field of artificial intelligence, in particular to computer vision and deep learning technologies, and are particularly applicable to smart city and intelligent traffic scenarios. The specific implementation scheme is as follows: the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. An image to be processed is input into the feature extraction network to extract image features of the image; the image features and the position information codes corresponding to the image features are input into the classification network based on the multi-branch self-attention mechanism, and the behavior information of the human body contained in the image is identified through the classification network. This handles the long-term dependence problem well, makes the recognition of interaction behavior between a human body and an object more accurate, and improves the accuracy of human behavior recognition.

Description

Behavior recognition and model training method, apparatus, storage medium, and program product
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning techniques that are particularly applicable to smart city and intelligent traffic scenarios, and specifically to a method, device, storage medium, and program product for behavior recognition and model training.
Background
Most existing human behavior recognition methods use a classification model based on a Convolutional Neural Network (CNN) framework to extract image features and perform human behavior classification and recognition. Classification models based on CNN frameworks handle short-term dependencies well but perform poorly on long-term dependencies; when applied to the image domain, they capture features of smaller local regions well but do so poorly for larger regions.
Consider the interactive behaviors between a person and an object (which are, in essence, human behaviors), for example, interactions between a person and a soccer ball on a field (the person kicking the ball, holding the ball, etc.) or between a person and a vehicle on a road (the person riding a motorcycle, etc.). When the distance between the person and the object is relatively large, a classification model based on a CNN framework recognizes such person-object interactions poorly, with low accuracy.
Disclosure of Invention
The present disclosure provides a method, apparatus, storage medium, and program product for behavior recognition and model training.
According to a first aspect of the present disclosure, there is provided a method of behavior recognition, comprising:
inputting an image to be processed into a feature extraction network of a behavior recognition model, and extracting image features of the image;
and inputting the image features and the position information codes corresponding to the image features into a classification network of the behavior recognition model based on a multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network.
According to a second aspect of the present disclosure, there is provided a method of behavior recognition model training, wherein the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism, the method including:
inputting a sample image into the feature extraction network, and extracting image features of the sample image;
inputting the image features and the position information codes corresponding to the image features into the classification network based on the multi-branch self-attention mechanism, and identifying the behavior information of the human body contained in the sample image through the classification network to obtain prediction data of the behavior information;
and calculating a loss value according to the real data and the predicted data of the behavior information, and updating the parameters of the behavior recognition model according to the loss value.
According to a third aspect of the present disclosure, there is provided an apparatus of behavior recognition, comprising:
the characteristic extraction module is used for inputting the image to be processed into a characteristic extraction network of the behavior recognition model and extracting the image characteristics of the image;
and the behavior recognition module is used for inputting the image features and the position information codes corresponding to the image features into a classification network of the behavior recognition model based on a multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network.
According to a fourth aspect of the present disclosure, there is provided an apparatus for behavior recognition model training, wherein the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism, the apparatus comprising:
the characteristic extraction module is used for inputting a sample image into the characteristic extraction network and extracting the image characteristics of the sample image;
the behavior recognition module is used for inputting the image features and the position information codes corresponding to the image features into the classification network based on the multi-branch self-attention mechanism, and identifying the behavior information of the human body contained in the sample image through the classification network to obtain prediction data of the behavior information;
the loss calculation module is used for calculating a loss value according to the real data and the predicted data of the behavior information;
and the parameter updating module is used for updating the parameters of the behavior recognition model according to the loss value.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the above aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of the above aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of any of the above aspects.
According to the technology of the present disclosure, the accuracy of human behavior recognition is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary diagram of the general structure of a behavior recognition model provided by the present disclosure;
FIG. 2 is a flow chart of a method of behavior recognition provided by a first embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of behavior recognition provided by a second embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for training a behavior recognition model according to a third embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for training a behavior recognition model according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an apparatus for behavior recognition provided by a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an apparatus for behavior recognition provided by a sixth embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an apparatus for behavior recognition model training according to a seventh embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an apparatus for training a behavior recognition model according to an eighth embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development and application of artificial intelligence technologies, strong demand for intelligent and automated techniques has emerged in more and more fields, monitoring and security being one of them. In the field of monitoring and security, human behavior analysis and recognition is a key technology and is widely applied. For example, an organization may need to analyze and identify pedestrian behavior information in real time from surveillance video to maintain public safety; some departments need to identify the dressing and behaviors of workers based on human behavior recognition technology and warn workers who do not meet dressing standards or who exhibit irregular behaviors (such as smoking or playing with mobile phones). However, most current human behavior analysis and recognition schemes either have low accuracy, or the model is custom-designed for a particular scene, so the model has low reusability.
Most existing human behavior recognition methods use a classification model based on a Convolutional Neural Network (CNN) framework to extract image features and perform human behavior classification and recognition. Classification models based on CNN frameworks handle short-term dependencies well but perform poorly on long-term dependencies; when applied to the image domain, they capture features of smaller local regions well but do so poorly for larger regions.
Consider the interactive behaviors between a person and an object (which are, in essence, human behaviors), for example, interactions between a person and a soccer ball on a field (the person kicking the ball, holding the ball, etc.) or between a person and a vehicle on a road (the person riding a motorcycle, etc.). When the distance between the person and the object is relatively large, a classification model based on a CNN framework recognizes such person-object interactions poorly, with low accuracy.
The present disclosure provides a behavior recognition and model training method, device, storage medium, and program product, which relate to the field of artificial intelligence, and in particular to computer vision and deep learning technologies, and are particularly applicable to smart cities and smart traffic scenes to improve the accuracy of human behavior recognition.
The general structure of the behavior recognition model provided by the present disclosure is shown in FIG. 1 and includes: a feature extraction network 10, a position encoding module 20, and a classification network 30 based on a multi-branch self-attention mechanism. The feature extraction network 10 is the backbone network of the behavior recognition model and may be implemented by a Convolutional Neural Network (CNN) for extracting image features of an input image. The classification network 30 based on the multi-branch self-attention mechanism may include: an information encoder, an information decoder, and a feed-forward neural network (FFN). The position encoding module is used for determining the position information code matching the size of the extracted image features. The classification network based on the multi-branch self-attention mechanism is used for identifying the behavior information of the human body contained in the input image according to the image features and the position information code. Identifying the behavior information of the human body contained in the input image with a model based on the multi-branch self-attention mechanism handles the long-term dependence problem well, makes the recognition of interaction behavior between a person and an object more accurate, and improves the accuracy of human behavior recognition.
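To make this structure concrete, the following is a minimal PyTorch sketch of a model of this shape: a CNN backbone, a 1 x 1 channel reduction to a preset dimension, a fixed position encoding added to the features, and a Transformer-style encoder-decoder with feed-forward heads producing the behavior-information triplet. All layer sizes, counts, and names are illustrative assumptions, not the exact configuration of the disclosed model.
```python
import torch
import torch.nn as nn
import torchvision

class BehaviorRecognitionModel(nn.Module):
    def __init__(self, hidden_dim=256, num_heads=8, num_queries=100,
                 num_object_classes=80, num_behavior_classes=20):
        super().__init__()
        # Feature extraction network (backbone): ResNet-50 without pooling/FC head.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        # 1 x 1 convolution reduces the features to the preset dimension.
        self.reduce = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Classification network based on a multi-branch (multi-head) self-attention
        # mechanism: information encoder + information decoder.
        self.transformer = nn.Transformer(d_model=hidden_dim, nhead=num_heads,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Feed-forward heads producing the behavior-information triplet:
        # human box, object box, object category, behavior category.
        self.human_box = nn.Linear(hidden_dim, 4)
        self.object_box = nn.Linear(hidden_dim, 4)
        self.object_cls = nn.Linear(hidden_dim, num_object_classes + 1)      # +1: no object
        self.behavior_cls = nn.Linear(hidden_dim, num_behavior_classes + 1)  # +1: no behavior

    def forward(self, images, pos_encoding):
        # images: (B, 3, H, W); pos_encoding: (hidden_dim, H/32, W/32), fixed per size.
        feat = self.reduce(self.backbone(images))
        b, d, h, w = feat.shape
        src = (feat + pos_encoding).flatten(2).permute(0, 2, 1)       # (B, H*W, D) sequence
        tgt = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)  # (B, num_queries, D)
        hidden = self.transformer(src, tgt)                            # (B, num_queries, D)
        return (self.human_box(hidden).sigmoid(), self.object_box(hidden).sigmoid(),
                self.object_cls(hidden), self.behavior_cls(hidden))
```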
Fig. 2 is a flowchart of a behavior recognition method according to a first embodiment of the present disclosure. The method provided in this embodiment may be performed by an electronic device for behavior recognition, which may be a terminal device or a server, for example, a terminal device or a monitoring server of a monitoring system. In other embodiments, the electronic device may also be implemented in other manners, which is not specifically limited herein.
As shown in fig. 2, the method comprises the following specific steps:
step S201, inputting the image to be processed into a feature extraction network of the behavior recognition model, and extracting the image features of the image.
In this embodiment, the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The feature extraction network is used for extracting image features of the image. The classification network based on the multi-branch self-attention mechanism is used for identifying the behavior information of the human body contained in the image based on the image features and the position information codes corresponding to the image features.
The image to be processed may be a received image or an image acquired in real time by an image acquisition device, and the image to be processed includes one or more human bodies. The image to be processed may or may not contain one or more objects.
When the method is applied to different application scenes, the manner of acquiring the image to be processed may be different, and may be set and adaptively adjusted according to the specific application scene, which is not specifically limited in this embodiment.
After the image to be processed is acquired, the image to be processed is input into a behavior recognition model, and image features of the image are extracted through a feature extraction network of the behavior recognition model.
In addition, before the image features are input into the classification network based on the multi-branch self-attention mechanism, the position information code corresponding to the image features needs to be acquired.
The position information code corresponding to the image features is used for recording the position information of each pixel.
Step S202, inputting the image features and the position information codes corresponding to the image features into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network.
In the step, the image characteristics and the position information codes corresponding to the image characteristics are input into a classification network, the classification network identifies human behaviors, and an identification result is output.
Wherein, the recognition result comprises the behavior information of the human body contained in the image. The behavior information of the human body is used for describing the interactive behavior of the human body and the object, and the behavior information comprises the position information of the human body, the position information and the category of the object and the behavior category of the interactive behavior.
In this embodiment, the classification network in the behavior recognition model is a neural network model based on a multi-branch self-attention mechanism (Transformer). In addition, the behavior recognition model is trained in advance and may be obtained by training with the method provided in the third embodiment or the fourth embodiment.
In this embodiment, the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The image to be processed is input into the feature extraction network to extract image features of the image; the image features and the position information codes corresponding to the image features are input into the classification network based on the multi-branch self-attention mechanism, and the behavior information of the human body contained in the image is identified through the classification network. This handles the long-term dependence problem well, makes the recognition of interaction behavior between a person and an object more accurate, and improves the accuracy of human behavior recognition.
Fig. 3 is a flowchart of a behavior recognition method according to a second embodiment of the present disclosure. On the basis of the first embodiment, in this embodiment, after the image features and the position information codes are input into the classification network based on the multi-branch self-attention mechanism and the behavior information of the human body contained in the image is identified through the classification network, if it is determined, according to the behavior information of the human body contained in the image, that the behavior category in any piece of behavior information belongs to a preset behavior set, the behavior information whose behavior category belongs to the preset behavior set is processed correspondingly.
As shown in fig. 3, the method comprises the following specific steps:
and step S300, acquiring an image to be processed.
The image to be processed contains one or more human bodies, and may or may not contain one or more objects.
In this step, the image to be processed may be received, or an image including a person (human body) in a specified place may be acquired in real time by the image acquisition device as the image to be processed.
When the method is applied to different application scenes, the manner of acquiring the image to be processed may be different, and may be set and adaptively adjusted according to the specific application scene, which is not specifically limited in this embodiment.
Step S301, inputting the image to be processed into a feature extraction network of the behavior recognition model, and extracting the image features of the image.
After the image to be processed is acquired, the image to be processed is input into a feature extraction network of the behavior recognition model, and image features of the image are extracted through the feature extraction network.
The feature extraction network is a backbone network of the behavior recognition model, can be implemented by adopting a convolutional neural network, and is used for extracting image features of the input image.
Any image to be processed is composed of a large number of pixels, and the information of individual pixels is very low-level; a convolutional neural network is used to extract deep-level features of the image.
Illustratively, the feature extraction network may be implemented by using ResNet50, VGG, LeNet, or other convolutional neural networks, and the embodiment is not limited in detail here.
Step S302, performing dimension reduction on the image features to obtain image features of a preset dimension.
The information encoder and information decoder of a classification network based on the multi-branch self-attention mechanism usually involve complex processing and are computationally expensive.
In this embodiment, after the image features of the image to be processed are extracted by the feature extraction network, the image features may be reduced to a preset dimension to obtain image features of the preset dimension, so as to reduce the amount of calculation of the classification network based on the multi-branch self-attention mechanism, reduce the running time, and improve the efficiency of behavior recognition.
The preset dimension may be set and adjusted according to the needs of the actual application scenario, and this embodiment is not specifically limited here.
Alternatively, the image features may be reduced in dimension by a 1 x 1 convolution to obtain image features of the preset dimension.
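As a minimal sketch of this step, assuming the backbone outputs 2048 channels and the preset dimension is 256 (both values are assumptions, not fixed by the disclosure):
```python
import torch
import torch.nn as nn

reduce = nn.Conv2d(in_channels=2048, out_channels=256, kernel_size=1)
features = torch.randn(1, 2048, 20, 20)  # image features from the backbone
reduced = reduce(features)               # -> (1, 256, 20, 20): the preset dimension
```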
Step S303, performing position coding on the pixels in the image features according to the preset dimension to obtain the position information code corresponding to the image features.
A conventional classification network based on the multi-branch self-attention mechanism is applied to natural language processing, where the input features contain no position information.
In this embodiment, in order to add the position information of the image, before the image features are input into the classification network based on the multi-branch self-attention mechanism, the position of each pixel in the image features needs to be encoded to generate a position information code corresponding to the image features. The image features are subsequently input into the classification network based on the multi-branch self-attention mechanism together with the corresponding position information code.
The position information code corresponding to the image features is used for recording the position information of each pixel.
Illustratively, the position information code may be stored as a position map whose size is consistent with that of the image features. The position map includes the coordinate position of each pixel in the image features (feature map).
In addition, the position information code only encodes the position of each pixel in a feature map of the preset dimension, so the position information codes corresponding to different feature maps of the same preset dimension are identical. Once the encoding mode is determined, image features of the preset dimension correspond to a fixed position information code; that is, the position information code depends only on the preset dimension and the encoding mode.
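One scheme consistent with this description is a fixed 2D sinusoidal position map, sketched below; the sinusoidal form and sizes are assumptions, since the disclosure only requires that the code record each pixel's position and be fixed for a given preset dimension and encoding mode.
```python
import math
import torch

def position_encoding(height, width, dim):
    """Return a fixed (dim, height, width) tensor of sinusoidal position codes."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos for y and x halves)"
    pe = torch.zeros(dim, height, width)
    d = dim // 2  # half the channels encode the y coordinate, half the x coordinate
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    y = torch.arange(height, dtype=torch.float32).unsqueeze(1)  # (H, 1)
    x = torch.arange(width, dtype=torch.float32).unsqueeze(1)   # (W, 1)
    pe[0:d:2] = torch.sin(y * div).t().unsqueeze(2).expand(-1, -1, width)
    pe[1:d:2] = torch.cos(y * div).t().unsqueeze(2).expand(-1, -1, width)
    pe[d::2] = torch.sin(x * div).t().unsqueeze(1).expand(-1, height, -1)
    pe[d + 1::2] = torch.cos(x * div).t().unsqueeze(1).expand(-1, height, -1)
    return pe
```
With a preset dimension of 256 and a 20 x 20 feature map, `position_encoding(20, 20, 256)` yields one fixed (256, 20, 20) map that can be added to any feature map of that size.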
Since the preset dimension is set in advance, the process of generating the position information code in step S303 may be performed in parallel with the process of acquiring the image features of the preset dimension in steps S301 to S302; alternatively, step S303 may be performed before steps S301 to S302. The order of the two processes is not specifically limited in this embodiment.
After the image features of the preset dimension and the position information code corresponding to the image features are obtained, the image features and the position information code are input into the classification network based on the multi-branch self-attention mechanism through steps S304 to S305, and the behavior information of the human body contained in the input image is identified through the classification network.
Step S304, inputting the image features and the position information code into the classification network based on the multi-branch self-attention mechanism, and outputting a specified number of pieces of behavior information and a confidence corresponding to each piece of behavior information through the classification network.
After the image features of the preset dimension and the position information code corresponding to the image features are obtained, the image features and the position information code may be spliced to obtain the input features of the classification network based on the multi-branch self-attention mechanism, and the input features are input into the classification network.
The behavior information is used for describing the interactive behavior of the human body and the object. The behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
For example, the position information of the human body may be a human body frame for framing the position of the human body. The position information of the object may be an object frame for framing a position of the object.
In this embodiment, the classification network based on the multi-branch self-attention mechanism may be a Transformer model. As shown in fig. 1, the classification network based on the multi-branch self-attention mechanism may include: an information encoder (encoder), an information decoder (decoder), and a feed-forward neural network (FFN).
The input to the information encoder includes three linear transformations Q, K, and V. In this embodiment, the image features and the position information codes corresponding to the image features form an input pair, and Q, K, and V of the information encoder are all linear transformations of this pair; that is, the three inputs Q, K, and V are the same.
Illustratively, the image features and the corresponding position information codes are sent to the information encoder, which is built on a multi-branch self-attention mechanism; its inputs Q, K, and V are all linear transformations of the pair formed by the image features and their position information codes. After residual connection and layer normalization, intermediate features are obtained; the intermediate features are sent to a simple feed-forward neural network, and a stage result is obtained through another residual connection and layer normalization. This process is repeated two more times to obtain the hidden features of the encoding stage, which are sent to the information decoder. The information decoder decodes the hidden features of the encoding stage to obtain the hidden features of the decoding stage, which are sent to a feed-forward neural network (FFN). Four vectors are obtained after the FFN, corresponding respectively to the human body position information, the object position information, the object category, and the behavior category. These four vectors form the triplet of human behavior information: human body position information; object position information and category; behavior category.
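The encoding-stage pipeline just described corresponds to a standard Transformer encoder layer: multi-head ("multi-branch") self-attention whose Q, K, and V all derive from the features plus position codes, residual connections with layer normalization, and a feed-forward network. A minimal sketch, with all sizes as assumptions:
```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, pos):
        # x: (B, H*W, dim) flattened image features; pos: matching position codes.
        q = k = v = x + pos               # Q, K, V all come from features + position codes
        mid, _ = self.attn(q, k, v)       # multi-branch (multi-head) self-attention
        x = self.norm1(x + mid)           # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))   # FFN, then residual + layer normalization
        return x
```
Stacking three such layers matches the "repeat two more times" description of the encoding stage.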
The behavior category is used for describing the behavior relation between the person and the object. All behavior types that can be recognized by the behavior recognition model may be set according to an actual application scenario, and this embodiment is not specifically limited here.
In this embodiment, the classification network based on the multi-branch self-attention mechanism outputs a specified number of pieces of behavior information (triplets) and a confidence corresponding to each piece of behavior information.
The designated number is a fixed number set during the training of the behavior recognition model, and may be set to be larger than the actual maximum possible number of human bodies, and the specific number may be set and adjusted according to the needs of the actual application scenario, which is not specifically limited in this embodiment.
Each output result of the classification network based on the multi-branch self-attention mechanism contains the specified number of pieces of behavior information.
Step S305, using the behavior information whose confidence is greater than the confidence threshold as the behavior information of the human body included in the image.
In this embodiment, each output result of the classification network based on the multi-branch self-attention mechanism includes the specified number of pieces of behavior information and a confidence corresponding to each piece of behavior information.
The confidence corresponding to each behavior information is used for measuring the accuracy of the prediction of the behavior information, the higher the confidence is, the higher the accuracy of the behavior information is, and the lower the confidence is, the lower the accuracy of the behavior information is.
In this step, behavior information whose confidence is greater than the confidence threshold is used as behavior information of a human body included in the input image, based on the confidence threshold.
The confidence threshold may be set and adjusted according to the needs of the actual application scenario, and is not specifically limited herein.
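A minimal sketch of this filtering step, with the threshold value as an assumption:
```python
def filter_by_confidence(behaviors, confidences, threshold=0.5):
    """Keep only behavior-information triplets whose confidence exceeds the threshold."""
    return [b for b, c in zip(behaviors, confidences) if c > threshold]
```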
In this embodiment, the classification network based on the multi-branch self-attention mechanism is a trained model and may be obtained by training with the method provided in the third embodiment or the fourth embodiment.
The behavior recognition method provided by this embodiment can be applied to various application scenarios, for example, identifying whether workers in an office exhibit irregular behaviors (such as smoking or playing with mobile phones), identifying whether there are people in a specified place who do not comply with dressing standards, identifying whether behaviors endangering public safety occur in a specified place, and the like.
Step S306, if it is determined, according to the behavior information of the human body contained in the image, that the behavior category in any piece of behavior information belongs to a preset behavior set, processing the behavior information whose behavior category belongs to the preset behavior set correspondingly.
In practical applications, when the method is applied to different application scenes, after the behavior information of the human body included in the image is recognized, the processing may be performed differently according to the behavior information of the human body included in the image.
In an optional implementation, a preset behavior set may be configured. The preset behavior set includes the behavior categories that require corresponding processing, such as categories of irregular behaviors or of prohibited behaviors, and may be set according to the needs of the actual application scenario; it is not specifically limited here.
After the behavior information of the human body contained in the image is identified, whether the behavior category of each piece of behavior information belongs to the preset behavior set can be judged.
If the behavior category in any behavior information belongs to the preset behavior set, the behavior information of which the behavior category belongs to the preset behavior set needs to be correspondingly processed.
Optionally, a behavior report is generated according to the behavior information whose behavior category belongs to the preset behavior set, and the behavior report is stored or sent according to a preset rule, so that relevant personnel can take corresponding processing measures after manual review.
For example, behavior recognition is performed on multi-frame images in a teaching video of a teacher, irregular behaviors of the teacher are recognized, a behavior report is recorded and generated, and the behavior report is stored for relevant personnel to perform corresponding processing after checking.
Optionally, according to the behavior information whose behavior category belongs to the preset behavior set, early-warning information may be issued directly to give a real-time warning that a behavior of a category in the preset behavior set has appeared in the image, realizing online real-time early warning.
For example, after a person's smoking behavior is recognized in an image collected in an office, a voice prompt that smoking is prohibited in the office can be output through an audio device, so as to standardize personnel behavior.
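A minimal sketch of this post-processing; the behavior category names and the report/warning handlers are hypothetical placeholders, not part of the disclosure:
```python
PRESET_BEHAVIORS = {"smoking", "playing_mobile_phone"}  # assumed preset behavior set

def generate_report(info):
    print(f"behavior report: {info}")    # placeholder: store or send per preset rules

def send_warning(info):
    print(f"real-time warning: {info}")  # placeholder: e.g., an audio prompt

def handle_behaviors(behavior_infos):
    for info in behavior_infos:
        if info["behavior_category"] in PRESET_BEHAVIORS:
            generate_report(info)
            send_warning(info)
```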
In practical application, when the method is applied to different application scenarios, after the behavior information of the human body included in the image is identified, the processing may be different according to the behavior information of the human body included in the image, and the configuration and the adjustment may be performed according to the needs of a specific application scenario, which is not specifically limited in this embodiment.
In this embodiment, the pre-trained behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The image to be processed is input into the feature extraction network to extract image features of the image; the image features and the position information codes corresponding to the image features are input into the classification network based on the multi-branch self-attention mechanism, and the behavior information of the human body contained in the image is identified through the classification network. This handles the long-term dependence problem well, makes the recognition of interaction behavior between a person and an object more accurate, and improves the accuracy of human behavior recognition.
Fig. 4 is a flowchart of a method for training a behavior recognition model according to a third embodiment of the present disclosure. The method provided in this embodiment may be performed by an electronic device for behavior recognition model training, which may be a terminal device or a server. In other embodiments, the electronic device may also be implemented in other manners, which is not specifically limited herein.
In the present embodiment, the overall structure of the behavior recognition model is as shown in fig. 1, and the behavior recognition model includes a feature extraction network 10 and a classification network 30 based on a multi-branch self-attention mechanism.
The feature extraction Network 10 is a backbone Network of a behavior recognition model, and may be implemented by a Convolutional Neural Network (CNN for short) for extracting image features of an input image.
Illustratively, the feature extraction network may be implemented by using ResNet50, VGG, LeNet, or other convolutional neural networks, and the embodiment is not limited in detail here.
The structure of the classification network 30 based on the multi-branch self-attention mechanism may include: an information encoder, an information decoder, and a feed-forward neural network. The classification network based on the multi-branch self-attention mechanism is used for identifying the behavior information of the human body contained in the input image according to the image features and the position information code.
Illustratively, the classification network based on the multi-branch self-attention mechanism may be a Transformer model. As shown in fig. 1, it may include: an information encoder (encoder), an information decoder (decoder), and a feed-forward neural network (FFN).
In addition, as shown in fig. 1, the behavior recognition model may further include a position encoding module 20 for generating a position information encoding of a preset dimension.
As shown in fig. 4, the method comprises the following specific steps:
step S401, inputting the sample image into a feature extraction network, and extracting the image features of the sample image.
In this embodiment, the behavior recognition model is trained based on training data. The training data includes a plurality of sample images and real data of behavior information of a human body included in each sample image.
And during each training, inputting the sample image into a feature extraction network of the behavior recognition model, and extracting the image features of the sample image through the feature extraction network.
In addition, before the image features of the sample image are input into the classification network based on the multi-branch self-attention mechanism, the position information code corresponding to the image features needs to be acquired.
The position information code corresponding to the image features is used for recording the position information of each pixel.
Step S402, inputting the image features and the position information codes corresponding to the image features into the classification network based on the multi-branch self-attention mechanism, and identifying the behavior information of the human body contained in the sample image through the classification network to obtain prediction data of the behavior information.
In the step, the image characteristics and the position information codes corresponding to the image characteristics are input into a classification network, the classification network identifies human behaviors, and a prediction result is output. The prediction result includes prediction data of behavior information of a human body included in the image.
The behavior information of the human body is used for describing the interactive behavior of the human body and the object, and the behavior information comprises the position information of the human body, the position information and the category of the object and the behavior category of the interactive behavior.
Step S403, calculating a loss value according to the real data and the predicted data of the behavior information.
After the prediction data of the behavior information of the human body contained in the sample image is obtained, a loss value is calculated according to the real data and the prediction data of the behavior information.
In this step, the loss value may be calculated by using one or more common loss functions, which are not specifically limited herein.
For example, an L1 loss, an IoU loss, or the like may be calculated; alternatively, at least two losses may be calculated and combined to obtain the final loss value.
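As a minimal sketch of combining two such losses on matched box pairs (boxes assumed in (x1, y1, x2, y2) format; the weights are assumptions):
```python
import torch
import torch.nn.functional as F
import torchvision

def combined_box_loss(pred_boxes, gt_boxes, l1_weight=5.0, iou_weight=2.0):
    """pred_boxes, gt_boxes: (N, 4) tensors of matched prediction/ground-truth pairs."""
    l1 = F.l1_loss(pred_boxes, gt_boxes)                            # L1 loss
    iou = torchvision.ops.box_iou(pred_boxes, gt_boxes).diagonal()  # pairwise IoU
    return l1_weight * l1 + iou_weight * (1.0 - iou).mean()         # combined loss value
```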
Step S404, updating the parameters of the behavior recognition model according to the loss value.
After the loss value is calculated, parameters of the behavior recognition model are updated according to the loss value.
This embodiment provides a training method for a behavior recognition model that includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism, where the classification network identifies the behavior information of the human body contained in an image according to the image features and the position information codes corresponding to the image features. The trained behavior recognition model handles the long-term dependence problem well, recognizes the interaction behavior between a person and an object more accurately, and improves the accuracy of human behavior recognition.
Fig. 5 is a flowchart of a method for training a behavior recognition model according to a fourth embodiment of the present disclosure. On the basis of the third embodiment, in this embodiment, calculating the loss value according to the real data and the predicted data of the behavior information includes: matching the predicted data with the real data, and determining the real data corresponding to the predicted data to minimize the comprehensive loss of the predicted data and the corresponding real data; and taking the comprehensive loss of the predicted data and the corresponding real data as a loss value.
As shown in fig. 5, the method comprises the following specific steps:
step S501, a sample image and annotation data corresponding to the sample image are obtained, wherein the annotation data comprise real data of behavior information of a human body contained in the sample image.
Before model training, training data are first acquired. The training data include a plurality of sample images and annotation data corresponding to each sample image, and the annotation data include real data of the human behavior information contained in the sample images. This provides a large amount of training data for subsequent model training, so that a behavior recognition model with high accuracy can be obtained.
The behavior information is used for describing the interactive behavior of the human body and the object. The behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
After the training data are obtained, the behavior recognition model is trained with the training data through multiple iterations of steps S502 to S507 until an iteration stop condition is met, yielding the trained behavior recognition model. The trained behavior recognition model is used for recognizing the behavior information of the human body contained in an input image.
The iteration stop condition includes but is not limited to: the accuracy of the behavior recognition model meeting the requirement, the number of iterations reaching a preset number, or the iteration time reaching a preset duration. The iteration stop condition may be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
Step S502, inputting the sample image into a feature extraction network, and extracting the image features of the sample image.
And during each training, inputting the sample image into a feature extraction network of the behavior recognition model, and extracting the image features of the sample image through the feature extraction network.
The feature extraction network is a backbone network of the behavior recognition model, can be implemented by adopting a convolutional neural network, and is used for extracting image features of the input image.
Any image to be processed is composed of a large number of pixels, and the information of individual pixels is very low-level; a convolutional neural network is used to extract deep-level features of the image.
Illustratively, the feature extraction network may be implemented by using ResNet50, VGG, LeNet, or other convolutional neural networks, and the embodiment is not limited in detail here.
Step S503, performing dimension reduction on the image features to obtain image features of a preset dimension.
The information encoder and information decoder of a classification network based on the multi-branch self-attention mechanism usually involve complex processing and are computationally expensive.
In this embodiment, after the image features of the sample image are extracted by the feature extraction network, the image features may be reduced in dimension to obtain image features of the preset dimension, so as to reduce the amount of calculation of the classification network based on the multi-branch self-attention mechanism, reduce the running time, and improve the efficiency of model training and behavior recognition.
The preset dimension may be set and adjusted according to the needs of the actual application scenario, and this embodiment is not specifically limited here.
Alternatively, the image features may be reduced in dimension by a 1 x 1 convolution to obtain image features of the preset dimension.
Step S504, performing position coding on the pixels in the image features according to the preset dimension to obtain the position information code corresponding to the image features.
A conventional classification network based on the multi-branch self-attention mechanism is applied to natural language processing, where the input features contain no position information.
In this embodiment, in order to add the position information of the image, before the image features are input into the classification network based on the multi-branch self-attention mechanism, the position of each pixel in the image features needs to be encoded to generate a position information code corresponding to the image features. The image features are subsequently input into the classification network based on the multi-branch self-attention mechanism together with the corresponding position information code.
The position information code corresponding to the image features is used for recording the position information of each pixel.
Illustratively, the position information code may be stored as a position map whose size is consistent with that of the image features. The position map includes the coordinate position of each pixel in the image features (feature map).
In addition, the position information code only encodes the position of each pixel in a feature map of the preset dimension, so the position information codes corresponding to different feature maps of the same preset dimension are identical. Once the encoding mode is determined, image features of the preset dimension correspond to a fixed position information code; that is, the position information code depends only on the preset dimension and the encoding mode.
Since the preset dimension is set in advance, the process of generating the position information code in step S504 may be performed in parallel with the process of acquiring the image features of the preset dimension in steps S502 to S503; alternatively, step S504 may be performed before steps S502 to S503. The order of the two processes is not specifically limited in this embodiment.
Step S505, inputting the image features and the position information codes corresponding to the image features into the classification network, and outputting a specified number of pieces of prediction data through the classification network.
After the image features of the preset dimension and the position information code corresponding to the image features are obtained, the image features and the position information code may be spliced to obtain the input features of the classification network based on the multi-branch self-attention mechanism, and the input features are input into the classification network.
For example, the position information of the human body may be a human body frame for framing the position of the human body. The position information of the object may be an object frame for framing a position of the object.
In this embodiment, the classification network based on the multi-branch self-attention mechanism may be a Transformer model. As shown in fig. 1, the classification network based on the multi-branch self-attention mechanism may include: an information encoder (encoder), an information decoder (decoder), and a feed-forward neural network (FFN).
The input to the information encoder includes three linear transformations Q, K, and V. In this embodiment, the image features and the position information codes corresponding to the image features form an input pair, and Q, K, and V of the information encoder are all linear transformations of this pair; that is, the three inputs Q, K, and V are the same.
For the specific structure and function of the classification network based on the multi-branch self-attention mechanism, reference may be made to the structure and function of the Transformer model, which are not repeated here.
Illustratively, the image features and the corresponding position information codes are sent to the information encoder, which is built on a multi-branch self-attention mechanism; its inputs Q, K, and V are all linear transformations of the pair formed by the image features and their position information codes. After residual connection and layer normalization, intermediate features are obtained; the intermediate features are sent to a simple feed-forward neural network, and a stage result is obtained through another residual connection and layer normalization. This process is repeated two more times to obtain the hidden features of the encoding stage, which are sent to the information decoder. The information decoder decodes the hidden features of the encoding stage to obtain the hidden features of the decoding stage, which are sent to a feed-forward neural network (FFN). Four vectors are obtained after the FFN, corresponding respectively to the human body position information, the object position information, the object category, and the behavior category. These four vectors form the triplet of human behavior information: human body position information; object position information and category; behavior category.
The behavior category is used for describing the behavior relation between the person and the object. All behavior types that can be recognized by the behavior recognition model may be set according to an actual application scenario, and this embodiment is not specifically limited here.
In this embodiment, the classification network based on the multi-branch self-attention mechanism outputs a specified number of pieces of behavior information (triplets) and a confidence corresponding to each piece of behavior information.
The confidence corresponding to each behavior information is used for measuring the accuracy of the behavior information, the higher the confidence is, the higher the accuracy of the behavior information is, and the lower the confidence is, the lower the accuracy of the behavior information is.
The designated number is a preset fixed number, and may be set to be greater than the actual maximum possible number of human bodies, and the specific number may be set and adjusted according to the needs of the actual application scenario, which is not specifically limited in this embodiment.
Each output result of the classification network based on the multi-branch self-attention mechanism contains the specified number of pieces of behavior information.
And step S506, calculating a loss value according to the real data and the predicted data of the behavior information.
After the prediction data of the behavior information of the human body contained in the sample image is obtained, a loss value is calculated according to the real data and the prediction data of the behavior information.
In this embodiment, the annotation data of a sample image include one or more pieces of real data, each corresponding to one piece of behavior information; the real data of a behavior-information triplet include the real human body position information, the object position information and category, and the behavior category.
Optionally, matching the predicted data with the real data, and determining the real data corresponding to the predicted data, so that the comprehensive loss of the predicted data and the corresponding real data is minimum; and taking the comprehensive loss of the predicted data and the corresponding real data as a loss value.
By matching the predicted data with the real data, an optimal matching between the set of predicted data and the real data can be determined that minimizes the comprehensive loss of the predicted data and the corresponding real data.
Optionally, the Hungarian algorithm is used to match the predicted data with the real data and determine the real data corresponding to the predicted data, so that the comprehensive loss of the predicted data and the corresponding real data is minimized; this accurately determines the optimal matching between the set of predicted data and the real data.
Illustratively, the loss between the predicted data (triple) and the corresponding real data (triple) can be used as the distance between the predicted data (triple) and the corresponding real data (triple), the predicted data and the real data are matched by using the hungarian algorithm, and the real data corresponding to the predicted data is determined, so that the comprehensive distance between the predicted data and the corresponding real data is minimum.
In addition, since the number of pieces of real data of the sample image is smaller than or equal to the number of pieces of predicted data of the sample image (i.e., the specified number). And if the number of the real data of the sample image is less than the specified number, expanding the real data of the sample image to the specified number by using the empty set, so that the one-to-one corresponding relation between the real data and the predicted data can be realized when the real data corresponding to the sample image is matched with the predicted data.
In this embodiment, the comprehensive loss of the predicted data and the corresponding real data is the sum of the losses between each piece of predicted data and its corresponding real data.
The loss between each piece of predicted data and its corresponding real data is determined from one or more loss function values between them, which improves the soundness of the loss and thus the accuracy of the trained recognition model.
Optionally, for each piece of predicted data and its corresponding real data, the L1 loss between them may be computed as the loss between the predicted data and the corresponding real data.
Alternatively, the IoU loss between the predicted data and the real data may be computed as this loss.
Alternatively, both the L1 loss and the IoU loss may be computed, and a Hungarian loss combining the L1 loss and the IoU loss is used as the loss between the predicted data and the corresponding real data.
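The sketch below illustrates the matching just described, using scipy.optimize.linear_sum_assignment (an implementation of the Hungarian algorithm). It is simplified under stated assumptions: only one box per triplet is matched, the cost uses just the L1 and IoU terms, and all function names and weights are illustrative; a full implementation would build the cost from the human box, object box, and both categories.

import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hungarian_match(pred_boxes, gt_boxes, l1_w=1.0, iou_w=1.0):
    """pred_boxes: (N, 4) predictions (N = specified number);
    gt_boxes: (M, 4) real data, M <= N. Predictions left unmatched
    correspond to the empty-set padding described above."""
    n, m = len(pred_boxes), len(gt_boxes)
    cost = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            l1 = np.abs(pred_boxes[i] - gt_boxes[j]).sum()        # L1 loss
            iou_loss = 1.0 - box_iou(pred_boxes[i], gt_boxes[j])  # IoU loss
            cost[i, j] = l1_w * l1 + iou_w * iou_loss
    rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
    comprehensive_loss = cost[rows, cols].sum()   # sum over matched pairs
    return list(zip(rows, cols)), comprehensive_loss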
Step S507: minimize the loss value using a gradient descent algorithm, and update the parameters of the behavior recognition model.
After the loss value is calculated, the parameters of the behavior recognition model are updated according to the loss value.
This step provides an optional implementation: the loss value is minimized with a gradient descent algorithm and the parameters of the behavior recognition model are updated, thereby realizing model training and improving the accuracy of the trained behavior recognition model.
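A minimal sketch of this parameter update in PyTorch follows; the optimizer choice and learning rate are illustrative assumptions, and model and loss_fn stand for a model and a differentiable matched loss in the spirit of the earlier sketches.

import torch

def train_step(model, loss_fn, feats, pos, targets, optimizer):
    preds = model(feats, pos)          # predicted data (triplets)
    loss = loss_fn(preds, targets)     # loss value against matched real data
    optimizer.zero_grad()
    loss.backward()                    # gradients of the loss value
    optimizer.step()                   # gradient-descent parameter update
    return loss.item()

# Hypothetical usage:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# loss_value = train_step(model, loss_fn, feats, pos, targets, optimizer)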
This embodiment provides a training method for a behavior recognition model comprising a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The classification network recognizes the behavior information of the human body contained in an image from the image features and the position information codes corresponding to the image features. The trained behavior recognition model handles the long-term dependency problem well, recognizes interaction behavior between a person and an object more accurately, and improves the accuracy of human behavior recognition. Furthermore, during model training, the Hungarian algorithm is used to match the predicted data with the real data and determine the real data corresponding to each piece of predicted data, so that the comprehensive loss of the predicted data and the corresponding real data is minimal; this accurately determines the optimal matching between the set of predicted data and the real data. The loss value is then minimized with a gradient descent algorithm and the parameters of the behavior recognition model are updated, yielding a behavior recognition model with high accuracy.
Fig. 6 is a schematic diagram of a behavior recognition device provided in a fifth embodiment of the present disclosure. The behavior recognition device provided by this embodiment can execute the processing flow provided by the behavior recognition method embodiments. As shown in fig. 6, the behavior recognition device 60 includes: a feature extraction module 601 and a behavior recognition module 602.
Specifically, the feature extraction module 601 is configured to input an image to be processed into a feature extraction network of the behavior recognition model, and extract image features of the image.
The behavior recognition module 602 is configured to encode the image features and the position information corresponding to the image features, input the result into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognize the behavior information of the human body contained in the image through the classification network.
The device provided in the embodiment of the present disclosure may be specifically configured to execute the method embodiment provided in the first embodiment, and specific functions are not described herein again.
In this embodiment, the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. An image to be processed is input into the feature extraction network to extract image features of the image; the image features and the position information codes corresponding to the image features are input into the classification network based on the multi-branch self-attention mechanism, and the behavior information of the human body contained in the image is recognized through the classification network. This handles the long-term dependency problem well, makes recognition of interaction behavior between a person and an object more accurate, and improves the accuracy of human behavior recognition.
Fig. 7 is a schematic diagram of a behavior recognition device provided by a sixth embodiment of the present disclosure. The behavior recognition device provided by this embodiment can execute the processing flow provided by the behavior recognition method embodiments. As shown in fig. 7, the behavior recognition device 70 includes: a feature extraction module 701 and a behavior recognition module 702.
Specifically, the feature extraction module 701 is configured to input an image to be processed into a feature extraction network of the behavior recognition model, and extract an image feature of the image.
The behavior recognition module 702 is configured to encode the image features and the position information corresponding to the image features, input the result into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognize the behavior information of the human body contained in the image through the classification network.
Optionally, the behavior recognition module is further configured to:
encode the image features and the position information, input the result into the classification network, and output a specified number of pieces of behavior information and a confidence corresponding to each piece through the classification network; and take the behavior information whose confidence is greater than a confidence threshold as the behavior information of the human body contained in the image.
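A minimal sketch of this confidence filtering, assuming the triplets and confidences arrive as parallel lists; the triplet structure and threshold value are illustrative assumptions.

def filter_by_confidence(triplets, confidences, threshold=0.5):
    """triplets: the specified number of behavior-information triplets;
    confidences: one confidence per triplet, in [0, 1]."""
    return [t for t, c in zip(triplets, confidences) if c > threshold]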
Optionally, the feature extraction module is further configured to:
input the image to be processed into the feature extraction network of the behavior recognition model, extract the image features of the image, and then perform dimensionality reduction on the image features to obtain image features of a preset dimension.
Optionally, as shown in fig. 7, the behavior recognition device 70 may further include: a position encoding module 703 configured to:
perform position encoding on the pixels in the image features according to the preset dimension, to obtain the position information codes corresponding to the image features.
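One common way to realize these two steps, sketched below as an assumption (the disclosure does not fix a particular encoding scheme), is a 1x1 convolution for the dimensionality reduction and fixed 2D sine-cosine position codes of the preset dimension; all sizes and names are illustrative.

import math
import torch
import torch.nn as nn

reduce_dim = nn.Conv2d(2048, 256, kernel_size=1)  # backbone channels -> preset dim

def position_codes(h, w, d_model=256):
    """Returns (d_model, h, w): half the channels encode the y coordinate
    of each pixel, half the x coordinate."""
    half = d_model // 2
    div = torch.exp(torch.arange(0, half, 2).float()
                    * (-math.log(10000.0) / half))
    y = torch.arange(h).float()[:, None] * div[None, :]
    x = torch.arange(w).float()[:, None] * div[None, :]
    pe_y = torch.cat([y.sin(), y.cos()], dim=1)       # (h, half)
    pe_x = torch.cat([x.sin(), x.cos()], dim=1)       # (w, half)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, half),
                    pe_x[None, :, :].expand(h, w, half)], dim=2)
    return pe.permute(2, 0, 1)                        # (d_model, h, w)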
Optionally, the behavior information is used to describe an interactive behavior of the human body and the object, and the behavior information includes position information of the human body, position information and category of the object, and a behavior category of the interactive behavior.
Optionally, as shown in fig. 7, the behavior recognition device 70 may further include: an image acquisition module 704 configured to:
receive an image to be processed; or acquire an image containing a person in a specified place as the image to be processed.
Optionally, as shown in fig. 7, the behavior recognition device 70 may further include: an identify post-processing module 705 to:
according to the behavior information of the human body contained in the image, if the behavior category in any piece of behavior information is determined to belong to a preset behavior set, perform corresponding processing on the behavior information whose behavior category belongs to the preset behavior set.
Optionally, the identification post-processing module is further configured to:
generate a behavior report according to the behavior information whose behavior category belongs to the preset behavior set, and store or send the behavior report according to a preset rule.
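A minimal sketch of this post-processing; the preset behavior set, the report fields, and the delivery callable are assumptions for illustration, not defined by this disclosure.

PRESET_BEHAVIOR_SET = {"smoking", "climbing"}  # illustrative categories

def handle_behaviors(behavior_infos, deliver):
    """behavior_infos: recognized triplets as dicts with a 'behavior_category';
    deliver: callable that stores or sends a report per the preset rule."""
    for info in behavior_infos:
        if info["behavior_category"] in PRESET_BEHAVIOR_SET:
            deliver({"category": info["behavior_category"], "detail": info})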
The device provided in the embodiment of the present disclosure may be specifically configured to execute the method embodiment provided in the second embodiment, and specific functions are not described herein again.
In this embodiment, the pre-trained behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism. An image to be processed is input into the feature extraction network to extract image features of the image; the image features and the position information codes corresponding to the image features are input into the classification network based on the multi-branch self-attention mechanism, and the behavior information of the human body contained in the image is recognized through the classification network. This handles the long-term dependency problem well, makes recognition of interaction behavior between a person and an object more accurate, and improves the accuracy of human behavior recognition.
Fig. 8 is a schematic diagram of a device for training a behavior recognition model according to a seventh embodiment of the present disclosure. The device provided by this embodiment can execute the processing flow provided by the behavior recognition model training method embodiments.
In this embodiment, the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism.
As shown in fig. 8, the apparatus 80 for behavior recognition model training includes: a feature extraction module 801, a behavior recognition module 802, a loss calculation module 803, and a parameter update module 804.
Specifically, the feature extraction module 801 is configured to input the sample image into a feature extraction network, and extract an image feature of the sample image.
The behavior recognition module 802 is configured to encode the image features and the position information corresponding to the image features, input the result into the classification network based on the multi-branch self-attention mechanism, and recognize, through the classification network, the behavior information of the human body contained in the sample image to obtain predicted data of the behavior information.
And a loss calculating module 803, configured to calculate a loss value according to the real data and the predicted data of the behavior information.
A parameter updating module 804, configured to update parameters of the behavior recognition model according to the loss value.
The device provided in the embodiment of the present disclosure may be specifically configured to execute the method embodiment provided in the third embodiment, and specific functions are not described herein again.
This embodiment provides equipment for training a behavior recognition model, wherein the behavior recognition model comprises a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The classification network recognizes the behavior information of the human body contained in an image from the image features and the position information codes corresponding to the image features. The trained behavior recognition model handles the long-term dependency problem well, recognizes interaction behavior between a person and an object more accurately, and improves the accuracy of human behavior recognition.
Fig. 9 is a schematic diagram of a device for training a behavior recognition model according to an eighth embodiment of the present disclosure. The device provided by this embodiment can execute the processing flow provided by the behavior recognition model training method embodiments.
In this embodiment, the behavior recognition model includes a feature extraction network and a classification network based on a multi-branch self-attention mechanism.
As shown in fig. 9, the apparatus 90 for behavior recognition model training includes: a feature extraction module 901, a behavior recognition module 902, a loss calculation module 903 and a parameter update module 904.
Specifically, the feature extraction module 901 is configured to input the sample image into a feature extraction network, and extract an image feature of the sample image.
The behavior recognition module 902 is configured to encode the image features and the position information corresponding to the image features, input the result into the classification network based on the multi-branch self-attention mechanism, and recognize, through the classification network, the behavior information of the human body contained in the sample image to obtain predicted data of the behavior information.
And the loss calculating module 903 is used for calculating a loss value according to the real data and the predicted data of the behavior information.
And a parameter updating module 904 for updating the parameters of the behavior recognition model according to the loss values.
Optionally, the feature extraction module is further configured to:
input the sample image into the feature extraction network, extract the image features of the sample image, and then perform dimensionality reduction on the image features to obtain image features of a preset dimension.
Optionally, as shown in fig. 9, the device 90 for training the behavior recognition model may further include: a position encoding module 905 configured to:
perform position encoding on the pixels in the image features according to the preset dimension, to obtain the position information codes corresponding to the image features.
Optionally, the behavior recognition module is further configured to:
encode the image features and the position information corresponding to the image features, input the result into the classification network, and output the specified number of pieces of predicted data through the classification network.
Optionally, the loss calculation module is further configured to:
match the predicted data with the real data, and determine the real data corresponding to each piece of predicted data so that the comprehensive loss of the predicted data and the corresponding real data is minimized; and take the comprehensive loss of the predicted data and the corresponding real data as the loss value.
Optionally, the loss calculation module is further configured to:
match the predicted data with the real data using the Hungarian algorithm, and determine the real data corresponding to each piece of predicted data, so that the comprehensive loss of the predicted data and the corresponding real data is minimal.
Optionally, the comprehensive loss of the predicted data and the corresponding real data is the sum of the losses between each piece of predicted data and its corresponding real data, where the loss between each piece of predicted data and the corresponding real data is determined from one or more loss function values between them.
Optionally, the parameter updating module is further configured to:
minimize the loss value using a gradient descent algorithm, and update the parameters of the behavior recognition model.
Optionally, as shown in fig. 9, the device 90 for training the behavior recognition model may further include: a training data acquisition module 906 to:
acquire the sample image and annotation data corresponding to the sample image, where the annotation data includes real data of the behavior information of the human body contained in the sample image.
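As an illustration, such annotation data could be laid out as below; the field names and values are assumptions for the example, not a format defined by this disclosure.

# Hypothetical annotation record for one sample image: one entry per
# ground-truth behavior triplet.
annotation = {
    "image": "sample_0001.jpg",
    "behaviors": [
        {
            "human_box": [120, 40, 260, 400],    # human position (x1, y1, x2, y2)
            "object_box": [200, 180, 280, 260],  # object position
            "object_category": "cup",            # object category
            "behavior_category": "drink_from",   # behavior (interaction) category
        },
    ],
}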
Optionally, the behavior information is used to describe an interactive behavior of the human body and the object, and the behavior information includes position information of the human body, position information and category of the object, and a behavior category of the interactive behavior.
The device provided in the embodiment of the present disclosure may be specifically configured to execute the method embodiment provided in the fourth embodiment, and specific functions are not described herein again.
This embodiment provides equipment for training a behavior recognition model, wherein the behavior recognition model comprises a feature extraction network and a classification network based on a multi-branch self-attention mechanism. The classification network recognizes the behavior information of the human body contained in an image from the image features and the position information codes corresponding to the image features. The trained behavior recognition model handles the long-term dependency problem well, recognizes interaction behavior between a person and an object more accurately, and improves the accuracy of human behavior recognition. Furthermore, during model training, the Hungarian algorithm is used to match the predicted data with the real data and determine the real data corresponding to each piece of predicted data, so that the comprehensive loss of the predicted data and the corresponding real data is minimal; this accurately determines the optimal matching between the set of predicted data and the real data. The loss value is minimized with a gradient descent algorithm and the parameters of the behavior recognition model are updated, yielding a behavior recognition model with high accuracy.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 1001 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 performs the respective methods and processes described above, such as the behavior recognition method or the behavior recognition model training method. For example, in some embodiments, these methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform these methods by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (39)

1. A method of behavior recognition, comprising:
inputting an image to be processed into a feature extraction network of a behavior recognition model, and extracting image features of the image;
and encoding the image features and the position information corresponding to the image features, inputting the result into a classification network of the behavior recognition model based on a multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network.
2. The method according to claim 1, wherein the encoding the image features and the position information corresponding to the image features, inputting the result into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network comprises:
encoding the image features and the position information, inputting them into the classification network, and outputting a specified number of pieces of behavior information and a confidence corresponding to each piece of behavior information through the classification network;
and taking the behavior information whose confidence is greater than a confidence threshold as the behavior information of the human body contained in the image.
3. The method according to claim 1 or 2, wherein the inputting the image to be processed into the feature extraction network of the behavior recognition model, and after extracting the image features of the image, the method further comprises:
and performing dimensionality reduction on the image features to obtain image features of a preset dimension.
4. The method according to claim 3, wherein before the encoding the image features and the position information corresponding to the image features, inputting the result into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network, the method further comprises:
and according to the preset dimension, carrying out position coding on pixels in the image features to obtain position information codes corresponding to the image features.
5. The method of any one of claims 1-4,
the behavior information is used for describing the interactive behavior of the human body and the object,
the behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
6. The method of claim 5, wherein the inputting the image to be processed into a feature extraction network of a behavior recognition model, before extracting the image features of the image, further comprises:
receiving the image to be processed;
or,
and acquiring an image containing a person in a specified place as the image to be processed.
7. The method according to claim 5, wherein after the encoding the image features and the position information corresponding to the image features, inputting the result into the classification network of the behavior recognition model based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network, the method further comprises:
according to the behavior information of the human body contained in the image, if the behavior type in any one of the behavior information is determined to belong to a preset behavior set, the behavior information of which the behavior type belongs to the preset behavior set is correspondingly processed.
8. The method of claim 7, wherein the performing corresponding processing on the behavior information of which behavior category belongs to a preset behavior set comprises:
and generating a behavior report according to the behavior information of which the behavior type belongs to a preset behavior set, and storing or sending the behavior report according to a preset rule.
9. A method of behavior recognition model training, wherein the behavior recognition model comprises a feature extraction network and a classification network based on a multi-branch self-attention mechanism, the method comprising:
inputting a sample image into the feature extraction network, and extracting image features of the sample image;
encoding the image features and the position information corresponding to the image features, inputting the result into the classification network based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the sample image through the classification network to obtain prediction data of the behavior information;
and calculating a loss value according to the real data and the predicted data of the behavior information, and updating the parameters of the behavior recognition model according to the loss value.
10. The method of claim 9, wherein said inputting a sample image into said feature extraction network further comprises, after extracting image features of said sample image:
and performing dimensionality reduction on the image features to obtain image features of a preset dimension.
11. The method according to claim 10, wherein before the encoding the image features and the position information corresponding to the image features, inputting the result into the classification network based on the multi-branch self-attention mechanism, and recognizing, through the classification network, the behavior information of the human body contained in the sample image to obtain the prediction data of the behavior information, the method further comprises:
and according to the preset dimension, carrying out position coding on pixels in the image features to obtain position information codes corresponding to the image features.
12. The method according to any one of claims 9 to 11, wherein the encoding the image features and the position information corresponding to the image features, inputting the result into the classification network based on the multi-branch self-attention mechanism, and recognizing, through the classification network, the behavior information of the human body contained in the sample image to obtain the prediction data of the behavior information comprises:
encoding the image features and the position information corresponding to the image features, inputting the result into the classification network, and outputting the specified number of pieces of prediction data through the classification network.
13. The method of claim 9 or 12, wherein said calculating a loss value from said real data and said predicted data of said behavior information comprises:
matching the predicted data and the real data, and determining the real data corresponding to the predicted data to ensure that the comprehensive loss of the predicted data and the corresponding real data is minimum;
and taking the comprehensive loss of the prediction data and the corresponding real data as a loss value.
14. The method of claim 13, wherein said matching the predicted data and the real data, determining the real data to which the predicted data corresponds such that a combined loss of the predicted data and the corresponding real data is minimized, comprises:
and matching the predicted data and the real data by using a Hungarian algorithm, and determining the real data corresponding to the predicted data to ensure that the comprehensive loss of the predicted data and the corresponding real data is minimum.
15. The method of claim 14, wherein,
the combined loss of the prediction data and the corresponding real data is as follows: a sum of losses between each of the prediction data and corresponding real data;
wherein the loss between each of the prediction data and the corresponding real data is determined according to one or more loss function values between the prediction data and the corresponding real data.
16. The method of any of claims 9-15, wherein the updating parameters of the behavior recognition model according to the loss values comprises:
and minimizing the loss value by using a gradient descent algorithm, and updating parameters of the behavior recognition model.
17. The method of claim 9, wherein before the extracting the image features of the sample image and determining the position information codes corresponding to the image features, the method further comprises:
and acquiring the sample image and annotation data corresponding to the sample image, wherein the annotation data comprises real data of behavior information of a human body contained in the sample image.
18. The method according to any one of claims 9-17, wherein the behavior information is used to describe an interaction behavior of the human body with an object,
the behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
19. An apparatus for behavior recognition, comprising:
the characteristic extraction module is used for inputting the image to be processed into a characteristic extraction network of the behavior recognition model and extracting the image characteristics of the image;
and the behavior recognition module is used for encoding the image features and the position information corresponding to the image features, inputting the result into a classification network of the behavior recognition model based on a multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the image through the classification network.
20. The device of claim 19, wherein the behavior identification module is further to:
encoding the image features and the position information, inputting them into the classification network, and outputting a specified number of pieces of behavior information and a confidence corresponding to each piece of behavior information;
and taking the behavior information whose confidence is greater than a confidence threshold as the behavior information of the human body contained in the image.
21. The apparatus of claim 19 or 20, wherein the feature extraction module is further to:
inputting the image to be processed into the feature extraction network of the behavior recognition model, extracting the image features of the image, and then performing dimensionality reduction on the image features to obtain image features of a preset dimension.
22. The apparatus of claim 21, further comprising: a position encoding module to:
and according to the preset dimension, carrying out position coding on pixels in the image features to obtain position information codes corresponding to the image features.
23. The apparatus according to any one of claims 19-22, wherein the behavior information is used to describe an interaction behavior of the human body with an object,
the behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
24. The apparatus of claim 23, further comprising: an image acquisition module to:
receiving the image to be processed;
or,
and acquiring an image containing a person in a specified place as the image to be processed.
25. The apparatus of claim 23, further comprising: an identification post-processing module to:
according to the behavior information of the human body contained in the image, if the behavior type in any one of the behavior information is determined to belong to a preset behavior set, the behavior information of which the behavior type belongs to the preset behavior set is correspondingly processed.
26. The device of claim 25, wherein the identification post-processing module is further configured to:
and generating a behavior report according to the behavior information of which the behavior type belongs to a preset behavior set, and storing or sending the behavior report according to a preset rule.
27. An apparatus for behavior recognition model training, wherein the behavior recognition model comprises a feature extraction network and a classification network based on a multi-branch self-attention mechanism, the apparatus comprising:
the characteristic extraction module is used for inputting a sample image into the characteristic extraction network and extracting the image characteristics of the sample image;
the behavior recognition module is used for encoding the image features and the position information corresponding to the image features, inputting the result into the classification network based on the multi-branch self-attention mechanism, and recognizing the behavior information of the human body contained in the sample image through the classification network to obtain prediction data of the behavior information;
the loss calculation module is used for calculating a loss value according to the real data and the predicted data of the behavior information;
and the parameter updating module is used for updating the parameters of the behavior recognition model according to the loss value.
28. The device of claim 27, wherein the feature extraction module is further to:
and inputting the sample image into the feature extraction network, extracting the image features of the sample image, and then performing dimensionality reduction on the image features to obtain image features of a preset dimension.
29. The apparatus of claim 28, further comprising: a position encoding module to:
and according to the preset dimension, carrying out position coding on pixels in the image features to obtain position information codes corresponding to the image features.
30. The device of any of claims 27-29, wherein the behavior identification module is further to:
and encoding the image features and the position information corresponding to the image features, inputting the result into the classification network, and outputting the specified number of pieces of prediction data through the classification network.
31. The apparatus of claim 27 or 30, wherein the loss calculation module is further configured to:
matching the predicted data and the real data, and determining the real data corresponding to the predicted data to ensure that the comprehensive loss of the predicted data and the corresponding real data is minimum; and taking the comprehensive loss of the prediction data and the corresponding real data as a loss value.
32. The device of claim 31, wherein the loss calculation module is further to:
and matching the predicted data and the real data by using a Hungarian algorithm, and determining the real data corresponding to the predicted data to ensure that the comprehensive loss of the predicted data and the corresponding real data is minimum.
33. The apparatus of claim 32, wherein,
the combined loss of the prediction data and the corresponding real data is as follows: a sum of losses between each of the prediction data and corresponding real data;
wherein the loss between each of the prediction data and the corresponding real data is determined according to one or more loss function values between the prediction data and the corresponding real data.
34. The device of any of claims 27-33, wherein the parameter update module is further to:
and minimizing the loss value by using a gradient descent algorithm, and updating parameters of the behavior recognition model.
35. The apparatus of claim 27, further comprising: a training data acquisition module to:
and acquiring the sample image and annotation data corresponding to the sample image, wherein the annotation data comprises real data of behavior information of a human body contained in the sample image.
36. The apparatus according to any one of claims 27-35, wherein the behavior information is used to describe an interaction behavior of the human body with an object,
the behavior information includes position information of the human body, position information and a category of the object, and a behavior category of the interactive behavior.
37. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-18.
38. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-18.
39. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-18.
CN202110721126.8A 2021-06-28 2021-06-28 Behavior recognition and model training method, apparatus, storage medium, and program product Pending CN113420681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110721126.8A CN113420681A (en) 2021-06-28 2021-06-28 Behavior recognition and model training method, apparatus, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN113420681A true CN113420681A (en) 2021-09-21

Family

ID=77716949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110721126.8A Pending CN113420681A (en) 2021-06-28 2021-06-28 Behavior recognition and model training method, apparatus, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN113420681A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112464861A (en) * 2020-12-10 2021-03-09 中山大学 Behavior early recognition method, system and storage medium for intelligent human-computer interaction
CN112819011A (en) * 2021-01-28 2021-05-18 北京迈格威科技有限公司 Method and device for identifying relationships between objects and electronic system
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837305A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Target detection and model training method, device, equipment and storage medium
US11823437B2 (en) 2021-09-29 2023-11-21 Beijing Baidu Netcom Science Technology Co., Ltd. Target detection and model training method and apparatus, device and storage medium
CN114445748A (en) * 2022-01-28 2022-05-06 深圳市中云慧通科技有限公司 Video human body feature detection and linkage alarm method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination