CN116597516A - Training method, classification method, detection method, device, system and equipment - Google Patents

Training method, classification method, detection method, device, system and equipment

Info

Publication number
CN116597516A
Authority
CN
China
Prior art keywords
feature
channel
processing
behavior
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310576951.2A
Other languages
Chinese (zh)
Inventor
张伟
袁甲
张�浩
李溢翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202310576951.2A
Publication of CN116597516A
Legal status: Pending


Classifications

    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/40 Extraction of image or video features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions

Abstract

The present disclosure provides a training method, a classification method, a detection method, a device, a system and equipment, which can be applied to the fields of behavior classification and finance. The training method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of training videos and classification labels; inputting a training video into an initial behavior classification model, and outputting a plurality of multi-scale feature matrixes corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object; performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent positions of different key points of an object in each image frame; inputting a classification result and a classification label corresponding to each training video into a loss function, and outputting a loss result; and iteratively adjusting network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.

Description

Training method, classification method, detection method, device, system and equipment
Technical Field
The present disclosure relates to the field of behavior classification and finance, and more particularly, to a training method of an object behavior classification model, an object behavior classification method, a driving safety detection method of a transport vehicle, a training apparatus of an object behavior classification model, an object behavior classification apparatus, a vehicle monitoring system, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the development of technology, financial business management has imposed higher requirements on aspects such as financial security and the quality and effect of financial services. As an important component of financial business, financial escort plays an important role in maintaining the safety of property transportation, guaranteeing banking financial business, improving financial service quality and building financial market order.
In the financial escort business, traditional financial escort involves many manual operation flows, which affects how banks carry out financial business and system management. Transport vehicles and escort personnel are characterized by uncertainty, randomness, mobility and the like, and these uncontrollable factors are among the main reasons for the reduced efficiency of the financial escort business. Meanwhile, some non-compliant behaviors of escort personnel reduce the safety of transportation tasks, bringing unnecessary safety risks and property losses to escort tasks.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following problems: when classifying the behaviors of an object, the accuracy of the classification result is poor and the behaviors of the object cannot be accurately identified, which affects the escort task.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a training method of an object behavior classification model, an object behavior classification method, a running safety detection method of a transport vehicle, a training apparatus of an object behavior classification model, an object behavior classification apparatus, a vehicle monitoring system, an electronic device, a computer-readable storage medium, and a computer program product.
According to a first aspect of the present disclosure, there is provided a training method of an object behavior classification model, including:
acquiring a training set, wherein the training set comprises a plurality of training videos and classification labels, and the videos comprise a plurality of image frames which are associated in time sequence;
inputting the training video into an initial behavior classification model, and outputting a plurality of multi-scale feature matrixes corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object;
performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent positions of different key points of the object in each image frame;
Inputting a classification result and a classification label corresponding to each training video into a loss function, and outputting a loss result;
and iteratively adjusting network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.
According to an embodiment of the present disclosure, in a case where the number of the multi-scale feature matrices is three, the inputting the training video into the initial behavior classification model, outputting a plurality of multi-scale feature matrices corresponding to the training video, includes:
based on a first preset step length, carrying out channel adjustment and feature extraction processing on a plurality of image frames by utilizing a feature extraction sub-model to obtain first image features;
processing the first image features by using a channel adjustment sub-model to obtain second image features and third image features;
and respectively processing the first image feature, the second image feature and the third image feature by using the first multi-scale sub-model, the second multi-scale sub-model and the third multi-scale sub-model to obtain three multi-scale feature matrixes.
According to an embodiment of the disclosure, the performing channel adjustment and feature extraction processing on the plurality of image frames by using a feature extraction sub-model based on a first preset step length to obtain a first image feature includes:
Performing channel adjustment and feature extraction processing on a plurality of image frames by using a plurality of first convolution normalization layers to obtain first intermediate features, wherein one convolution normalization layer corresponds to a first preset step length;
carrying out channel adjustment and feature stacking treatment on the first intermediate features by using a first feature treatment layer to obtain second intermediate features;
performing downsampling processing on the second intermediate feature by using a first downsampling layer to obtain a third intermediate feature;
and carrying out channel adjustment and feature stacking processing on the third intermediate features by using a second feature processing layer to obtain the first image features.
According to an embodiment of the present disclosure, the processing the first image feature using the channel adjustment sub-model to obtain a second image feature and a third image feature includes:
downsampling the first image feature by using a second downsampling layer to obtain a fourth intermediate feature;
carrying out channel adjustment and feature extraction processing on the fourth intermediate feature by using a third feature processing layer to obtain the second image feature;
downsampling the second image feature by using a third downsampling layer to obtain a fifth intermediate feature;
And carrying out channel adjustment and feature extraction processing on the fifth intermediate feature by using a fourth feature processing layer to obtain the third image feature.
According to an embodiment of the present disclosure, the processing the first image feature, the second image feature, and the third image feature by using the first multi-scale sub-model, the second multi-scale sub-model, and the third multi-scale sub-model to obtain three multi-scale feature matrices includes:
processing the first image feature and the first transition feature by using the first multi-scale sub-model, and outputting a multi-scale feature matrix and a second transition feature;
processing the second image feature, the second transition feature and the third transition feature by using the second multi-scale sub-model, and outputting a multi-scale feature matrix, the first transition feature and a fourth transition feature;
and processing the third image feature and the fourth transition feature by using the third multi-scale submodel, and outputting the multi-scale feature matrix and the third transition feature.
According to an embodiment of the present disclosure, the processing the first image feature and the first transition feature using the first multi-scale sub-model, and outputting the multi-scale feature matrix and the second transition feature, includes:
Based on a second preset step length, carrying out channel adjustment and feature extraction processing on the first image feature and the first transition feature by using two second convolution normalization layers to obtain a first channel feature and a second channel feature;
performing feature layer expansion processing on the second channel feature by using the first feature expansion layer to obtain a third channel feature;
performing feature stacking processing on the first channel feature and the third channel feature by using a first feature stacking layer to obtain a fourth channel feature;
carrying out channel adjustment and feature extraction processing on the fourth channel feature by utilizing a fifth feature processing layer to obtain a fifth channel feature, wherein the fifth channel feature comprises two sub-channel features with preset channel numbers;
downsampling one of the sub-channel features by using a fourth downsampling layer to obtain the second transition feature;
carrying out convolution, normalization and feature superposition processing on the other sub-channel feature by using the first convolution superposition layer to obtain a sixth channel feature;
and based on the second preset step length, carrying out channel adjustment and feature extraction processing on the sixth channel features by using a third convolution normalization layer to obtain a first multi-scale feature matrix, wherein the first multi-scale feature matrix comprises a first preset number of grids and a target number of channels.
According to an embodiment of the disclosure, the processing the second image feature, the second transition feature, and the third transition feature using the second multi-scale sub-model, and outputting one multi-scale feature matrix, the first transition feature, and the fourth transition feature, includes:
based on a third preset step length, carrying out channel adjustment and feature extraction processing on the second image feature and the third transition feature by using two fourth convolution normalization layers to obtain a seventh channel feature and an eighth channel feature;
performing feature layer expansion processing on the eighth channel feature by using a second feature expansion layer to obtain a ninth channel feature;
performing feature stacking processing on the seventh channel feature and the ninth channel feature by using a second feature stacking layer to obtain a tenth channel feature;
performing channel adjustment and feature extraction processing on the tenth channel feature by using a sixth feature processing layer to obtain an eleventh channel feature;
performing feature stacking processing on the eleventh channel feature and the second transition feature by using a third feature stacking layer to obtain a twelfth channel feature;
performing channel adjustment and feature extraction processing on the twelfth channel features by using a seventh feature processing layer to obtain thirteenth channel features;
Downsampling the thirteenth channel feature by using a fifth downsampling layer to obtain the fourth transition feature;
carrying out convolution, normalization and feature superposition processing on the thirteenth channel feature by using a second convolution superposition layer to obtain a fourteenth channel feature;
and based on the third preset step length, performing channel adjustment and feature extraction processing on the fourteenth channel feature by using a fifth convolution normalization layer to obtain a second multi-scale feature matrix, wherein the second multi-scale feature matrix comprises a second preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the processing the third image feature and the fourth transition feature using the third multi-scale sub-model to output the multi-scale feature matrix and the third transition feature includes:
performing feature extraction, pooling and stacking treatment on the third image features by using a feature extraction stacking layer to obtain third transition features;
performing feature stacking processing on the third transition feature and the fourth transition feature by using a fourth feature stacking layer to obtain a fifteenth channel feature;
performing channel adjustment and feature extraction processing on the fifteenth channel feature by using an eighth feature processing layer to obtain a sixteenth channel feature;
Carrying out convolution, normalization and feature superposition processing on the sixteenth channel feature by using a third convolution superposition layer to obtain a seventeenth channel feature;
and based on a fourth preset step length, carrying out channel adjustment and feature extraction processing on the seventeenth channel feature by using a sixth convolution normalization layer to obtain a third multi-scale feature matrix, wherein the third multi-scale feature matrix comprises a third preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the performing, by using the labeled points based on the training video, a behavior classification process on the plurality of prediction areas corresponding to the training video to obtain a behavior classification result of the training video includes:
processing the positions of the mark points in each preset area based on a preset key point model to obtain the state parameters of each key point;
determining a behavior state of the object and a time and/or a number of times in the behavior state based on a plurality of the state parameters;
classifying the training video into a first classification sub-result when the behavior state belongs to one of the classification lists and the time or the number of times satisfies a preset condition;
Classifying the training video into a second classification sub-result when the behavior state belongs to one of the classification lists and the time or the number of times does not satisfy a preset condition;
and classifying the training video into a third classification sub-result when the behavior state does not belong to any one of the classification lists, wherein the behavior classification result comprises the first classification sub-result, the second classification sub-result and the third classification sub-result.
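For illustration only, the following sketch (not part of the original disclosure) shows one way the three classification sub-results could be derived from a behavior state, its duration and its count; the function name and the concrete thresholds standing in for the "preset condition" are assumptions.

```python
def classify_behavior(behavior_state: str,
                      duration_s: float,
                      count: int,
                      classification_list: set,
                      min_duration_s: float = 3.0,
                      min_count: int = 2) -> str:
    """Return the behavior classification sub-result for one training video.
    min_duration_s and min_count stand in for the preset condition and are
    illustrative assumptions only."""
    if behavior_state not in classification_list:
        return "third classification sub-result"   # state not in any classification list
    if duration_s >= min_duration_s or count >= min_count:
        return "first classification sub-result"   # listed state, condition satisfied
    return "second classification sub-result"      # listed state, condition not satisfied

# Example: smoking for 5 seconds is in the classification list and meets the condition.
# classify_behavior("smoking", 5.0, 1, {"smoking", "using a cell phone"})
```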
According to a second aspect of the present disclosure, there is provided an object behavior classification method, including:
acquiring a video to be classified, wherein the video to be classified comprises a plurality of time-sequence-related image frames to be classified;
inputting a plurality of image frames to be classified of the video to be classified into an object behavior classification model, and outputting a prediction behavior classification result, wherein the prediction behavior classification result represents the behavior gesture of the object under the condition that the object exists in the video to be classified.
According to an embodiment of the present disclosure, the object behavior classification method further includes:
determining target information corresponding to the preset behavior gesture from an information list under the condition that the predicted behavior classification result shows that the behavior gesture of the object belongs to the preset behavior gesture;
And displaying the target information to the object in a visual form.
According to a third aspect of the present disclosure, there is provided a travel safety detection method of a transport vehicle, including:
acquiring in-vehicle video of the transport vehicle in real time by using an image acquisition device of the transport vehicle under the condition that the transport vehicle runs, wherein the in-vehicle video comprises a plurality of in-vehicle images which are associated in time sequence;
transmitting a plurality of in-vehicle images of the in-vehicle video to a server, so that the server processes the in-vehicle video based on an object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result represents the behavior gesture of at least one object in the in-vehicle video;
determining first alarm information corresponding to the illegal behavior gesture from an alarm information list and transmitting the first alarm information to the transport vehicle under the condition that the in-vehicle behavior classification result shows that the behavior gesture of the object belongs to the illegal behavior gesture;
and displaying the first alarm information to the object in a visual form.
According to an embodiment of the present disclosure, the travel safety detection method of a transport vehicle further includes:
Collecting vehicle state parameters of the transport vehicle by using a sensor module, wherein the vehicle state parameters comprise at least one of the following: tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed and running track;
transmitting the vehicle state parameters to the server so that the server transmits second alarm information corresponding to the parameters to the transport vehicle when any one of the vehicle state parameters satisfies an alarm condition;
and displaying the second alarm information to the object in a visual form.
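As a hedged illustration of the server-side check described above (the disclosure does not specify it), the sketch below evaluates received vehicle state parameters against alarm conditions; the parameter names and threshold values are assumptions.

```python
from typing import Dict, List

# Illustrative alarm conditions; names and thresholds are assumptions, not
# values given in the disclosure.
ALARM_CONDITIONS = {
    "tire_pressure_kpa": lambda v: v < 180 or v > 300,
    "in_vehicle_temp_c": lambda v: v > 45,
    "vehicle_speed_kmh": lambda v: v > 100,
    "engine_speed_rpm":  lambda v: v > 5000,
}

def check_vehicle_state(params: Dict[str, float]) -> List[str]:
    """Return second-alarm messages for every parameter meeting an alarm condition."""
    alarms = []
    for name, value in params.items():
        condition = ALARM_CONDITIONS.get(name)
        if condition is not None and condition(value):
            alarms.append(f"second alarm information: {name}={value}")
    return alarms
```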
According to a fourth aspect of the present disclosure, there is provided a training apparatus of an object behavior classification model, including:
the first acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of training videos and classification labels, and the videos comprise a plurality of image frames which are associated in time sequence;
the multi-scale module is used for inputting the training video into an initial behavior classification model and outputting a plurality of multi-scale feature matrixes corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object;
The first classification module is used for performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent positions of different key points of the object in each image frame;
the loss module is used for inputting the classification result and the classification label corresponding to each training video into a loss function and outputting a loss result;
and the iteration module is used for iteratively adjusting the network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.
According to a fifth aspect of the present disclosure, there is provided an object behavior classification apparatus, comprising:
the second acquisition module is used for acquiring videos to be classified, wherein the videos to be classified comprise a plurality of time-sequence-related image frames to be classified;
the second classification module is used for inputting a plurality of image frames to be classified of the video to be classified into an object behavior classification model and outputting a prediction behavior classification result, wherein the prediction behavior classification result represents the behavior gesture of the object under the condition that the object exists in the video to be classified.
According to a sixth aspect of the present disclosure, there is provided a vehicle monitoring system comprising:
a server configured to:
processing an in-vehicle video based on an object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result represents the behavior gesture of at least one object in the in-vehicle video;
under the condition that the behavior classification result in the vehicle shows that the behavior gesture of the object belongs to the illegal behavior gesture, determining first alarm information corresponding to the illegal behavior gesture from an alarm information list and transmitting the first alarm information to an alarm device;
a transportation vehicle, said transportation vehicle comprising:
a vehicle body;
an image acquisition device configured to acquire the in-vehicle video of the transport vehicle in real time and transmit to the server in a case where the vehicle body is traveling, wherein the in-vehicle video includes a plurality of in-vehicle images associated in time series;
the alarm device is configured to display the first alarm information to the object in a visual form.
According to an embodiment of the present disclosure, the vehicle monitoring system further includes:
a sensor module configured to:
Collecting vehicle state parameters of the transport vehicle and transmitting the vehicle state parameters to the server, wherein the vehicle state parameters comprise at least one of the following: tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed and running track;
wherein the server is further configured to transmit second alarm information corresponding to any one of the vehicle state parameters to the alarm device if the parameter satisfies an alarm condition;
the alarm device is further configured to display the second alarm information to the subject in a visual form.
A seventh aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.
An eighth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.
A ninth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the embodiments of the present disclosure, the multi-scale feature matrices in the images are extracted through the initial behavior classification model so that the prediction areas where the object is located can be determined as completely as possible; the object behaviors in the prediction areas are classified by using the marking points of the key points; and the network parameters are iteratively adjusted based on the marking points of different key points of the object and the loss result determined from the classification results, so as to obtain an object behavior classification model that can be used for behavior classification.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a training method of an object behavior classification model or an object behavior classification method according to an embodiment of the disclosure;
FIG. 2 schematically illustrates a flow chart of a method of training an object behavior classification model according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a process flow diagram of an object behavior classification model according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a block diagram of the architecture of a CBM module according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of the ESCP1 module according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of the ESCPM module according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of the ESCP2 module according to embodiments of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a REPC module according to an embodiment of the present disclosure;
fig. 9 schematically shows a block diagram of the SPPCM module according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of the CBS module according to an embodiment of the present disclosure;
FIG. 11 schematically illustrates a face feature key point coordinate location schematic in accordance with an embodiment of the present disclosure;
FIG. 12 schematically illustrates a flow chart of an object behavior classification method according to an embodiment of the disclosure;
Fig. 13 schematically illustrates a flowchart of a travel safety detection method of a transport vehicle according to an embodiment of the present disclosure;
FIG. 14 schematically illustrates a block diagram of a training apparatus of an object behavior classification model according to an embodiment of the disclosure;
FIG. 15 schematically illustrates a block diagram of an object behavior classification apparatus according to an embodiment of the disclosure;
FIG. 16 schematically illustrates a block diagram of a vehicle monitoring system in accordance with an embodiment of the present disclosure; and
fig. 17 schematically illustrates a block diagram of an electronic device adapted to implement the above-described method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the related data (including but not limited to personal information of users) all comply with the provisions of relevant laws and regulations, necessary security measures have been taken, and public order and good customs are not violated.
The embodiments of the present disclosure provide a training method, a classification method, a detection method, a device, a system, equipment and a medium, wherein the training method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of training videos and classification labels, and the videos comprise a plurality of image frames which are associated in time sequence; inputting a training video into an initial behavior classification model, and outputting a plurality of multi-scale feature matrixes corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object; performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent positions of different key points of the object in each image frame; inputting a classification result and a classification label corresponding to each training video into a loss function, and outputting a loss result; and iteratively adjusting network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.
Fig. 1 schematically illustrates an application scenario diagram of a training method of an object behavior classification model or an object behavior classification method according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 according to this embodiment may include behavior classification of a driver when an escort vehicle of a bank performs an escort task. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the training method of the object behavior classification model or the object behavior classification method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the training apparatus of the object behavior classification model or the object behavior classification apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The training method of the object behavior classification model or the object behavior classification method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the training apparatus of the object behavior classification model or the object behavior classification apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flowchart of a method of training an object behavior classification model according to an embodiment of the disclosure.
As shown in fig. 2, the training method of the object behavior classification model of this embodiment includes operations S210 to S250.
In operation S210, a training set is acquired, wherein the training set includes a plurality of training videos and classification tags, the videos including a plurality of image frames that are associated in time sequence;
in operation S220, inputting the training video into the initial behavior classification model, and outputting a plurality of multi-scale feature matrices corresponding to the training video, wherein each multi-scale feature matrix includes a plurality of prediction regions including different key points of the object;
in operation S230, performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the mark points of the training video to obtain a behavior classification result of the training video, wherein the mark points represent positions of different key points of the object in each image frame;
in operation S240, a classification result and a classification label corresponding to each training video are input into a loss function, and a loss result is output;
In operation S250, network parameters of the initial behavior classification model are iteratively adjusted according to the loss result, generating a trained object behavior classification model.
According to embodiments of the present disclosure, a training video includes multiple frames of images, wherein 2n (n≥1) frames of images include objects, and the classification label characterizes the behavior of the object in the training video, such as smoking, using a cell phone, not wearing a seat belt, taking both hands off the steering wheel, drinking water, yawning, closing the eyes, and the like. The key points can be the eyes, mouth, hands, abdomen, etc. of the human body. The loss function may be a cross entropy function.
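As a minimal sketch of the loss step (not part of the original disclosure), assuming a PyTorch-style cross entropy loss as permitted by the description above; the tensor shapes and label values are illustrative only.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()               # the loss function may be a cross entropy function
logits = torch.randn(4, 8, requires_grad=True)  # 4 training videos, 8 behavior categories (illustrative)
labels = torch.tensor([0, 2, 5, 7])             # classification labels of the training videos
loss = criterion(logits, labels)                # the loss result
loss.backward()                                 # gradients used to iteratively adjust network parameters
```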
According to an embodiment of the present disclosure, the prediction region may refer to dividing the generated multi-scale feature matrix into different numbers of grids and enclosing the key points of an object in the grids by means of anchor frames, where the object may refer to a human body. It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the related data (including but not limited to personal information of users) all comply with the provisions of relevant laws and regulations, necessary security measures have been taken, and public order and good customs are not violated.
According to the embodiments of the present disclosure, each training video is input into the initial behavior classification model to obtain a plurality of multi-scale feature matrices corresponding to each training video, and behavior classification processing is performed on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain a behavior classification result of the training video, for example, determining that the object in the training video is in a drinking state. The classification result and the classification label corresponding to the training video are input into the loss function to calculate a loss result, and the network parameters of the initial behavior classification model are then iteratively adjusted according to the loss result to obtain the object behavior classification model.
According to the embodiments of the present disclosure, the multi-scale feature matrices in the images are extracted through the initial behavior classification model so that the prediction areas where the object is located can be determined as completely as possible; the object behaviors in the prediction areas are classified by using the marking points of the key points; and the network parameters are iteratively adjusted based on the marking points of different key points of the object and the loss result determined from the classification results, so as to obtain an object behavior classification model that can be used for behavior classification.
FIG. 3 schematically illustrates a process flow diagram of an object behavior classification model according to an embodiment of the disclosure.
As shown in fig. 3, in the case that the number of the multi-scale feature matrices is three, the training video is input to the initial behavior classification model, and a plurality of multi-scale feature matrices corresponding to the training video are output, including:
based on a first preset step length, carrying out channel adjustment and feature extraction processing on a plurality of image frames by utilizing a feature extraction sub-model to obtain first image features;
processing the first image feature by using the channel adjustment sub-model to obtain a second image feature and a third image feature;
and respectively processing the first image feature, the second image feature and the third image feature by using the first multi-scale sub-model, the second multi-scale sub-model and the third multi-scale sub-model to obtain three multi-scale feature matrixes.
According to an embodiment of the present disclosure, the first preset step size may be specifically set according to actual requirements, and may be at least one of 1, 2, 3, or a combination of a plurality of them, for example.
According to an embodiment of the present disclosure, an exemplary description is given with a 3-channel image whose frame length and width are both 640, that is, the image frame has a size of 640×640×3. Based on the first preset step length, channel adjustment and feature extraction processing are performed on the plurality of image frames by using the feature extraction sub-model. For example, in the case that the number of channels of the plurality of image frames is 3, the feature extraction sub-model can adjust the channels of the plurality of image frames to obtain a first image feature with 512 channels, and at the same time the feature extraction sub-model processes an image frame of 640×640 pixels into a first image feature with a spatial size of 80×80.
According to the embodiment of the disclosure, the channel adjustment sub-model further performs feature extraction and channel adjustment on the first image features of 80×80×512, so that the second image features of 40×40×1024 and the third image features of 20×20×1024 can be obtained.
According to an embodiment of the present disclosure, based on the first image feature of 80×80×512, the second image feature of 40×40×1024, and the third image feature of 20×20×1024, the above image features are processed using the first multi-scale submodel, the second multi-scale submodel, and the third multi-scale submodel, three multi-scale feature matrices may be obtained, wherein the sizes of the three multi-scale feature matrices may be 13×13×39, 26×26×39, 52×52×39, respectively.
According to an embodiment of the present disclosure, in the sizes of the above multi-scale feature matrices, 13×13, 26×26 and 52×52 respectively indicate that the image frame is divided into grids of 13×13, 26×26 and 52×52, and each grid cell corresponds to 3 anchor frames (i.e., prediction regions). When the center point of a key point is located in a certain anchor frame, that anchor frame is responsible for target detection, and the initial behavior classification model adjusts the position of the anchor frame by learning image features so that the position of the anchor frame finally approaches the real position of the key point, thereby obtaining the final predicted anchor frame and realizing accurate positioning and identification of the key point. 39 is the product of 3 and 13: 3 means that three anchor frames are preset for each grid cell, and 13 is the sum of 8, 4 and 1, where 8 is the number of behavior classification categories in the present disclosure, 4 is the four position parameters of the anchor frame, and 1 is the probability that a target is in the anchor frame, i.e., the probability is 1 when the center point of the key point is located in the anchor frame and 0 otherwise.
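The channel layout described above can be made concrete with the following sketch (not part of the original disclosure), which splits a G×G×39 multi-scale feature matrix into per-anchor box parameters, an objectness probability and 8 class scores; the tensor names and the use of zero tensors are illustrative assumptions.

```python
import torch

num_classes, num_anchors = 8, 3                      # 8 behavior categories, 3 anchor frames per grid cell
channels = num_anchors * (num_classes + 4 + 1)       # 3 * 13 = 39

for grid in (13, 26, 52):                            # the three grid resolutions in the text
    head = torch.zeros(1, grid, grid, channels)      # one multi-scale feature matrix (placeholder values)
    per_anchor = head.view(1, grid, grid, num_anchors, num_classes + 5)
    box_xywh   = per_anchor[..., 0:4]                # four anchor-frame position parameters
    objectness = per_anchor[..., 4:5]                # 1 if a key-point center lies in the anchor frame, else 0
    class_prob = per_anchor[..., 5:]                 # scores for the 8 behavior classification categories
```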
It should be noted that the dimensions of any of the above image features are exemplary, and may be adjusted according to actual needs, and the dimensions are not limiting on the scope of the disclosure.
According to an embodiment of the disclosure, referring to fig. 3, based on a first preset step size, channel adjustment and feature extraction processing are performed on a plurality of image frames by using a feature extraction sub-model, so as to obtain a first image feature, including:
performing channel adjustment and feature extraction processing on a plurality of image frames by using a plurality of first convolution normalization layers (i.e., the CBM1 modules in FIG. 3) to obtain a first intermediate feature, wherein each convolution normalization layer corresponds to a first preset step length;
performing channel adjustment and feature stacking processing on the first intermediate feature by using a first feature processing layer (i.e., the first ESCP1 module in FIG. 3) to obtain a second intermediate feature;
performing downsampling processing on the second intermediate feature by using a first downsampling layer (i.e., the first ESCPM module in FIG. 3) to obtain a third intermediate feature;
and performing channel adjustment and feature stacking processing on the third intermediate feature by using a second feature processing layer (i.e., the second ESCP1 module in FIG. 3) to obtain the first image feature.
According to an embodiment of the present disclosure, as shown in fig. 3, four first convolution normalization layers are illustrated, where the CBM modules in the feature extraction sub-model in fig. 3 are the first convolution normalization layers, the first preset step sizes of the four first convolution normalization layers are s=1, s=2, s=1 and s=2 in sequence, and the convolution kernel size is 3×3.
According to an embodiment of the present disclosure, 640×640×3 image frames are input to a first convolution normalization layer, 640×640×32 features are output, the features are input to a second first convolution normalization layer, 320×320×64 features are output, the features are input to a third first convolution normalization layer, 320×320×64 features are output, the features are input to a fourth first convolution normalization layer, and 160×160×128 first intermediate features are output.
According to an embodiment of the present disclosure, a 160×160×128 first intermediate feature is input to a first feature processing layer, that is, ESCP1 in the feature extraction sub-model in fig. 3, to perform channel adjustment and feature stacking processing, and a 160×160×256 second intermediate feature may be output. And performing downsampling processing on the second intermediate feature by using the first downsampling layer to obtain a third intermediate feature of 80×80×256.
According to an embodiment of the present disclosure, channel adjustment and feature stacking processing are performed on the third intermediate features of 80×80×256 using the second feature processing layer, resulting in the first image features of 80×80×512.
FIG. 4 schematically illustrates a block diagram of the architecture of a CBM module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 4, each of the first convolution normalization layer described above and the second, third, fourth and fifth convolution normalization layers described below is a CBM module comprising an ordinary convolution layer Conv, a batch normalization layer BN and a Mish activation function.
The convolution kernel sizes of the ordinary convolution layer include 1×1 and 3×3, where the 1×1 convolution kernel is used to adjust the number of channels of the feature map and the 3×3 convolution kernel is used for feature extraction. The imbalanced distribution of key points among the image frames in the training set would increase model training and inference cost; therefore, a batch normalization operation is adopted to balance the data distribution, which improves the convergence rate of the model and preserves the original expressive capability of the feature information during propagation. Equation (1) is the Mish activation function. This activation function has no upper bound, no lower bound and is non-monotonic, which not only improves the generalization capability of the model but also improves the nonlinear feature expression capability of the model.
Mish(x) = x·tanh(ln(1 + e^x))    (1)
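A minimal PyTorch sketch of such a Conv + BatchNorm + Mish block is given below (not part of the original disclosure), assuming standard torch.nn layers; torch.nn.Mish implements equation (1), while the class name, padding choice and example channel widths are taken from the shapes discussed above or are assumptions.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Conv + BatchNorm + Mish block, following the CBM description above.
    k is 1 (channel adjustment) or 3 (feature extraction); padding keeps the
    spatial size for s=1 and halves it for s=2 with k=3."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()   # Mish(x) = x * tanh(ln(1 + exp(x))), i.e. equation (1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: four stride-(1, 2, 1, 2) CBM layers at the start of the feature
# extraction sub-model map 640x640x3 to 160x160x128, as described above.
x = torch.randn(1, 3, 640, 640)
for c_in, c_out, s in [(3, 32, 1), (32, 64, 2), (64, 64, 1), (64, 128, 2)]:
    x = CBM(c_in, c_out, k=3, s=s)(x)
print(x.shape)   # torch.Size([1, 128, 160, 160])
```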
FIG. 5 schematically illustrates a block diagram of the ESCP1 module according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, any one of the above first feature processing layer, second feature processing layer, the following third feature processing layer, and fourth feature processing layer may be an ESCP1 module structure shown in fig. 5.
In accordance with embodiments of the present disclosure, ESCP1 is an efficient feature extraction module that can effectively control the longest and shortest gradient paths so that the model learns more effective features, thereby increasing the robustness of the model. The ESCP1 module has two feature transfer paths. The first path performs channel adjustment through a CBM module with a convolution kernel size of 1×1 and a step size of 1. The second path first performs channel adjustment with a CBM module with a convolution kernel size of 1×1 and a step size of 1 (the CBM module structure refers to FIG. 4) and then performs feature extraction with a CBM module with a convolution kernel size of 3×3 and a step size of 1; the feature stacking layer CONC then stacks the multi-scale features transferred from each path, that is, the number of channels of the stacked feature layer is the sum of the channel numbers of the paths, which enriches the local feature information in the feature layer. Finally, a CBM module with a convolution kernel size of 3×3 and a step size of 1 performs channel adjustment on the stacked feature layer, which is then used as the input feature of the next layer.
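Under the same assumptions, a sketch of the two-path ESCP1 block could look as follows, reusing the CBM class from the previous sketch; the internal channel widths and the use of a single 3×3 CBM in the second path are assumptions, since the text does not fix them.

```python
class ESCP1(nn.Module):
    """Sketch of the two-path ESCP1 block (reuses the CBM class above).
    Path 1: 1x1 CBM; path 2: 1x1 CBM followed by a 3x3 CBM; the paths are
    concatenated (CONC) and a final 3x3 CBM adjusts the channels."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2                               # assumed internal width
        self.path1 = CBM(c_in, c_mid, k=1, s=1)
        self.path2 = nn.Sequential(CBM(c_in, c_mid, k=1, s=1),
                                   CBM(c_mid, c_mid, k=3, s=1))
        self.out = CBM(2 * c_mid, c_out, k=3, s=1)      # channel adjustment after stacking

    def forward(self, x):
        return self.out(torch.cat([self.path1(x), self.path2(x)], dim=1))

# Example from FIG. 3: the first ESCP1 maps 160x160x128 to 160x160x256.
# y = ESCP1(128, 256)(torch.randn(1, 128, 160, 160))
```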
FIG. 6 schematically illustrates a block diagram of the ESCPM module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, any one of the first downsampling layer described above and the second, third, fourth and fifth downsampling layers described below may have the ESCPM module structure shown in fig. 6.
According to an embodiment of the present disclosure, the ESCPM module implements a downsampling operation, that is, the length and width of the feature layer become 1/2 of their previous values. The ESCPM module mainly comprises two paths. The first path first uses a MaxPool pooling module with a pooling kernel size of 2×2 to downsample the feature layer, and then uses a CBM module with a convolution kernel size of 1×1 and a step size of 1 to adjust the channels of the feature layer. The second path first uses a CBM module with a convolution kernel size of 1×1 and a step size of 1 to adjust the channels of the feature layer, and then uses a CBM module with a convolution kernel size of 3×3 and a step size of 2 to downsample the feature layer. Finally, the feature layers on the two paths are stacked by the feature stacking layer CONC; the length and width of the resulting output feature layer become 1/2 of those of the input feature layer and the number of channels becomes 2 times that of the input feature layer, so the ESCPM module realizes the downsampling operation.
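A corresponding sketch of the ESCPM downsampling block is given below (not part of the original disclosure), again reusing the CBM class above; the per-path channel widths are not fixed by the description, so the output channel count is left as a parameter.

```python
class ESCPM(nn.Module):
    """Downsampling block sketch following the two-path ESCPM description:
    path 1: 2x2 max-pool then 1x1 CBM; path 2: 1x1 CBM then 3x3 stride-2 CBM.
    The two paths are concatenated, halving H and W; c_out is a parameter
    because the per-path channel widths are assumptions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.p1_cbm = CBM(c_in, c_half, k=1, s=1)
        self.p2_cbm1 = CBM(c_in, c_half, k=1, s=1)
        self.p2_cbm2 = CBM(c_half, c_half, k=3, s=2)

    def forward(self, x):
        p1 = self.p1_cbm(self.pool(x))
        p2 = self.p2_cbm2(self.p2_cbm1(x))
        return torch.cat([p1, p2], dim=1)   # spatial size halved, c_out channels

# With c_out = 2 * c_in the channel count doubles as described above; the usage
# in FIG. 3 (160x160x256 -> 80x80x256) corresponds to c_out = c_in.
```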
According to an embodiment of the present disclosure, referring to fig. 3, processing the first image feature using the channel adjustment sub-model to obtain a second image feature and a third image feature includes:
performing downsampling processing on the first image feature by using a second downsampling layer (i.e., the second ESCPM module in FIG. 3) to obtain a fourth intermediate feature;
performing channel adjustment and feature extraction processing on the fourth intermediate feature by using a third feature processing layer (i.e., the third ESCP1 module in FIG. 3) to obtain a second image feature;
performing downsampling processing on the second image feature by using a third downsampling layer (i.e., the third ESCPM module in FIG. 3) to obtain a fifth intermediate feature;
and performing channel adjustment and feature extraction processing on the fifth intermediate feature by using a fourth feature processing layer (i.e., the fourth ESCP1 module in FIG. 3) to obtain a third image feature.
According to the embodiment of the disclosure, the first image feature of 80×80×512 is input into the channel adjustment sub-model, the second downsampling layer performs downsampling processing on the first image feature, and then outputs a fourth intermediate feature of 40×40×512, the third feature processing layer performs channel adjustment and feature extraction processing on the fourth intermediate feature to obtain a second image feature of 40×40×1024, the third downsampling layer performs downsampling processing on the second image feature to obtain a fifth intermediate feature of 20×20×1024, and the fourth feature processing layer performs channel adjustment and feature extraction processing on the fifth intermediate feature to obtain a third image feature of 20×20×1024.
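Using the illustrative ESCP1 and ESCPM sketches above, the shape bookkeeping of this paragraph can be checked with a few lines of PyTorch-style code. The channel widths passed to the modules are chosen only so that the tensor shapes match the ones quoted here; they are not taken from the patent.

```python
x1 = torch.randn(1, 512, 80, 80)    # first image feature: 80x80x512
x4 = ESCPM(512, 512)(x1)            # fourth intermediate feature: 40x40x512
x2 = ESCP1(512, 1024)(x4)           # second image feature: 40x40x1024
x5 = ESCPM(1024, 1024)(x2)          # fifth intermediate feature: 20x20x1024
x3 = ESCP1(1024, 1024)(x5)          # third image feature: 20x20x1024
print(x2.shape, x3.shape)           # torch.Size([1, 1024, 40, 40]) torch.Size([1, 1024, 20, 20])
```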
According to an embodiment of the present disclosure, referring to fig. 3, a first image feature, a second image feature, and a third image feature are respectively processed using a first multi-scale sub-model, a second multi-scale sub-model, and a third multi-scale sub-model, to obtain three multi-scale feature matrices, including:
processing the first image feature and the first transition feature by using the first multi-scale sub-model, and outputting a multi-scale feature matrix and a second transition feature;
processing the second image feature, the second transition feature and the third transition feature by using the second multi-scale sub-model, and outputting a multi-scale feature matrix, a first transition feature and a fourth transition feature;
and processing the third image feature and the fourth transition feature by using the third multi-scale submodel, and outputting a multi-scale feature matrix and the third transition feature.
According to the embodiment of the disclosure, the first multi-scale sub-model, the second multi-scale sub-model and the third multi-scale sub-model operate in parallel. The first multi-scale sub-model generates a multi-scale feature matrix from the first image feature output by the feature extraction sub-model and the first transition feature output by the second multi-scale sub-model. The second multi-scale sub-model generates a multi-scale feature matrix from the second transition feature output by the first multi-scale sub-model, the second image feature output by the channel adjustment sub-model, and the third transition feature output by the third multi-scale sub-model. The third multi-scale sub-model generates a multi-scale feature matrix from the third image feature output by the channel adjustment sub-model and the fourth transition feature output by the second multi-scale sub-model.
According to an embodiment of the present disclosure, referring to fig. 3, processing a first image feature and a first transition feature using a first multi-scale sub-model, outputting a multi-scale feature matrix and a second transition feature, comprising:
performing channel adjustment and feature extraction processing on the first image feature and the first transition feature respectively by using two second convolution normalization layers (i.e., CBM₂ in FIG. 3) based on the second preset step size, to obtain a first channel feature and a second channel feature;
performing feature layer expansion processing on the second channel feature by using a first feature expansion layer (i.e., UPS₁ in FIG. 3) to obtain a third channel feature;
performing feature stacking processing on the first channel feature and the third channel feature by using a first feature stacking layer (i.e., CONC₁ in FIG. 3) to obtain a fourth channel feature;
performing channel adjustment and feature extraction processing on the fourth channel feature by using a fifth feature processing layer (i.e., ESCP2₁ in FIG. 3) to obtain a fifth channel feature, wherein the fifth channel feature comprises two sub-channel features with preset channel numbers;
downsampling one sub-channel feature by using a fourth downsampling layer (i.e., ESCPM₄ in FIG. 3) to obtain a second transition feature;
performing convolution, normalization and feature superposition processing on the other sub-channel feature by using a first convolution superposition layer (i.e., REPC₁ in FIG. 3) to obtain a sixth channel feature;
and performing channel adjustment and feature extraction processing on the sixth channel feature by using a third convolution normalization layer (i.e., CBS₁ in FIG. 3) based on the second preset step size, to obtain a first multi-scale feature matrix, wherein the first multi-scale feature matrix comprises a first preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the second preset step size may be specifically set according to the actual situation, for example, the second preset step size s=1.
According to an embodiment of the present disclosure, one second convolution normalization layer processes the 80×80×512 first image features, outputting the 80×80×128 first channel features, and the other second convolution normalization layer processes the 40×40×256 first transition features, outputting the 40×40×128 second channel features.
According to the embodiment of the disclosure, the first feature expansion layer is utilized to perform feature layer expansion processing on the second channel feature to obtain a third channel feature of 80×80×128, and the first feature stacking layer is utilized to perform feature stacking processing on the first channel feature and the third channel feature to obtain a fourth channel feature of 80×80×256. The fifth feature processing layer performs channel adjustment and feature extraction processing on the fourth channel feature to obtain a fifth channel feature of 80×80×256, wherein the fifth channel feature comprises two sub-channel features with preset channel numbers, each sub-channel feature being 80×80×128.
According to an embodiment of the present disclosure, a sub-channel feature is downsampled using a fourth downsampling layer to obtain a 40×40×256 second transition feature. And carrying out convolution, normalization and feature superposition processing on the other sub-channel feature by using the first convolution superposition layer to obtain a sixth channel feature, and carrying out channel adjustment and feature extraction processing on the sixth channel feature by using the third convolution normalization layer based on the second preset step length to obtain a first multi-scale feature matrix.
Fig. 7 schematically illustrates a block diagram of the ESCP2 module according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, any one of the above fifth feature processing layer, the following sixth feature processing layer, the seventh feature processing layer, and the eighth feature processing layer includes the ESCP2 module structure shown in FIG. 7. Similar to the ESCP1 module, the ESCP2 module mainly adopts CBM modules with convolution kernel sizes of 1×1 and 3×3 and a step size of 1 for channel adjustment and feature extraction, respectively. Unlike ESCP1, the ESCP2 module leads out a branch after each CBM module with a 3×3 convolution kernel and stacks these branches, which not only improves the transmission efficiency of features in the network, but also better enriches deep local features.
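A minimal sketch of this ELAN-like structure is given below, again reusing the illustrative CBM class from the ESCP1 sketch. The number of chained 3×3 CBM modules, the per-branch channel width, and the 1×1 kernel of the final fusion are assumptions; the description only fixes the pattern of leading a branch out after every 3×3 CBM and stacking the branches.

```python
import torch
import torch.nn as nn


class ESCP2(nn.Module):
    """ELAN-like block sketched from the ESCP2 description."""
    def __init__(self, c_in, c_out, n_blocks=2, c_hidden=None):
        super().__init__()
        c_hidden = c_hidden or c_out // 2                      # assumed
        self.shortcut = CBM(c_in, c_hidden, k=1, s=1)          # parallel 1x1 path
        self.adjust = CBM(c_in, c_hidden, k=1, s=1)            # 1x1 channel adjustment
        self.blocks = nn.ModuleList(
            CBM(c_hidden, c_hidden, k=3, s=1) for _ in range(n_blocks)
        )
        self.fuse = CBM((n_blocks + 2) * c_hidden, c_out, k=1, s=1)  # assumed 1x1 fusion

    def forward(self, x):
        taps = [self.shortcut(x), self.adjust(x)]
        y = taps[-1]
        for block in self.blocks:
            y = block(y)
            taps.append(y)                        # branch led out after each 3x3 CBM
        return self.fuse(torch.cat(taps, dim=1))  # stacking operation
```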
Fig. 8 schematically shows a block diagram of the REPC module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the above first convolution superposition layer, the following second convolution superposition layer, and the third convolution superposition layer each include a REPC module as shown in FIG. 8, which is mainly composed of a Conv normal convolution operation, BN batch normalization, and an Add weighting operation. The REPC module divides the input features into three paths for transmission: the first path performs feature extraction with a normal convolution with a 3×3 convolution kernel, the second path performs feature smoothing with a normal convolution with a 1×1 convolution kernel, and the last path applies batch normalization directly to the input features. After the features on the three paths are processed, they are fused by the Add weighting operation; the number of channels of the features remains unchanged after the Add weighting operation, and the features are superimposed on one another, so that the output features processed by the REPC module carry more accurate target positioning information.
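The following sketch shows a RepVGG-style reading of this three-path structure. Whether each convolution path has its own batch-normalization layer, and whether the Add operation is a plain (unweighted) sum, are assumptions; the description only fixes the three paths and the channel-preserving additive fusion.

```python
import torch.nn as nn


class REPC(nn.Module):
    """RepVGG-style three-path block sketched from the REPC description."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, 1, padding=1, bias=False)  # feature extraction
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, 1, padding=0, bias=False)  # feature smoothing
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn_identity = nn.BatchNorm2d(channels)   # direct batch normalization of the input

    def forward(self, x):
        # Add weighting: element-wise superposition, channel count unchanged
        return self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn_identity(x)
```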
According to an embodiment of the present disclosure, referring to fig. 3, processing the second image feature, the second transition feature, and the third transition feature using the second multi-scale sub-model, outputting a multi-scale feature matrix, the first transition feature, and the fourth transition feature, includes:
performing channel adjustment and feature extraction processing on the second image feature and the third transition feature respectively by using two fourth convolution normalization layers (i.e., CBM₄ in FIG. 3) based on the third preset step size, to obtain a seventh channel feature and an eighth channel feature;
performing feature layer expansion processing on the eighth channel feature by using a second feature expansion layer (i.e., UPS₂ in FIG. 3) to obtain a ninth channel feature;
performing feature stacking processing on the seventh channel feature and the ninth channel feature by using a second feature stacking layer (i.e., CONC₂ in FIG. 3) to obtain a tenth channel feature;
performing channel adjustment and feature extraction processing on the tenth channel feature by using a sixth feature processing layer (i.e., ESCP2₆ in FIG. 3) to obtain an eleventh channel feature;
performing feature stacking processing on the eleventh channel feature and the second transition feature by using a third feature stacking layer (i.e., CONC₃ in FIG. 3) to obtain a twelfth channel feature;
performing channel adjustment and feature extraction processing on the twelfth channel feature by using a seventh feature processing layer (i.e., ESCP2₇ in FIG. 3) to obtain a thirteenth channel feature;
downsampling the thirteenth channel feature by using a fifth downsampling layer (i.e., ESCPM₅ in FIG. 3) to obtain a fourth transition feature;
performing convolution, normalization and feature superposition processing on the thirteenth channel feature by using a second convolution superposition layer (i.e., REPC₂ in FIG. 3) to obtain a fourteenth channel feature;
and performing channel adjustment and feature extraction processing on the fourteenth channel feature by using a fifth convolution normalization layer (i.e., CBS₅ in FIG. 3) based on the third preset step size, to obtain a second multi-scale feature matrix, wherein the second multi-scale feature matrix comprises a second preset number of grids and a target number of channels.
According to the embodiment of the present disclosure, the third preset step size may also be specifically set according to actual situations, and the step size s=1 is selected as the third preset step size in the present disclosure.
According to the embodiment of the disclosure, based on a third preset step size s=1, one fourth convolution normalization layer processes the second image feature of 40×40×1024 output by the channel adjustment sub-model to obtain a seventh channel feature of 40×40×256, and the other fourth convolution normalization layer processes the third transition feature of 20×20×512 to obtain an eighth channel feature of 20×20×256; and performing feature layer expansion processing on the eighth channel feature by using the second feature expansion layer to obtain a ninth channel feature of 40 multiplied by 256.
According to an embodiment of the present disclosure, performing feature stacking processing on the seventh channel feature and the ninth channel feature by using the second feature stacking layer to obtain a tenth channel feature of 40×40×512; performing channel adjustment and feature extraction processing on the tenth channel feature by using a sixth feature processing layer to obtain an eleventh channel feature of 40×40×256, namely a first transition feature; and performing feature stacking processing on the eleventh channel feature and the second transition feature by using the third feature stacking layer to obtain a twelfth channel feature of 40×40×512.
According to an embodiment of the present disclosure, a thirteenth channel feature of 40×40×256 is obtained by performing channel adjustment and feature extraction processing on a twelfth channel feature using a seventh feature processing layer; downsampling the thirteenth channel feature by using a fifth downsampling layer to obtain a fourth transition feature of 20×20×512; carrying out convolution, normalization and feature superposition processing on the thirteenth channel feature by using a second convolution superposition layer to obtain a fourteenth channel feature; and based on a third preset step length, carrying out channel adjustment and feature extraction processing on the fourteenth channel feature by utilizing a fifth convolution normalization layer to obtain a second multi-scale feature matrix, wherein the second multi-scale feature matrix comprises a second preset number of grids and a target number of channels.
According to the embodiment of the disclosure, the first feature expansion layer and the second feature expansion layer upsample by means of a UPS module based on the nearest-neighbor interpolation algorithm, which expands the feature layer so that its length and width become 2 times the original values while the number of channels remains unchanged.
According to an embodiment of the present disclosure, referring to fig. 3, processing a third image feature and a fourth transition feature using a third multi-scale sub-model, outputting a multi-scale feature matrix and a third transition feature, comprising:
performing feature extraction, pooling and stacking processing on the third image feature by using a feature extraction stacking layer (i.e., SPPCM in FIG. 3) to obtain a third transition feature;
performing feature stacking processing on the third transition feature and the fourth transition feature by using a fourth feature stacking layer (i.e., CONC₄ in FIG. 3) to obtain a fifteenth channel feature;
performing channel adjustment and feature extraction processing on the fifteenth channel feature by using an eighth feature processing layer (i.e., ESCP2₈ in FIG. 3) to obtain a sixteenth channel feature;
performing convolution, normalization and feature superposition processing on the sixteenth channel feature by using a third convolution superposition layer (i.e., REPC₃ in FIG. 3) to obtain a seventeenth channel feature;
and performing channel adjustment and feature extraction processing on the seventeenth channel feature by using a sixth convolution normalization layer (i.e., CBS₆ in FIG. 3) based on the fourth preset step size, to obtain a third multi-scale feature matrix, wherein the third multi-scale feature matrix comprises a third preset number of grids and a target number of channels.
According to the embodiment of the present disclosure, the fourth preset step size may also be specifically set according to actual situations, and the present disclosure selects step size s=1 as the fourth preset step size.
According to an embodiment of the present disclosure, feature extraction, pooling, and stacking are performed on the third image feature using the feature extraction stacking layer, resulting in a third transition feature of 20×20×512; and feature stacking processing is performed on the third transition feature and the fourth transition feature using the fourth feature stacking layer to obtain a fifteenth channel feature of 20×20×1024.
According to the embodiment of the disclosure, a sixteenth channel feature is obtained by performing channel adjustment and feature extraction processing on the fifteenth channel feature by using an eighth feature processing layer; carrying out convolution, normalization and feature superposition processing on the sixteenth channel feature by using a third convolution superposition layer to obtain a seventeenth channel feature; and based on a fourth preset step length s=1, performing channel adjustment and feature extraction processing on seventeenth channel features by using a sixth convolution normalization layer to obtain a third multi-scale feature matrix, wherein the third multi-scale feature matrix comprises a third preset number of grids and a target number of channels.
Fig. 9 schematically shows a block diagram of the SPPCM module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the feature extraction stacking layer includes an SPPCM module as shown in FIG. 9 for increasing the receptive field of the model so that the object behavior classification model adapts to images of different resolutions. The SPPCM module forms two paths. In the first path, CBM modules with convolution kernel sizes of 1×1 and 3×3 and a step size of 1 respectively perform channel adjustment and feature extraction, and pooling operations with pooling kernel sizes of 5×5, 9×9, 13×13 and 1×1 are then adopted to increase the receptive field of the model for multi-scale targets, so that the model is more robust to multi-scale targets. In the second path, the input features are channel-adjusted by a CBM module with a convolution kernel size of 1×1 and a step size of 1. The feature layers on the two paths are then stacked through the CONC module, and the stacked feature layer is processed by a CBM module and used as the input feature of the next layer.
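A sketch of this SPP-style block is shown below, reusing the illustrative CBM class. The stride-1, "same"-padded max pooling, the treatment of the 1×1 pooling as an identity branch, and the per-path channel widths are assumptions made only so the sketch runs; they are not taken from the patent.

```python
import torch
import torch.nn as nn


class SPPCM(nn.Module):
    """SPP-style block sketched from the SPPCM description."""
    def __init__(self, c_in, c_out, c_hidden=None):
        super().__init__()
        c_hidden = c_hidden or c_out // 2                    # assumed
        # path 1: 1x1 then 3x3 CBM, followed by parallel max pooling
        self.pre = nn.Sequential(CBM(c_in, c_hidden, k=1, s=1), CBM(c_hidden, c_hidden, k=3, s=1))
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13)
        )
        # path 2: 1x1 CBM channel adjustment of the input
        self.shortcut = CBM(c_in, c_hidden, k=1, s=1)
        self.fuse = CBM(5 * c_hidden, c_out, k=1, s=1)        # CONC then CBM; kernel size assumed

    def forward(self, x):
        y = self.pre(x)
        pooled = [y] + [pool(y) for pool in self.pools]       # the 1x1 "pooling" is kept as identity
        return self.fuse(torch.cat(pooled + [self.shortcut(x)], dim=1))
```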
Fig. 10 schematically illustrates a block diagram of the structure of a CBS module according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the sixth convolution normalization layer includes a CBS module as shown in fig. 10, which consists of Conv normal convolution operations, batch normalization operations, and a SiLU activation function.
Unlike the CBM module, the CBS module smooths the output features of the three multi-scale feature layers using a SiLU activation function. The SiLU activation function is shown in equation (2), i.e., SiLU(x) = x·σ(x), where σ is the sigmoid function.
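For completeness, a sketch of the CBS module matching this description (Conv + BatchNorm + SiLU) is shown below; as with the other sketches, the PyTorch framing and parameterization are assumptions.

```python
import torch.nn as nn


class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, matching the CBS module description (FIG. 10)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()    # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```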
Fig. 11 schematically illustrates a face feature key point coordinate position diagram according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, performing behavior classification processing on a plurality of prediction areas corresponding to a training video based on a mark point of the training video to obtain a behavior classification result of the training video, including:
processing the positions of the marking points in each preset area based on a preset key point model to obtain state parameters of each key point;
determining a behavior state of the object and a time and/or a number of times in the behavior state based on the plurality of state parameters;
classifying the training video into a first classification sub-result under the condition that the behavior state belongs to one of the classification lists and the time or the times meet a preset condition;
classifying the training video into a second classification sub-result under the condition that the behavior state belongs to one of the classification lists and the time or the times does not meet the preset condition;
in the event that the behavioral state does not belong to any of the classification lists, classifying the training video into a third classification sub-result, wherein the behavioral classification result includes a first classification sub-result, a second classification sub-result, and a third classification sub-result.
According to embodiments of the present disclosure, the preset keypoint model may comprise an eye model. The coordinate positions of the eye feature key points determined by the eye model are used for calculating the state parameters of the eyes, such as the actual aspect ratio, so that the state of the eyes, such as the open state or the closed state, is judged according to the preset aspect ratio threshold. Finally, the proportion of eye closure time or the closure times between the image frames are calculated, so that the behavior of the object is classified. As shown in fig. 11, the key point coordinate positions of the face features are shown.
According to an embodiment of the present disclosure, taking the left eye as an example, mathematical modeling of the eye yields the eye model shown in equation (3).
Here, p₃₈, p₄₂, p₃₉, p₄₁, p₃₇ and p₄₀ are the left-eye key point coordinate positions in FIG. 11. The eye value remains within a certain range when the subject's eyes are open and approaches 0 when the subject's eyes are closed. Therefore, when eye falls below a preset aspect ratio threshold, for example 0.3, the eye is in a closed state. When eye decreases from a certain value to the preset aspect ratio threshold and then rapidly increases above it, this may be counted as one blink, i.e., one closure.
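Since equation (3) itself is not reproduced in this text, the sketch below uses the conventional eye-aspect-ratio formulation built from the same six key points; the exact form of the patent's equation may differ. The function names and the frame-wise closure counter are illustrative.

```python
import numpy as np

EYE_AR_THRESHOLD = 0.3   # preset aspect-ratio threshold from the description


def eye_aspect_ratio(p37, p38, p39, p40, p41, p42):
    """Left-eye aspect ratio from the FIG. 11 key points (conventional EAR formulation)."""
    p37, p38, p39, p40, p41, p42 = map(np.asarray, (p37, p38, p39, p40, p41, p42))
    vertical = np.linalg.norm(p38 - p42) + np.linalg.norm(p39 - p41)
    horizontal = np.linalg.norm(p37 - p40)
    return vertical / (2.0 * horizontal)


def count_closures(ear_sequence, threshold=EYE_AR_THRESHOLD):
    """Count closures: each drop below the threshold followed by a rise above it is one blink."""
    closures, below = 0, False
    for ear in ear_sequence:
        if ear < threshold:
            below = True
        elif below:
            closures += 1
            below = False
    return closures
```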
According to an embodiment of the present disclosure, let t₁ be the moment at which the eye is in the 80% open state while closing, t₂ the moment at which the eye is in the 20% open state while closing, t₃ the moment at which the eye is in the 20% open state while opening, and t₄ the moment at which the eye is in the 80% open state while opening. The percentage of time that the eyes are in the closed state within a certain time period is defined as t, and the time the object spends in the behavior state is judged through t, as shown in equation (4).
According to an embodiment of the present disclosure, if t is greater than or equal to 0.1, the closing time of the object is deemed to exceed the preset condition; if t < 0.1, the closing time of the object is deemed not to exceed the preset condition. The mathematical models of the left eye and the right eye are similar.
According to an embodiment of the present disclosure, when the behavior state of the eyes of the subject belongs to the closed state in the classification list, if the closing time exceeds a preset time threshold (e.g., 0.1 seconds) or the number of closures exceeds a per-unit-time threshold (e.g., 10 times per minute), the training video of the subject is classified as a first classification sub-result indicating that the subject is in a dozing state. If the behavior state of the eyes of the subject belongs to the closed state in the classification list but neither the closing time nor the number of closures exceeds the preset threshold, the training video of the subject may be classified as a second classification sub-result indicating that the subject occasionally closes its eyes but is not dozing. In the case that the behavior state does not belong to any of the classification lists, the training video is classified into a third classification sub-result, which indicates that the behavior state of the object is good.
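The decision logic above can be summarized in a short sketch. Equation (4) is not reproduced in this text, so closed_time_ratio below uses the conventional P80 (PERCLOS-style) combination of the four timestamps; the function names, the boolean flag for membership in the classification list, and the way the two thresholds are combined are illustrative assumptions.

```python
def closed_time_ratio(t1, t2, t3, t4):
    """Percentage of time the eyes are closed, from the timestamps t1..t4 described above
    (conventional P80 formulation; the patent's equation (4) may differ)."""
    return (t3 - t2) / (t4 - t1)


def classify_eye_behavior(closed_state_detected, t_ratio, closures_per_minute,
                          time_threshold=0.1, closure_threshold=10):
    """Illustrative mapping onto the three classification sub-results."""
    if not closed_state_detected:
        return "third classification sub-result"    # behavior state not in the classification list
    if t_ratio >= time_threshold or closures_per_minute > closure_threshold:
        return "first classification sub-result"    # dozing state
    return "second classification sub-result"       # occasional eye closure, not dozing
```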
According to an embodiment of the disclosure, the preset key point model may further include a mouth model, and the mouth actual aspect ratio is calculated according to the coordinate positions of the mouth feature key points determined by the mouth model, so that the mouth is judged to be in an open state or a closed state according to a preset aspect ratio threshold value. Finally, the mouth closing time proportion between each image frame is calculated and compared with the yawning threshold value, so that the behavior state of the object is classified. Wherein the mouth model is shown in formula (5).
In formula (5), p₅₁, p₅₉, p₅₃, p₅₇, p₅₅ and p₄₉ are the mouth key point coordinate positions in FIG. 11. If mouth is greater than or equal to 0.75, the object is determined to be yawning, and the yawning count is incremented by 1. In yawning detection, if the number or duration of yawns of the object within a preset time period (for example, 30 s) is detected to exceed a preset threshold (for example, 2 yawns, or a time threshold of 15 s), a classification result indicating that the object is in a fatigue state is determined.
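As with the eye model, equation (5) is not reproduced here, so the sketch below uses the conventional mouth-aspect-ratio formulation over the same six key points together with the example thresholds from this paragraph; the exact equation and the function names are assumptions.

```python
import numpy as np

MOUTH_AR_THRESHOLD = 0.75   # yawning threshold from the description


def mouth_aspect_ratio(p49, p51, p53, p55, p57, p59):
    """Mouth aspect ratio from the FIG. 11 key points (conventional MAR formulation)."""
    p49, p51, p53, p55, p57, p59 = map(np.asarray, (p49, p51, p53, p55, p57, p59))
    vertical = np.linalg.norm(p51 - p59) + np.linalg.norm(p53 - p57)
    horizontal = np.linalg.norm(p49 - p55)
    return vertical / (2.0 * horizontal)


def is_fatigued(yawn_count, yawn_time, count_threshold=2, time_threshold=15.0):
    """Fatigue decision over a preset window (e.g., 30 s) using the example thresholds above."""
    return yawn_count >= count_threshold or yawn_time >= time_threshold
```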
Fig. 12 schematically illustrates a flowchart of an object behavior classification method according to an embodiment of the disclosure.
As shown in fig. 12, the object behavior classification method of the embodiment includes operations S1210 to S1220.
In operation S1210, a video to be classified is acquired, wherein the video to be classified includes a plurality of image frames to be classified that are associated in time sequence;
In operation S1220, a plurality of image frames to be classified of the video to be classified are input to the object behavior classification model, and a prediction behavior classification result is output, wherein the prediction behavior classification result characterizes a behavior gesture of the object in the case that the object exists in the video to be classified.
According to embodiments of the present disclosure, the video to be classified may be acquired by an image acquisition device, such as a video camera, or the like.
According to the embodiment of the disclosure, the classification of the behavior gesture of the object in the video to be classified can be realized by inputting the video to be classified into the object behavior classification model.
According to the embodiment of the disclosure, the multi-scale feature matrix in the image frame to be classified is extracted through the object behavior classification model, so that the prediction area where the object is located can be determined as much as possible, the object behaviors in the prediction area are classified by using the mark points of the key points, and the classification result of the behavior gesture of the object is obtained.
According to an embodiment of the present disclosure, the object behavior classification method further includes:
determining target information corresponding to the preset behavior gesture from an information list under the condition that the predicted behavior classification result shows that the behavior gesture of the object belongs to the preset behavior gesture;
the target information is presented to the subject in a visual form.
According to embodiments of the present disclosure, the preset behavior gesture may include smoking, using a cell phone, not wearing a seat belt, taking both hands off the steering wheel, drinking water, yawning, closing the eyes, and the like.
According to the embodiment of the disclosure, information corresponding to different categories may be stored in advance in the information list; for example, for smoking behavior, "please do not smoke" may be stored correspondingly, and for yawning and eye-closing behavior, "you are in a fatigue state, please do not drive while fatigued" may be stored correspondingly. It should be noted that the information in the information list may be specifically set according to actual requirements, and the foregoing is described only as an example.
According to the embodiment of the disclosure, in the case that the object has the above behavior, corresponding target information can be displayed to the object through various visual forms such as a display screen, a loudspeaker and the like.
According to the embodiment of the disclosure, reminder information is presented to the object in a visual form, so that unnecessary potential safety hazards to driving and working caused by non-compliant behavior of the object in certain scenarios can be avoided.
Fig. 13 schematically illustrates a flowchart of a travel safety detection method of a transport vehicle according to an embodiment of the present disclosure.
As shown in fig. 13, the running safety detection method of the transport vehicle of this embodiment includes operations S1310 to S1340.
In operation S1310, in-vehicle video of the transport vehicle is acquired in real time using an image acquisition apparatus of the transport vehicle in a case where the transport vehicle is traveling, wherein the in-vehicle video includes a plurality of in-vehicle images correlated in time sequence;
in operation S1320, transmitting a plurality of in-vehicle images of the in-vehicle video to the server, so that the server processes the in-vehicle video based on the object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result characterizes a behavior gesture of at least one object in the in-vehicle video;
in operation S1330, in case the in-vehicle behavior classification result indicates that the behavior gesture of the object belongs to the offending behavior gesture, determining first alarm information corresponding to the offending behavior gesture from the alarm information list and transmitting to the transport vehicle;
in operation S1340, first alarm information is presented to the subject in a visual form.
According to the embodiment of the disclosure, the transport vehicle may refer to an escort vehicle. When a driver (i.e., an object) performs an escort task, the state of the driver largely determines escort safety; if the driver drives while fatigued or without wearing a seat belt, a driving safety accident is likely to occur. For this purpose, an image acquisition device such as a camera may be installed in the escort vehicle in advance. While the escort task is being performed, the escort vehicle transmits the in-vehicle video acquired by the image acquisition device to the server in real time, and the server classifies the behavior gesture of the driver in the in-vehicle video in real time through the object behavior classification model to obtain an in-vehicle behavior classification result.
According to the embodiment of the disclosure, under the condition that the behavior gesture of the object is indicated to belong to the illegal behavior gesture by the in-vehicle behavior classification result, for example, the driver has a fatigue state of eye closure or yawning, the first alarm information corresponding to the illegal behavior gesture is determined from the alarm information list and is transmitted to the transport vehicle, so that the first alarm information is sent to the driver through the alarm device installed in the transport vehicle, and the occurrence of escort safety accidents caused by the illegal behavior gesture of the driver is avoided as much as possible.
It should be noted that, the running safety detection method of the present disclosure may detect not only the behavior gesture of the driver, but also the behavior gesture of the person riding in the vehicle.
According to the embodiment of the disclosure, the multi-scale feature matrices in the image frames of the in-vehicle video are extracted through the object behavior classification model, so that the prediction area where the object is located can be determined as far as possible, and the object behavior in the prediction area is classified using the mark points of the key points to determine whether the object in the transport vehicle has an illegal behavior gesture, so that the alarm device sends alarm information to the object. Because the object behavior in the transport vehicle is detected in real time while the transport vehicle is running, the safety and confidentiality of the transport vehicle during the transport task are improved, and the problems of reduced safety and reduced confidentiality of the transport task caused by illegal behavior of the object while performing the transport task are avoided.
It should be noted that, the object behavior classification model of the present disclosure may not only classify behaviors of drivers and passengers in a transport vehicle, but also classify behaviors of other objects, for example, classify the pose of an animal in a zoo, such as determining whether the animal is stationary, running or flying.
According to an embodiment of the present disclosure, the travel safety detection method of a transport vehicle further includes:
acquiring vehicle state parameters of the transport vehicle with the sensor module, wherein the vehicle state parameters include at least one of: tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed and running track;
transmitting the vehicle state parameters to the server so that the server transmits second alarm information corresponding to the parameters to the transport vehicle when any one of the vehicle state parameters meets the alarm condition;
and displaying the second alarm information to the object in a visual form.
According to the embodiment of the disclosure, the second alarm information may be set according to actual conditions; for example, when the tire pressure is low, the second alarm information may be "the tire pressure is abnormal, please check whether the tire is punctured", and the like.
According to the embodiment of the disclosure, the sensor module comprises a vehicle positioning module, which collects the current running condition of the transport vehicle in real time through a vehicle-mounted GPS positioning and navigation system, so that data sharing between the escort personnel and a remote terminal is realized. If the transport vehicle deviates from the preset running track, drives in reverse, runs a red light, overspeeds, travels at an abnormally low speed, or the like, the vehicle-mounted terminal automatically sends an alarm signal to the application monitoring unit, and the data center shares the information through the big data cloud. Real-time monitoring is finally realized through the application monitoring unit, which can quickly and effectively monitor, schedule and warn about the running condition of the transport vehicle.
According to embodiments of the present disclosure, the sensor module may also collect current operating parameters of the vehicle, such as dashboard parameters including tire pressure, temperature in the vehicle, engine status, fuel level, battery level, vehicle speed, engine speed, etc. In addition, the sensor module may also comprise a collision sensor, which judges whether an emergency such as a collision has occurred by collecting the collision coefficient of the vehicle, so that corresponding second alarm information is sent to the application monitoring unit and the driver in time after a collision occurs.
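A minimal sketch of the server-side check that maps vehicle state parameters onto second alarm information is shown below. The parameter names, threshold values, and rule table are purely illustrative assumptions; the patent only states that an alarm is raised when any parameter satisfies its alarm condition.

```python
# Illustrative server-side rule table; names and thresholds are assumptions, not patent values.
ALARM_RULES = {
    "tire_pressure": lambda v: v < 1.8,        # bar, assumed lower bound
    "cabin_temperature": lambda v: v > 45.0,   # degrees Celsius, assumed upper bound
    "vehicle_speed": lambda v: v > 120.0,      # km/h, assumed limit
    "engine_speed": lambda v: v > 6000.0,      # rpm, assumed red line
}


def check_vehicle_state(parameters):
    """Return second alarm information for every parameter that satisfies its alarm condition."""
    alarms = []
    for name, condition in ALARM_RULES.items():
        if name in parameters and condition(parameters[name]):
            alarms.append(f"abnormal {name}: {parameters[name]}")
    return alarms
```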
According to the embodiment of the disclosure, the driving safety detection method can also determine the identity information of the current driver (or the passenger) through the in-vehicle video, the server judges whether the identity information belongs to the preset driver (or the preset passenger), and if the identity information does not belong to the preset driver (or the preset passenger), the server can timely send corresponding second alarm information to the application monitoring unit and the driver.
Fig. 14 schematically shows a block diagram of a training apparatus of an object behavior classification model according to an embodiment of the present disclosure.
As shown in fig. 14, the training apparatus 1400 of the object behavior classification model of this embodiment includes a first acquisition module 1410, a multi-scale module 1420, a first classification module 1430, a loss module 1440, and an iteration module 1450.
A first obtaining module 1410 configured to obtain a training set, where the training set includes a plurality of training videos and classification tags, the videos including a plurality of image frames that are associated in time sequence;
the multi-scale module 1420 is configured to input a training video into the initial behavior classification model, and output a plurality of multi-scale feature matrices corresponding to the training video, where each multi-scale feature matrix includes a plurality of prediction regions including different keypoints of the object;
the first classification module 1430 is configured to perform behavior classification processing on a plurality of prediction areas corresponding to the training video based on the mark points of the training video, so as to obtain a behavior classification result of the training video, where the mark points represent positions of different key points of the object in each image frame;
the loss module 1440 is configured to input the classification result and the classification label corresponding to each training video into a loss function, and output a loss result;
An iteration module 1450 for iteratively adjusting network parameters of the initial behavior classification model based on the loss results, generating a trained object behavior classification model.
According to the embodiment of the disclosure, the multi-scale feature matrix in the image is extracted through the initial behavior classification model, so that the prediction area where the object is located can be determined as much as possible, the object behaviors in the prediction area are classified by using the marking points of the key points, the network parameters are iteratively adjusted based on the marking points of different key points of the object and the loss result determined by the classification result, so that the object behavior classification model capable of being used for behavior classification is obtained.
In accordance with an embodiment of the present disclosure, in the case where the number of multi-scale feature matrices is three, the multi-scale module 1420 includes a feature extraction sub-module, a first acquisition sub-module, and a second acquisition sub-module.
The feature extraction sub-module is used for carrying out channel adjustment and feature extraction processing on a plurality of image frames by utilizing the feature extraction sub-model based on a first preset step length to obtain a first image feature;
the first obtaining submodule is used for processing the first image characteristic by utilizing the channel adjustment submodule to obtain a second image characteristic and a third image characteristic;
the second obtaining submodule is used for respectively processing the first image feature, the second image feature and the third image feature by using the first multi-scale submodel, the second multi-scale submodel and the third multi-scale submodel to obtain three multi-scale feature matrixes.
According to an embodiment of the present disclosure, the feature extraction submodule includes a first obtaining unit, a second obtaining unit, a first downsampling unit, and a third obtaining unit.
The first obtaining unit is used for carrying out channel adjustment and feature extraction processing on a plurality of image frames by using a plurality of first convolution normalization layers to obtain first intermediate features, wherein one convolution normalization layer corresponds to one first preset step length;
the second obtaining unit is used for carrying out channel adjustment and feature stacking processing on the first intermediate features by utilizing the first feature processing layer to obtain second intermediate features;
The first downsampling unit is used for downsampling the second intermediate feature by using the first downsampling layer to obtain a third intermediate feature;
and the third obtaining unit is used for carrying out channel adjustment and feature stacking processing on the third intermediate features by utilizing the second feature processing layer to obtain the first image features.
According to an embodiment of the disclosure, the first obtaining sub-module includes a second downsampling unit, a fourth obtaining unit, a third downsampling unit, and a fifth obtaining unit.
The second downsampling unit is used for downsampling the first image feature by using the second downsampling layer to obtain a fourth intermediate feature;
the fourth obtaining unit is used for carrying out channel adjustment and feature extraction processing on the fourth intermediate features by utilizing the third feature processing layer to obtain second image features;
the third downsampling unit is used for downsampling the second image features by using the third downsampling layer to obtain fifth intermediate features;
and a fifth obtaining unit, configured to perform channel adjustment and feature extraction processing on the fifth intermediate feature by using the fourth feature processing layer, so as to obtain a third image feature.
According to an embodiment of the present disclosure, the second obtaining submodule includes a first output unit, a second output unit, and a third output unit.
The first output unit is used for processing the first image features and the first transition features by using the first multi-scale submodel and outputting a multi-scale feature matrix and the second transition features;
the second output unit is used for processing the second image feature, the second transition feature and the third transition feature by utilizing the second multi-scale submodel and outputting a multi-scale feature matrix, a first transition feature and a fourth transition feature;
and the third output unit is used for processing the third image feature and the fourth transition feature by using the third multi-scale submodel and outputting a multi-scale feature matrix and the third transition feature.
According to an embodiment of the present disclosure, the first output unit includes a first obtaining subunit, a second obtaining subunit, a third obtaining subunit, a fourth obtaining subunit, a first downsampling subunit, a fifth obtaining subunit, and a sixth obtaining subunit.
The first obtaining subunit is used for respectively carrying out channel adjustment and feature extraction processing on the first image features and the first transition features by utilizing two second convolution normalization layers based on a second preset step length to obtain first channel features and second channel features;
the second obtaining subunit is used for performing feature layer expansion processing on the second channel feature by utilizing the first feature expansion layer to obtain a third channel feature;
A third obtaining subunit, configured to perform feature stacking processing on the first channel feature and the third channel feature by using the first feature stacking layer to obtain a fourth channel feature;
a fourth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the fourth channel feature by using a fifth feature processing layer to obtain a fifth channel feature, where the fifth channel feature includes two sub-channel features with preset channel numbers;
the first downsampling subunit is used for downsampling one subchannel characteristic by utilizing the fourth downsampling layer to obtain a second transition characteristic;
fifth obtaining a subunit, configured to perform convolution, normalization and feature stacking processing on another sub-channel feature by using the first convolution and superposition layer to obtain a sixth channel feature;
and a sixth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the sixth channel feature by using a third convolution normalization layer based on the second preset step size, to obtain a first multi-scale feature matrix, where the first multi-scale feature matrix includes a first preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the second output unit includes a seventh obtaining subunit, an eighth obtaining subunit, a ninth obtaining subunit, a tenth obtaining subunit, an eleventh obtaining subunit, a twelfth obtaining subunit, a second downsampling subunit, a thirteenth obtaining subunit, and a fourteenth obtaining subunit.
A seventh obtaining subunit, configured to perform channel adjustment and feature extraction processing on the second image feature and the third transition feature by using two fourth convolution normalization layers based on a third preset step size, to obtain a seventh channel feature and an eighth channel feature;
an eighth obtaining subunit, configured to perform feature layer expansion processing on the eighth channel feature by using the second feature expansion layer to obtain a ninth channel feature;
a ninth obtaining subunit, configured to perform feature stacking processing on the seventh channel feature and the ninth channel feature by using the second feature stacking layer to obtain a tenth channel feature;
a tenth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the tenth channel feature by using a sixth feature processing layer, to obtain an eleventh channel feature;
an eleventh obtaining subunit, configured to perform feature stacking processing on the eleventh channel feature and the second transition feature by using the third feature stacking layer to obtain a twelfth channel feature;
a twelfth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the twelfth channel feature by using a seventh feature processing layer to obtain a thirteenth channel feature;
the second downsampling subunit is used for downsampling the thirteenth channel characteristic by using the fifth downsampling layer to obtain a fourth transition characteristic;
A thirteenth obtaining subunit, configured to perform convolution, normalization and feature stacking processing on the thirteenth channel feature by using the second convolution and superposition layer to obtain a fourteenth channel feature;
the fourteenth obtaining subunit is configured to perform channel adjustment and feature extraction processing on the fourteenth channel feature by using a fifth convolution normalization layer based on a third preset step size, so as to obtain a second multi-scale feature matrix, where the second multi-scale feature matrix includes a second preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the third output unit includes a fifteenth obtaining subunit, a sixteenth obtaining subunit, a seventeenth obtaining subunit, an eighteenth obtaining subunit, and a nineteenth obtaining subunit.
A fifteenth obtaining subunit, configured to perform feature extraction, pooling and stacking processing on the third image feature by using the feature extraction stacking layer to obtain a third transition feature;
a sixteenth obtaining subunit, configured to perform feature stacking processing on the third transition feature and the fourth transition feature by using the fourth feature stacking layer to obtain a fifteenth channel feature;
seventeenth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the fifteenth channel feature by using an eighth feature processing layer to obtain a sixteenth channel feature;
An eighteenth obtaining subunit, configured to perform convolution, normalization and feature stacking processing on the sixteenth channel feature by using a third convolution and superposition layer to obtain a seventeenth channel feature;
and a nineteenth obtaining subunit, configured to perform channel adjustment and feature extraction processing on the seventeenth channel feature by using a sixth convolution normalization layer based on a fourth preset step size to obtain a third multi-scale feature matrix, where the third multi-scale feature matrix includes a third preset number of grids and a target number of channels.
According to an embodiment of the present disclosure, the first classification module 1430 includes a processing sub-module, a determination sub-module, a first classification sub-module, a second classification sub-module, and a third classification sub-module.
The processing sub-module is used for processing the positions of the marking points in each preset area based on a preset key point model to obtain the state parameters of each key point;
a determining submodule, configured to determine a behavior state of the object and a time and/or a number of times in the behavior state based on the plurality of state parameters;
the first classification sub-module is used for classifying the training video into a first classification sub-result when the behavior state belongs to one of the classification lists and the time or the times meet the preset condition;
The second classification sub-module is used for classifying the training video into a second classification sub-result when the behavior state belongs to one of the classification lists and the time or the times do not meet the preset condition;
and the third classification sub-module is used for classifying the training video into a third classification sub-result in the case that the behavior state does not belong to any one of the classification lists, wherein the behavior classification result comprises a first classification sub-result, a second classification sub-result and a third classification sub-result.
Fig. 15 schematically shows a block diagram of a structure of an object behavior classification apparatus according to an embodiment of the disclosure.
As shown in fig. 15, the object behavior classification apparatus 1500 of this embodiment includes a second acquisition module 1510 and a second classification module 1520.
A second obtaining module 1510, configured to obtain a video to be classified, where the video to be classified includes a plurality of image frames to be classified that are associated in time sequence;
the second classification module 1520 is configured to input a plurality of image frames to be classified of the video to the object behavior classification model, and output a prediction behavior classification result, where the prediction behavior classification result characterizes a behavior gesture of the object in a case that the object exists in the video to be classified.
According to the embodiment of the disclosure, the multi-scale feature matrix in the image frame to be classified is extracted through the object behavior classification model, so that the prediction area where the object is located can be determined as much as possible, the object behaviors in the prediction area are classified by using the mark points of the key points, and the classification result of the behavior gesture of the object is obtained.
According to the embodiment of the disclosure, the object behavior classification device further comprises a determination module and a display module.
The determining module is used for determining target information corresponding to the preset behavior gesture from the information list under the condition that the predicted behavior classification result indicates that the behavior gesture of the object belongs to the preset behavior gesture;
and the display module is used for displaying the target information to the object in a visual form.
According to embodiments of the present disclosure, any of the first acquisition module 1410, the multi-scale module 1420, the first classification module 1430, the loss module 1440, the iteration module 1450, or the second acquisition module 1510, the second classification module 1520 may be combined in one module to be implemented, or any of them may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first acquisition module 1410, the multi-scale module 1420, the first classification module 1430, the penalty module 1440, the iteration module 1450, or the second acquisition module 1510, the second classification module 1520 may be implemented, at least in part, as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of, or in any suitable combination of, software, hardware, and firmware. Alternatively, at least one of the first acquisition module 1410, the multi-scale module 1420, the first classification module 1430, the loss module 1440, the iteration module 1450, or the second acquisition module 1510, the second classification module 1520 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
Fig. 16 schematically illustrates a block diagram of a vehicle monitoring system according to an embodiment of the present disclosure.
As shown in fig. 16, the vehicle monitoring system 1600 of this embodiment includes:
server 1610, server 1610 is configured to:
processing the in-vehicle video based on the object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result represents the behavior gesture of at least one object in the in-vehicle video;
under the condition that the behavior classification result in the vehicle shows that the behavior gesture of the object belongs to the illegal behavior gesture, determining first alarm information corresponding to the illegal behavior gesture from an alarm information list and transmitting the first alarm information to an alarm device;
transport vehicle 1620, transport vehicle 1620 includes:
a vehicle body;
an image capturing device configured to capture in-vehicle video of the transport vehicle 1620 in real time and transmit to the server 1610 in a case where the vehicle main body is traveling, wherein the in-vehicle video includes a plurality of in-vehicle images correlated in time series;
an alarm device configured to present first alarm information to the subject in a visual form.
In-vehicle video may be uploaded to the server 1610 through the in-vehicle terminal 1630 according to embodiments of the present disclosure. The server 1610 transmits the classification result to the big data cloud 1642 through the data center 1641 in the platform sharing unit 1640, so that the application monitoring unit 1650 displays, in real time through various display devices, the in-vehicle video and the classification result when the object has an offending behavior gesture.
According to the embodiment of the disclosure, the multiscale feature matrix in the image frame of the video in the vehicle is extracted through the object behavior classification model, so that the prediction area where the object is located can be determined as much as possible, the object behavior in the prediction area is classified by using the marking points of the key points, and whether the object in the transportation vehicle 1620 has an illegal behavior gesture is determined, so that the alarm device sends alarm information to the object, and the safety and confidentiality of the transportation vehicle 1620 in the transportation task are improved due to the fact that the object behavior in the transportation vehicle 1620 is detected in real time in the running process of the transportation vehicle 1620, and the problems of safety reduction and confidentiality reduction of the transportation task caused by the illegal behavior of the object in the execution of the transportation task are avoided.
According to an embodiment of the present disclosure, the vehicle monitoring system 1600 further includes:
a sensor module configured to:
collecting the vehicle state parameters of the transport vehicle 1620 and transmitting them to the server 1610, wherein the vehicle state parameters include at least one of: tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed, and running track;
wherein the server 1610 is further configured to transmit second alarm information corresponding to that parameter to the alarm device in the case that any one of the vehicle state parameters satisfies an alarm condition;
the alarm device is further configured to present the second alarm information to the object in a visual form.
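For the vehicle state parameters, the alarm-condition check can be pictured with the minimal sketch below; the parameter names and threshold values are assumptions chosen only for illustration and are not specified by this disclosure.

```python
# A minimal sketch of checking vehicle state parameters against alarm
# conditions and producing second alarm information. Thresholds are assumed.
from typing import Callable, Dict, List

VEHICLE_ALARM_CONDITIONS: Dict[str, Callable[[float], bool]] = {
    "tire_pressure_kpa": lambda v: v < 180 or v > 300,
    "in_vehicle_temperature_c": lambda v: v > 45,
    "vehicle_speed_kmh": lambda v: v > 120,
    "engine_speed_rpm": lambda v: v > 6000,
}

def check_vehicle_state(params: Dict[str, float]) -> List[str]:
    """Return second alarm information for every parameter that satisfies its alarm condition."""
    alarms = []
    for name, value in params.items():
        condition = VEHICLE_ALARM_CONDITIONS.get(name)
        if condition is not None and condition(value):
            alarms.append(f"Vehicle state alarm: {name} = {value}")
    return alarms
```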
According to embodiments of the present disclosure, the vehicle monitoring system 1600 further includes a platform sharing unit 1640 and an application supervision unit 1650. The application supervision unit 1650 can communicate bidirectionally with the server 1610 and the transport vehicle 1620 in real time through the platform sharing unit 1640, so that the in-vehicle video and the vehicle state parameters can be displayed in the application supervision unit 1650 in real time. A supervisor can therefore learn about the driving safety of the transport vehicle 1620 in time and can monitor and schedule the transport vehicle 1620 in case of an emergency, which effectively guarantees transport safety, standardization, and privacy.
Fig. 17 schematically illustrates a block diagram of an electronic device adapted to implement the above-described method according to an embodiment of the present disclosure.
As shown in fig. 17, the electronic device 1700 according to the embodiment of the present disclosure includes a processor 1701 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1702 or a program loaded from a storage portion 1708 into a Random Access Memory (RAM) 1703. The processor 1701 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special-purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1701 may also include on-board memory for caching purposes. The processor 1701 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 1703, various programs and data necessary for the operation of the electronic device 1700 are stored. The processor 1701, the ROM 1702, and the RAM 1703 are connected to each other through a bus 1704. The processor 1701 performs various operations of the method flow according to an embodiment of the present disclosure by executing programs in the ROM 1702 and/or the RAM 1703. Note that the program may be stored in one or more memories other than the ROM 1702 and the RAM 1703. The processor 1701 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1700 may also include an input/output (I/O) interface 1705, which is also connected to the bus 1704. The electronic device 1700 may also include one or more of the following components connected to the input/output (I/O) interface 1705: an input portion 1706 including a keyboard, a mouse, and the like; an output portion 1707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 1708 including a hard disk and the like; and a communication portion 1709 including a network interface card such as a LAN card or a modem. The communication portion 1709 performs communication processing via a network such as the Internet. A drive 1710 is also connected to the input/output (I/O) interface 1705 as needed. A removable medium 1711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1710 as needed, so that a computer program read therefrom is installed into the storage portion 1708 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 1702 and/or RAM 1703 described above and/or one or more memories other than ROM 1702 and RAM 1703.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
When the computer program is executed by the processor 1701, the above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, and downloaded and installed via the communication portion 1709 and/or installed from the removable medium 1711. The computer program may include program code that may be transmitted using any appropriate network medium, including, but not limited to, wireless or wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Such programming languages include, but are not limited to, Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (20)

1. A method of training a classification model of object behavior, comprising:
obtaining a training set, wherein the training set comprises a plurality of training videos and classification labels, and the videos comprise a plurality of image frames which are associated in time sequence;
inputting the training video into an initial behavior classification model, and outputting a plurality of multi-scale feature matrices corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object;
performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent positions of different key points of the object in each image frame;
inputting the classification result and the classification label corresponding to each training video into a loss function, and outputting a loss result;
iteratively adjusting network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.
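Purely as an editorial illustration of the training flow recited in claim 1 (not the claimed implementation), a PyTorch-style sketch might look as follows. The model, data loader, behavior-classification helper, optimizer settings, and loss choice are all assumptions.

```python
# A hedged sketch of the claim-1 training loop: forward the training video,
# classify behavior from the prediction areas using the marking points,
# compute the loss against the classification label, and iteratively adjust
# the network parameters. Interfaces and hyperparameters are assumptions.
import torch
from torch import nn, optim

def train_object_behavior_classifier(model: nn.Module,
                                     classify_behavior,   # assumed helper using marking points of key points
                                     train_loader,        # assumed to yield (frames, marking_points, label)
                                     num_epochs: int = 10) -> nn.Module:
    criterion = nn.CrossEntropyLoss()                     # assumed loss function
    optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(num_epochs):
        for frames, marking_points, label in train_loader:
            multi_scale_features = model(frames)          # multi-scale feature matrices with prediction areas
            logits = classify_behavior(multi_scale_features, marking_points)
            loss = criterion(logits, label)               # loss result
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                              # iteratively adjust network parameters
    return model
```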
2. The method of claim 1, wherein, in the case that the number of the multi-scale feature matrices is three, the inputting the training video to an initial behavior classification model, outputting a plurality of multi-scale feature matrices corresponding to the training video, comprises:
based on a first preset step length, carrying out channel adjustment and feature extraction processing on a plurality of image frames by utilizing a feature extraction sub-model to obtain first image features;
processing the first image features by using a channel adjustment sub-model to obtain second image features and third image features;
and respectively processing the first image feature, the second image feature and the third image feature by using a first multi-scale sub-model, a second multi-scale sub-model and a third multi-scale sub-model to obtain three multi-scale feature matrices.
3. The method according to claim 2, wherein the performing channel adjustment and feature extraction processing on the plurality of image frames by using a feature extraction sub-model based on a first preset step length to obtain a first image feature includes:
performing channel adjustment and feature extraction processing on a plurality of image frames by using a plurality of first convolution normalization layers to obtain first intermediate features, wherein one convolution normalization layer corresponds to a first preset step length;
performing channel adjustment and feature stacking processing on the first intermediate features by using a first feature processing layer to obtain second intermediate features;
performing downsampling processing on the second intermediate feature by using a first downsampling layer to obtain a third intermediate feature;
and carrying out channel adjustment and feature stacking processing on the third intermediate features by using a second feature processing layer to obtain the first image features.
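For orientation only, a claim-3-style feature extraction sub-model could be sketched as below. The channel widths, kernel sizes, strides, and module names (ConvNorm, FeatureProcessing) are illustrative assumptions rather than the layers actually disclosed.

```python
# A hedged sketch of a feature extraction sub-model in the spirit of claim 3:
# stacked convolution-normalization layers with a preset stride, feature
# processing layers that adjust channels and stack features, and a
# downsampling layer. All hyperparameters are illustrative assumptions.
import torch
from torch import nn

class ConvNorm(nn.Module):
    """Convolution + batch normalization + activation (channel adjustment / feature extraction)."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class FeatureProcessing(nn.Module):
    """Channel adjustment followed by feature stacking along the channel dimension."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.adjust = ConvNorm(c_in, c_out // 2, k=1)
        self.extract = ConvNorm(c_out // 2, c_out // 2, k=3)
        self.fuse = ConvNorm(c_out, c_out, k=1)
    def forward(self, x):
        a = self.adjust(x)
        return self.fuse(torch.cat([a, self.extract(a)], dim=1))   # feature stacking

class FeatureExtractionSubModel(nn.Module):
    def __init__(self, stride=2):                                   # "first preset step length" (assumed)
        super().__init__()
        self.conv_norms = nn.Sequential(ConvNorm(3, 32, stride=stride),
                                        ConvNorm(32, 64, stride=stride))  # first intermediate feature
        self.process1 = FeatureProcessing(64, 64)                   # second intermediate feature
        self.down1 = ConvNorm(64, 128, stride=2)                    # third intermediate feature
        self.process2 = FeatureProcessing(128, 128)                 # first image feature
    def forward(self, frames):
        return self.process2(self.down1(self.process1(self.conv_norms(frames))))
```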
4. The method of claim 2, wherein the processing the first image feature using the channel adjustment sub-model to obtain a second image feature and a third image feature comprises:
downsampling the first image feature by using a second downsampling layer to obtain a fourth intermediate feature;
performing channel adjustment and feature extraction processing on the fourth intermediate feature by using a third feature processing layer to obtain the second image feature;
downsampling the second image feature by using a third downsampling layer to obtain a fifth intermediate feature;
and carrying out channel adjustment and feature extraction processing on the fifth intermediate feature by using a fourth feature processing layer to obtain the third image feature.
5. The method of claim 2, wherein the processing the first image feature, the second image feature, and the third image feature with the first multi-scale sub-model, the second multi-scale sub-model, and the third multi-scale sub-model, respectively, results in three multi-scale feature matrices, comprising:
processing the first image feature and the first transition feature by using the first multi-scale sub-model, and outputting a multi-scale feature matrix and a second transition feature;
processing the second image feature, the second transition feature and the third transition feature by using the second multi-scale sub-model, and outputting a multi-scale feature matrix, the first transition feature and a fourth transition feature;
and processing the third image feature and the fourth transition feature by using the third multi-scale sub-model, and outputting a multi-scale feature matrix and the third transition feature.
6. The method of claim 5, wherein said processing said first image feature and first transition feature using said first multi-scale sub-model to output one of said multi-scale feature matrices and the second transition feature comprises:
based on a second preset step length, carrying out channel adjustment and feature extraction processing on the first image feature and the first transition feature by using two second convolution normalization layers to obtain a first channel feature and a second channel feature;
performing feature layer expansion processing on the second channel features by using the first feature expansion layer to obtain third channel features;
performing feature stacking processing on the first channel feature and the third channel feature by using a first feature stacking layer to obtain a fourth channel feature;
performing channel adjustment and feature extraction processing on the fourth channel feature by using a fifth feature processing layer to obtain a fifth channel feature, wherein the fifth channel feature comprises two sub-channel features with preset channel numbers;
downsampling one of the sub-channel features by using a fourth downsampling layer to obtain the second transition feature;
carrying out convolution, normalization and feature superposition processing on the other sub-channel feature by using a first convolution superposition layer to obtain a sixth channel feature;
and based on the second preset step length, carrying out channel adjustment and feature extraction processing on the sixth channel features by using a third convolution normalization layer to obtain a first multi-scale feature matrix, wherein the first multi-scale feature matrix comprises a first preset number of grids and a target number of channels.
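To make the data flow in claim 6 easier to follow, here is a deliberately simplified sketch. Normalization and activation layers are omitted, the channel widths and output channel count are assumptions, and the first transition feature is assumed to arrive at half the spatial resolution of the first image feature.

```python
# A simplified sketch of a claim-6-style multi-scale sub-model, showing only
# the flow of features: channel adjustment, feature layer expansion, feature
# stacking, splitting into two sub-channel features, a downsampled transition
# feature, and a grid-shaped multi-scale feature matrix. All widths assumed.
import torch
from torch import nn

class FirstMultiScaleSubModel(nn.Module):
    def __init__(self, c_img=128, c_trans=128, c_out=255):
        super().__init__()
        self.adjust_img = nn.Conv2d(c_img, 128, 1)                  # channel adjustment of the first image feature
        self.adjust_trans = nn.Conv2d(c_trans, 128, 1)              # channel adjustment of the first transition feature
        self.expand = nn.Upsample(scale_factor=2, mode="nearest")   # feature layer expansion
        self.process = nn.Conv2d(256, 256, 3, padding=1)            # feature processing after stacking
        self.down = nn.Conv2d(128, 128, 3, stride=2, padding=1)     # downsampling -> second transition feature
        self.superpose = nn.Conv2d(128, 128, 3, padding=1)          # convolution + feature superposition
        self.head = nn.Conv2d(128, c_out, 1)                        # one value vector per grid cell

    def forward(self, first_image_feature, first_transition_feature):
        a = self.adjust_img(first_image_feature)                    # first channel feature
        b = self.expand(self.adjust_trans(first_transition_feature))  # second -> third channel feature
        fused = self.process(torch.cat([a, b], dim=1))              # feature stacking -> fifth channel feature
        sub1, sub2 = fused.chunk(2, dim=1)                          # two sub-channel features
        second_transition_feature = self.down(sub1)
        matrix = self.head(sub2 + self.superpose(sub2))             # first multi-scale feature matrix
        return matrix, second_transition_feature
```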
7. The method of claim 5 or 6, wherein said processing the second image feature, the second transition feature, and the third transition feature with the second multi-scale sub-model to output one of the multi-scale feature matrices, the first transition feature, and the fourth transition feature comprises:
based on a third preset step length, carrying out channel adjustment and feature extraction processing on the second image feature and the third transition feature by using two fourth convolution normalization layers to obtain a seventh channel feature and an eighth channel feature;
performing feature layer expansion processing on the eighth channel feature by using a second feature expansion layer to obtain a ninth channel feature;
performing feature stacking processing on the seventh channel feature and the ninth channel feature by using a second feature stacking layer to obtain a tenth channel feature;
performing channel adjustment and feature extraction processing on the tenth channel feature by using a sixth feature processing layer to obtain an eleventh channel feature;
performing feature stacking processing on the eleventh channel feature and the second transition feature by using a third feature stacking layer to obtain a twelfth channel feature;
performing channel adjustment and feature extraction processing on the twelfth channel features by using a seventh feature processing layer to obtain thirteenth channel features;
downsampling the thirteenth channel feature by using a fifth downsampling layer to obtain the fourth transition feature;
carrying out convolution, normalization and feature superposition processing on the thirteenth channel feature by using a second convolution superposition layer to obtain a fourteenth channel feature;
and based on the third preset step length, carrying out channel adjustment and feature extraction processing on the fourteenth channel feature by utilizing a fifth convolution normalization layer to obtain a second multi-scale feature matrix, wherein the second multi-scale feature matrix comprises a second preset number of grids and a target number of channels.
8. The method of claim 5, wherein said processing said third image feature and said fourth transition feature using said third multi-scale sub-model to output one of said multi-scale feature matrices and said third transition feature comprises:
performing feature extraction, pooling, and stacking processing on the third image feature by using a feature extraction stacking layer to obtain the third transition feature;
performing feature stacking processing on the third transition feature and the fourth transition feature by using a fourth feature stacking layer to obtain a fifteenth channel feature;
performing channel adjustment and feature extraction processing on the fifteenth channel feature by using an eighth feature processing layer to obtain a sixteenth channel feature;
carrying out convolution, normalization and feature superposition processing on the sixteenth channel feature by using a third convolution superposition layer to obtain a seventeenth channel feature;
and based on a fourth preset step length, carrying out channel adjustment and feature extraction processing on the seventeenth channel feature by using a sixth convolution normalization layer to obtain a third multi-scale feature matrix, wherein the third multi-scale feature matrix comprises a third preset number of grids and a target number of channels.
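The "feature extraction, pooling and stacking" step in claim 8 resembles a spatial pyramid pooling block; below is a minimal self-contained sketch under that assumption, with kernel sizes and channel widths chosen only for illustration.

```python
# A hedged sketch of a "feature extraction, pooling and stacking" layer: the
# third image feature is convolved, max-pooled at several scales, and the
# pooled maps are stacked along the channel dimension. Parameters assumed.
import torch
from torch import nn

class FeatureExtractionStackingLayer(nn.Module):
    def __init__(self, c_in=256, c_out=256):
        super().__init__()
        self.extract = nn.Conv2d(c_in, c_out // 4, 1)                    # feature extraction
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2)
                                   for k in (5, 9, 13))                  # pooling at several scales
        self.fuse = nn.Conv2d(c_out, c_out, 1)                           # fuse the stacked features

    def forward(self, third_image_feature):
        x = self.extract(third_image_feature)
        stacked = torch.cat([x] + [pool(x) for pool in self.pools], dim=1)  # feature stacking
        return self.fuse(stacked)                                        # third transition feature
```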
9. The method of claim 1, wherein the performing, based on the marking points of the training video, the behavior classification processing on the plurality of prediction areas corresponding to the training video to obtain the behavior classification result of the training video includes:
processing the positions of the marking points in each preset area based on a preset key point model to obtain state parameters of each key point;
determining a behavior state of the object and a time and/or a number of times in the behavior state based on a plurality of the state parameters;
classifying the training video as a first classification sub-result if the behavior state belongs to one of the classification lists and if the time or the number of times satisfies a preset condition;
classifying the training video as a second classification sub-result if the behavioral state belongs to one of the classification lists and if the time or the number of times does not satisfy a preset condition;
and classifying the training video into a third classification sub-result in the case that the behavior state does not belong to any one of the classification lists, wherein the behavior classification result comprises the first classification sub-result, the second classification sub-result and the third classification sub-result.
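The three-way decision in claim 9 can be pictured with the following sketch; the behavior state names, the duration and count thresholds, and the way key-point state parameters are summarized into a single observation are assumptions made for illustration.

```python
# A minimal sketch of the claim-9-style decision logic: derive a behavior
# state from key-point state parameters, then classify the video according to
# whether the state is in the classification list and whether its duration or
# count satisfies a preset condition. Names and thresholds are assumptions.
from dataclasses import dataclass

CLASSIFICATION_LIST = {"hand_off_wheel", "head_down", "leaning_out"}   # assumed behavior states
MIN_DURATION_S = 3.0
MIN_COUNT = 5

@dataclass
class BehaviorObservation:
    state: str         # behavior state derived from key-point state parameters
    duration_s: float  # time spent in the state
    count: int         # number of occurrences of the state

def classify_training_video(obs: BehaviorObservation) -> str:
    if obs.state not in CLASSIFICATION_LIST:
        return "third_classification_sub_result"
    if obs.duration_s >= MIN_DURATION_S or obs.count >= MIN_COUNT:
        return "first_classification_sub_result"
    return "second_classification_sub_result"
```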
10. An object behavior classification method, comprising:
acquiring a video to be classified, wherein the video to be classified comprises a plurality of time-sequence-related image frames to be classified;
inputting a plurality of image frames to be classified of the video to be classified into an object behavior classification model, and outputting a prediction behavior classification result, wherein the prediction behavior classification result represents the behavior gesture of an object under the condition that the object exists in the video to be classified;
wherein the object behavior classification model is trained using the method of any one of claims 1 to 9.
11. The method of claim 10, further comprising:
determining target information corresponding to a preset behavior gesture from an information list under the condition that the predicted behavior classification result shows that the behavior gesture of the object belongs to the preset behavior gesture;
and presenting the target information to the object in a visual form.
12. A travel safety detection method of a transport vehicle, comprising:
acquiring in-vehicle video of the transport vehicle in real time by using an image acquisition device of the transport vehicle under the condition that the transport vehicle runs, wherein the in-vehicle video comprises a plurality of in-vehicle images which are associated in time sequence;
transmitting a plurality of in-vehicle images of the in-vehicle video to a server, so that the server processes the in-vehicle video based on an object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result characterizes a behavior gesture of at least one object in the in-vehicle video, and the object behavior classification model is trained by the method according to any one of claims 1 to 9;
under the condition that the in-vehicle behavior classification result shows that the behavior gesture of the object belongs to an illegal behavior gesture, determining first alarm information corresponding to the illegal behavior gesture from an alarm information list and transmitting the first alarm information to the transport vehicle;
and displaying the first alarm information to the object in a visual form.
13. The method of claim 12, further comprising:
collecting a vehicle state parameter of the transport vehicle with a sensor module, wherein the vehicle state parameter comprises at least one of: tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed and running track;
transmitting the vehicle state parameters to the server so that the server transmits second alarm information corresponding to the parameters to the transport vehicle when any one of the vehicle state parameters satisfies an alarm condition;
and displaying the second alarm information to the object in a visual form.
14. A training apparatus for an object behavior classification model, comprising:
a first acquisition module configured to acquire a training set, where the training set includes a plurality of training videos and classification tags, the videos including a plurality of image frames that are associated in time sequence;
the multi-scale module is used for inputting the training video into an initial behavior classification model and outputting a plurality of multi-scale feature matrices corresponding to the training video, wherein each multi-scale feature matrix comprises a plurality of prediction areas comprising different key points of an object;
the first classification module is used for performing behavior classification processing on a plurality of prediction areas corresponding to the training video based on the marking points of the training video to obtain behavior classification results of the training video, wherein the marking points represent the positions of different key points of the object in each image frame;
the loss module is used for inputting the classification result and the classification label corresponding to each training video into a loss function and outputting a loss result;
and the iteration module is used for iteratively adjusting network parameters of the initial behavior classification model according to the loss result to generate a trained object behavior classification model.
15. An object behavior classification apparatus comprising:
a second acquisition module used for acquiring a video to be classified, wherein the video to be classified comprises a plurality of time-sequence-related image frames to be classified;
a second classification module used for inputting a plurality of image frames to be classified of the video to be classified into an object behavior classification model and outputting a prediction behavior classification result, wherein the prediction behavior classification result represents the behavior gesture of an object under the condition that the object exists in the video to be classified;
Wherein the object behavior classification model is trained using the method of any one of claims 1 to 9.
16. A vehicle monitoring system comprising:
a server configured to:
processing an in-vehicle video based on an object behavior classification model to obtain an in-vehicle behavior classification result, wherein the in-vehicle behavior classification result characterizes the behavior gesture of at least one object in the in-vehicle video, and the object behavior classification model is trained by the method according to any one of claims 1 to 9;
under the condition that the behavior classification result in the vehicle shows that the behavior gesture of the object belongs to the illegal behavior gesture, determining first alarm information corresponding to the illegal behavior gesture from an alarm information list and transmitting the first alarm information to an alarm device;
a transport vehicle, the transport vehicle comprising:
a vehicle body;
an image acquisition device configured to acquire the in-vehicle video of the transport vehicle in real time and transmit to the server in a case where the vehicle body is traveling, wherein the in-vehicle video includes a plurality of in-vehicle images associated in time series;
the alarm device is configured to present the first alarm information to the object in a visual form.
17. The system of claim 16, further comprising:
a sensor module configured to:
collecting vehicle state parameters of the transport vehicle and transmitting the vehicle state parameters to the server, wherein the vehicle state parameters include at least one of:
tire pressure, temperature in the vehicle, engine state, cruising ability, vehicle speed, engine speed and running track;
wherein the server is further configured to transmit second alarm information corresponding to the parameter to the alarm device in the case that any one of the vehicle state parameters satisfies an alarm condition;
the alarm device is further configured to present the second alarm information to the object in a visual form.
18. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-11.
19. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-11.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202310576951.2A 2023-05-22 2023-05-22 Training method, classification method, detection method, device, system and equipment Pending CN116597516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310576951.2A CN116597516A (en) 2023-05-22 2023-05-22 Training method, classification method, detection method, device, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310576951.2A CN116597516A (en) 2023-05-22 2023-05-22 Training method, classification method, detection method, device, system and equipment

Publications (1)

Publication Number Publication Date
CN116597516A true CN116597516A (en) 2023-08-15

Family

ID=87605834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310576951.2A Pending CN116597516A (en) 2023-05-22 2023-05-22 Training method, classification method, detection method, device, system and equipment

Country Status (1)

Country Link
CN (1) CN116597516A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058885A (en) * 2023-10-11 2023-11-14 广州扬名信息科技有限公司 Vehicle condition information feedback sharing service system
CN117058885B (en) * 2023-10-11 2023-12-08 广州扬名信息科技有限公司 Vehicle condition information feedback sharing service system

Similar Documents

Publication Publication Date Title
Kashevnik et al. Methodology and mobile application for driver behavior analysis and accident prevention
US11475770B2 (en) Electronic device, warning message providing method therefor, and non-transitory computer-readable recording medium
CN111310562B (en) Vehicle driving risk management and control method based on artificial intelligence and related equipment thereof
KR20210134638A (en) autonomous vehicle system
CN108423006A (en) A kind of auxiliary driving warning method and system
US9990182B2 (en) Computer platform for development and deployment of sensor-driven vehicle telemetry applications and services
CN111274881A (en) Driving safety monitoring method and device, computer equipment and storage medium
Peng et al. Uncertainty evaluation of object detection algorithms for autonomous vehicles
Zhang et al. Prediction of pedestrian-vehicle conflicts at signalized intersections based on long short-term memory neural network
WO2022078077A1 (en) Driving risk early warning method and apparatus, and computing device and storage medium
US20190077353A1 (en) Cognitive-based vehicular incident assistance
CN116597516A (en) Training method, classification method, detection method, device, system and equipment
WO2016107876A1 (en) Vehicular motion monitoring method
CN115620208A (en) Power grid safety early warning method and device, computer equipment and storage medium
CN116385185A (en) Vehicle risk assessment auxiliary method, device, computer equipment and storage medium
CN107451719B (en) Disaster area vehicle allocation method and disaster area vehicle allocation device
Shaikh et al. Identifying Driver Behaviour Through Obd-Ii Using Android Application
Sirisha et al. Utilizing a Hybrid Model for Human Injury Severity Analysis in Traffic Accidents.
Hovorushchenko et al. Road Accident Prevention System
Li et al. A Deep Multichannel Network Model for Driving Behavior Risk Classification
Das et al. Dribe: on-road mobile telemetry for locality-neutral driving behavior annotation
Rudrusamy et al. IoT-based vehicle monitoring and driver assistance system framework for safety and smart fleet management
Menendez et al. Detecting and Predicting Smart Car Collisions in Hybrid Environments from Sensor Data
US20210217510A1 (en) Correlating driving behavior and user conduct
US11721101B2 (en) Ranging system data utilization for marking of video data of interest

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination