CN111860253A - Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene - Google Patents


Info

Publication number
CN111860253A
CN111860253A (application CN202010662803.9A)
Authority
CN
China
Prior art keywords
attribute
multitask
identification
attribute identification
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010662803.9A
Other languages
Chinese (zh)
Inventor
顾一新 (Gu Yixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Zhengyang Electronic Mechanical Co ltd
Original Assignee
Dongguan Zhengyang Electronic Mechanical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Zhengyang Electronic Mechanical Co ltd filed Critical Dongguan Zhengyang Electronic Mechanical Co ltd
Priority to CN202010662803.9A
Publication of CN111860253A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification

Abstract

The embodiment of the application discloses a multitask attribute identification method, device, medium and equipment for driving scenarios. The method comprises the following steps: acquiring image data of a driver in a driving scenario; inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the model comprise mask identification, glasses identification, smoking identification and phone-call identification; and determining the driver's multitask attribute identification result according to the attribute identification outputs of the model. By executing this technical scheme, a single multitask network model can identify multiple driver characteristics simultaneously and produce an output that reflects the driver's overall driving state, thereby improving the accuracy of driver-state identification.

Description

Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene
Technical Field
Embodiments of the application relate to the field of computer technology, in particular to image recognition, and specifically to a multitask attribute identification method, device, medium and equipment for driving scenarios.
Background
In an ADAS system, providing the driver's characteristic attributes accurately and in a timely manner is a key factor in making a safe-driving system intelligent. A common approach is to use separate models to identify the different driver attributes.
Driver-attribute identification methods generally use deep learning to identify a particular attribute feature of the driver, and the traditional approach combines several models to cover all attribute features. This approach has three drawbacks. First, it handles only simple scenes and struggles with in-cab conditions such as occlusion, poor lighting, multiple occupants, or missing key attributes. Second, running several models slows the system's response, enlarges the total model size, raises the hardware requirements, and cannot guarantee that all attribute features belong to the same person when several people are present. Third, for identifying the smoking and phone-call attributes, most existing methods use an anchor-based neural network model, which prevents fusion with the other attribute tasks and increases both computation and labeling effort.
Therefore, it is necessary to design a method that covers all the important driver attribute features while offering good real-time performance and accuracy.
Disclosure of Invention
The embodiment of the application provides a multitask attribute identification method, device, medium and equipment for driving scenarios, in which a single multitask network model identifies multiple driver characteristics simultaneously and produces an output that reflects the driver's overall driving state, thereby improving the accuracy of driver-state identification.
In a first aspect, an embodiment of the present application provides a multitask attribute identification method for a driving scenario, where the method includes:
acquiring image data of a driver from camera video of a driving scenario;
inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
and determining the driver's multitask attribute identification result according to the attribute identification outputs of the multitask network model.
Further, inputting the image data into the trained multitask network model for attribute identification includes:
preprocessing the image data;
inputting the preprocessed image data into the pre-trained multitask network model for attribute identification, to obtain the model's attribute identification results on whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call;
caching the attribute identification results of the target tasks;
and after caching the attribute identification results for a specific duration, judging from the ratio of positive judgments in the cached results whether the target-task attribute exists; if so, determining the target-task attribute of the sample image.
Further, preprocessing the image data includes:
acquiring a sample image, and determining the coding features of the human face in the sample image;
determining the driver's face frame from the coding features by using a face recognition model;
and adjusting and cropping the image according to the driver's face frame to obtain a cropped face image, and finally performing normalization.
Further, the training of the multitask network model comprises:
inputting the preprocessed image data into the multitask main network to obtain a feature map, using an OSME module to extract separate feature maps for the different attribute identification tasks, and inputting these feature maps into their respective attribute identification modules, namely a module for whether a mask is worn, a module for whether glasses are worn, a module for whether the driver is smoking, and a module for whether the driver is making a phone call;
the outputs of the whole network are the feature map of each task and the attribute identification result of each task, the per-task outputs being whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call;
comparing the attribute identification results with the labeling results and reducing the differences through the corresponding loss functions, inputting the feature map of each task into its corresponding loss function, and optimizing the network parameters;
and adjusting the weights of the loss functions to optimize the training results of the multitask network model and reduce the training time.
Further, before the training of the multitask network model, the method further comprises:
labeling the sample image data with multitask attributes;
determining an initial architecture of the multitask network model;
and determining the loss functions.
Furthermore, the attribute identification tasks of whether a mask is worn, whether the driver is smoking and whether the driver is making a phone call each have two output categories;
the output categories of the glasses attribute identification task comprise: wearing infrared-blocking glasses, wearing ordinary glasses, and not wearing glasses.
Further, the mask and glasses attribute identification modules are each used to output, respectively, the attribute identification result of whether a mask is worn and whether glasses are worn.
Further, for the smoking and phone-call attribute identification tasks, each identification module comprises a main-body identification module and an attention module;
the main-body identification module is used to output the attribute identification results of whether the driver is smoking and whether the driver is making a phone call;
the attention module comprises an attention proposal sub-network (APN) and an attention network, and is used for attribute identification and for refining the attribute identification result of the main network.
Further, the attention proposal sub-network mainly comprises two fully-connected layers connected in series, which output the center-point coordinates and the side length of the attention region so that the original input image can be cropped and enlarged accordingly.
Further, the attention network analyzes the cropped and enlarged image and outputs new results of whether the driver is smoking and whether the driver is making a phone call; if needed, it can crop and enlarge the image once more, analyze it again, and output an identification result.
Further, the classification loss and the Lrank loss are calculated from the fine-grained features generated by the attention network, so as to constrain the attribute analysis results of the main network and the attention network.
Further, after the training of the multitask network model, the method further comprises:
selecting the network architecture and model with the best attribute analysis results according to the test accuracy;
and deleting the networks and parameters that do not contribute to the final attribute identification output, reducing the size of the network model and the forward-pass time.
In a second aspect, an embodiment of the present application provides a multitask attribute identification device for a driving scenario, where the device includes:
the image data acquisition module is used for acquiring image data of a driver in a driving scene;
the recognition module is used for inputting the image data into the pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
and the multitask attribute identification result determining module is used for determining the multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a multitask attribute identification method for a driving scenario according to the present application.
In a fourth aspect, the present application provides an apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for multitask attribute identification of a driving scenario according to the present application when executing the computer program.
According to the technical scheme provided by the embodiment of the application, image data of a driver in a driving scenario are acquired; the image data are input into a pre-trained multitask network model whose attribute identification tasks comprise mask identification, glasses identification, smoking identification and phone-call identification; and the driver's multitask attribute identification result is determined from the model's attribute identification outputs. With this scheme, a single multitask network model identifies multiple driver characteristics simultaneously and produces an output that reflects the driver's overall driving state, thereby improving the accuracy of driver-state identification.
Drawings
FIG. 1 is a flowchart of a multitask attribute identification method for a driving scenario provided by an embodiment of the application;
FIG. 2 is a flow chart of an attribute identification process of a multitasking network model provided by an embodiment of the present application;
FIG. 3 is a diagram illustrating a training process of a multitask network model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an attention suggestion subnetwork as provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a multitask attribute identifying device for a driving scenario according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Fig. 1 is a flowchart of a multitask attribute identification method for a driving scenario provided in an embodiment of the present application. This embodiment is applicable to driver-state identification, and the method may be executed by the multitask attribute identification device for a driving scenario provided in an embodiment of the present application; the device may be implemented in software and/or hardware and may be integrated into equipment such as an intelligent terminal.
As shown in fig. 1, the multitask attribute identification method for the driving scenario includes:
and S110, acquiring image data of a driver in a driving scene.
The driver's image data can be acquired by an ordinary camera, an infrared camera or the like, mounted at the position of the vehicle's rearview mirror or anywhere else from which the driver's face can be captured. To meet the requirements of the actual application scenario, a 940 nm infrared camera is adopted, an acquisition hardware and software platform is built, and video data are collected inside the vehicle cab.
In this embodiment, it can be understood that after the image data are acquired, operations such as cleaning, storage and labeling can be performed on them. Specifically, image frames are extracted in batches from the acquired data, the image data are cleaned, and the target frames are labeled. Using both an infrared camera and an ordinary camera for data acquisition ensures stable imaging output in complex environments and thus stable image features, which benefits the learning of the network model and largely alleviates or avoids problems such as backlighting.
In this scheme, the acquired RGB image can be input into a backbone network to extract the image's coding features; a driving-scene face recognition neural network then analyzes the coding features. If the confidence is above a set threshold, the face-frame coordinates closest to the previous result are given; if it is below the threshold, the network directly outputs that no driver is present and the process ends. The image is then proportionally enlarged around the face-frame coordinates, the input image is cropped, and the cropped image is normalized and standardized.
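As an illustration of this gating logic, the following is a minimal sketch under stated assumptions: `face_net`, an interface returning a confidence score and candidate face boxes, is hypothetical, and the 0.5 threshold is a placeholder not given in the text.

```python
def detect_driver_face(encoding, face_net, threshold=0.5, last_box=None):
    """Gate on face confidence, then keep the box nearest the last result.

    `face_net` is assumed to map backbone coding features to a confidence
    score and a list of candidate face boxes (x1, y1, x2, y2); neither this
    interface nor the threshold value comes from the patent text.
    """
    score, boxes = face_net(encoding)
    if score < threshold or not boxes:
        return None  # directly output: no driver present, and finish

    if last_box is None:
        return boxes[0]

    def center(box):
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    # The driver's face moves little between frames, so the box whose center
    # is closest to the previous result is taken as the driver's face frame.
    cx, cy = center(last_box)
    return min(boxes,
               key=lambda b: (center(b)[0] - cx) ** 2 + (center(b)[1] - cy) ** 2)
```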
S120, inputting the image data into the pre-trained multitask network model for attribute identification; the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification.
The acquired image data can be input into the pre-trained multitask network model, whose identification tasks comprise mask identification, glasses identification, smoking identification and phone-call identification. The different identification tasks can run simultaneously: once the image's features have been determined, all identification tasks can be completed at the same time, and when corresponding information is recognized, for example a mask, the mask-identification output indicates that the driver is wearing a mask.
It can be understood that before the multitask network model is trained, the images need to be labeled according to whether the driver actually wears a mask or glasses and whether smoking or phone-call actions occur; however, no position information needs to be labeled, making this a weakly supervised, anchor-free method. Compared with the traditional approach, the labeling effort is reduced. The labeling information and the driver images are then used as inputs to train the model. During application, no labeling is needed: given only the driver's image, the model outputs the judgments of whether the driver wears a mask or glasses, is smoking, or is making a phone call.
And S130, determining a multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
The output of the model can be determined from the output of each identification task. After the model outputs the corresponding results, the driver's multitask attribute identification result is determined.
According to the technical scheme provided by this embodiment, image data of a driver in a driving scenario are acquired; the image data are input into a pre-trained multitask network model whose attribute identification tasks comprise mask identification, glasses identification, smoking identification and phone-call identification; and the driver's multitask attribute identification result is determined from the model's outputs. With this scheme, a single multitask network model identifies multiple driver characteristics simultaneously and produces an output that reflects the driver's overall driving state, improving both the accuracy and the real-time performance of driver-state identification.
Fig. 2 is a flowchart of an attribute identification process of a multitask network model according to an embodiment of the present application, where as shown in fig. 2, the attribute identification process of the multitask network model includes:
S210, preprocessing the image data.
The acquired and cropped RGB image is input into the backbone network to extract the image's coding features. Specifically, the main network adopts an SE-ResNet18 network, whose output feature maps have a 7 × 7 scale. These 7 × 7 feature maps are further processed by a one-squeeze multi-excitation (OSME) module to extract four different types of feature maps, namely mask, glasses, smoking and phone-call feature maps, which are used to judge the features of the different attribute identification tasks.
In this technical solution, optionally, preprocessing the image data includes:
acquiring a sample image, and determining the coding features of the human face in the sample image;
determining the driver's face frame from the coding features by using a face recognition model;
and adjusting and cropping the image according to the driver's face frame to obtain a cropped face image, and finally performing normalization.
A driving-scene face recognition neural network is adopted to analyze the coding features. If the confidence is above the set threshold, the face-frame coordinates closest to the previous result are given; if it is below the threshold, the network directly outputs that no driver is present and the process ends. Because the driver's face moves within a limited range inside the vehicle, the face frame closest to the previous result can be taken as the driver's face frame.
The face-frame coordinates are proportionally enlarged, the input image is cropped accordingly, and the cropped image is normalized and standardized.
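A minimal sketch of this cropping-and-normalization step, assuming OpenCV and NumPy; the margin ratio, the 112 × 112 target size (taken from the architecture description below), and the normalization constants are assumptions, since the text only says the box is properly enlarged in proportion.

```python
import cv2
import numpy as np

def preprocess_face(frame, box, margin=0.2, size=112):
    """Proportionally enlarge the face box, crop, resize and normalize.

    frame: HxWx3 uint8 image from the cab camera.
    box:   (x1, y1, x2, y2) driver face frame from the face recognition model.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    # enlarge each side by the margin ratio, clipped to the image borders
    x1 = max(0, int(x1 - margin * bw))
    y1 = max(0, int(y1 - margin * bh))
    x2 = min(w, int(x2 + margin * bw))
    y2 = min(h, int(y2 + margin * bh))
    crop = cv2.resize(frame[y1:y2, x1:x2], (size, size))
    # map pixel values to [-1, 1]; the exact constants are an assumption
    return (crop.astype(np.float32) / 255.0 - 0.5) / 0.5
```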
S220, inputting the preprocessed image data into the pre-trained multitask network model for attribute identification, obtaining the model's attribute identification results for the target tasks of whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call.
In this scheme, the multitask network model can analyze the cropped image and determine whether the driver wears a mask, wears ordinary or infrared-blocking glasses, is smoking, and is making a phone call.
And S230, caching the attribute identification result of the target task.
In this embodiment, optionally, before the cropped image is input into the multitask network model for training, the method further includes:
labeling the sample image data with multitask attributes;
determining an initial architecture of the multitask network model;
and determining the loss functions.
Accordingly, the initial architecture of the multitask network model adopts an SE-ResNet18 network, whose output feature maps have a 7 × 7 scale. These 7 × 7 feature maps are further processed by a one-squeeze multi-excitation (OSME) module to extract four different types of feature maps, namely mask, glasses, smoking and phone-call feature maps, used to judge the features of the different attribute identification tasks.
The attribute classification losses generated by the four attribute identification tasks during multitask attribute identification are determined; each attribute classification loss compares the difference between the network output and the labeling result during training and reduces it through forward and back propagation in the multitask network model. These attribute classification losses are multiplied by different weights, which depend on the difficulty of computing each attribute feature.
The four groups of 7 × 7 attention feature maps produced by the four different attribute identification tasks are input together into the metric loss function.
Specific embodiment of the metric loss: the four groups of 7 × 7 attention feature maps are stretched into vectors and the similarities between different images in a batch are calculated. The loss requires the attention similarity S_sa within the same task category to be higher than the attention similarity S_da across different categories. The metric loss function itself survives in this record only as an equation image; the accompanying text notes that n is the number of categories in a batch multiplied by the number of different categories, and pos is the number of same-category pairs.
The metric loss function is mainly used to make the attention feature maps for smoking, phone-call, mask and glasses identification mutually exclusive, so that the different attribute features attend to different discriminative regions of the original image.
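Since the formula survives only as an image, the following is a hedged reconstruction of such a metric loss in the spirit of an N-pair formulation (matching the Lnp term in the total loss below); the softplus form and cosine similarity are assumptions, not the patent's exact equation.

```python
import torch
import torch.nn.functional as F

def metric_loss(features, task_labels):
    """N-pair-style metric loss over flattened attention maps.

    features:    (B, D) attention feature maps stretched into vectors.
    task_labels: (B,) task-category index (mask / glasses / smoking / call).
    Encourages same-category similarity S_sa to exceed cross-category S_da.
    """
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                          # pairwise similarities
    same = task_labels.unsqueeze(0) == task_labels.unsqueeze(1)
    eye = torch.eye(len(task_labels), dtype=torch.bool, device=feats.device)
    s_sa = sim[same & ~eye]                          # same-category pairs
    s_da = sim[~same]                                # cross-category pairs
    if s_sa.numel() == 0 or s_da.numel() == 0:
        return sim.new_zeros(())
    # smooth hinge: penalize any cross-category similarity approaching
    # or exceeding a same-category similarity
    return F.softplus(s_da.unsqueeze(0) - s_sa.unsqueeze(1)).mean()
```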
The preprocessed sample image data correspond one-to-one with the labeled multitask attributes and are arranged in order. The preprocessed sample images are input into the network, the final outputs are compared one by one with the labeling results via the loss functions, and the differences are reduced by repeated forward and backward propagation. Training finally converges to the point where the difference between the network output and the labeling results is negligible, and the network parameters at that point are the required final result. The loss weights are adjusted according to the multitask attribute identification results and the convergence speed, so as to optimize the training results and speed of the multitask network model. In this way the weights of the loss functions can be adjusted to optimize the multitask network model.
S240, after caching the attribute identification results for a specific duration, judging from the ratio of positive judgments in the cached results whether the target-task attribute exists; if so, determining the target-task attribute of the sample image.
The caching time for the attribute identification results can be set as needed: N consecutive frames are selected, and if the positive ratio exceeds a set proportion, the attribute is judged to exist. For example, for mask-state detection: the initial state is "mask not worn"; starting from the Nth frame, the sum M of the mask-state cache values of the previous N frames is divided by N to obtain the mask-wearing proportion, and if this value exceeds the set proportion, the state becomes "mask worn". The glasses, smoking and phone-call states are judged in the same way.
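As a small sketch of this ratio vote, the window length and threshold below are placeholders; the patent leaves both to be set as needed.

```python
from collections import deque

class AttributeSmoother:
    """Ratio vote over the last N frames, as in the mask-state example."""

    def __init__(self, n_frames=30, ratio=0.6):
        self.buffer = deque(maxlen=n_frames)
        self.ratio = ratio

    def update(self, present: bool) -> bool:
        """Feed one per-frame result; return the smoothed attribute state."""
        self.buffer.append(1 if present else 0)
        if len(self.buffer) < self.buffer.maxlen:
            return False  # initial state: attribute absent until N frames seen
        return sum(self.buffer) / len(self.buffer) > self.ratio

# one smoother per attribute: mask, glasses, smoking, phone call
smoothers = {name: AttributeSmoother()
             for name in ("mask", "glasses", "smoking", "calling")}
```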
In this embodiment, the multitask network model attribute recognition training process includes:
inputting the preprocessed image data into the multitask main network to obtain a feature map, using an OSME module to extract separate feature maps for the different attribute identification tasks, and inputting these feature maps into their respective attribute identification modules, namely a module for whether a mask is worn, a module for whether glasses are worn, a module for whether the driver is smoking, and a module for whether the driver is making a phone call;
the outputs of the whole network are the feature map of each task and the attribute identification result of each task, the per-task outputs being whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call;
comparing the attribute identification results with the labeling results and reducing the differences through the corresponding loss functions, inputting the feature map of each task into its corresponding loss function, and optimizing the network parameters;
and adjusting the weights of the loss functions to optimize the training results of the multitask network model and reduce the training time.
In this technical scheme, the attribute identification tasks of whether a mask is worn, whether the driver is smoking and whether the driver is making a phone call each have two output categories;
the output categories of the glasses identification task comprise: wearing infrared-blocking glasses, wearing ordinary glasses, and not wearing glasses.
As is well known, many network structures exist and the required one can be selected. The multitask network structure used here is SE-ResNet18 with an adjusted initial architecture: the input size is set to 112 × 112; the 7 × 7 convolution kernels (stride = 2) of the first convolution layer are replaced with 3 × 3 kernels (stride = 1); the first max-pooling layer is deleted; the second convolution of the first ResNet block's stage is changed to stride 2; and in each residual block, BN layers and dropout are added, giving the structure BN layer, PReLU, convolution, BN layer, PReLU, convolution, dropout layer, with a residual-block dropout ratio of 0% to 20%.
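A minimal sketch of such a residual block and the modified stem, assuming PyTorch; the channel width and the 10% dropout are placeholders within the stated 0-20% range, and the squeeze-and-excitation branches of SE-ResNet18 are omitted here for brevity.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block in the order described above: BN, PReLU, conv,
    BN, PReLU, conv, dropout (ratio chosen in the 0-20% range)."""

    def __init__(self, channels, p_drop=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x):
        return x + self.body(x)

# Modified stem: a 3x3 stride-1 convolution replaces the original 7x7
# stride-2 one, and the first max-pooling layer is removed; with the first
# stage also given stride 2 as described above, a 112x112 input reaches the
# final stage as a 7x7 feature map after four stride-2 stages.
stem = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
```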
The 7 × 7 output feature maps are processed by the one-squeeze multi-excitation (OSME) module, which mainly comprises four squeeze-and-excitation branches. The four outputs of this module are added to the previous feature maps to obtain the feature maps of the four tasks; these four feature maps are pooled and then input into the four attribute identification modules, whose outputs are whether a mask is worn, whether glasses are worn, whether the driver is smoking plus a smoking APN, and whether the driver is making a phone call plus a phone-call attention proposal sub-network (APN). The mask, glasses, smoking and phone-call judgments are the attribute identification tasks, while the smoking APN and phone-call APN serve as inputs to the secondary attention model for region cropping and enlargement of the image. The attention proposal sub-network comprises two fully-connected layers connected in series, which output the center-point coordinates and the side length of the attention region so the feature map can be cropped and enlarged.
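A sketch of such an OSME module under stated assumptions: the reduction ratio of 16 is a conventional squeeze-and-excitation default rather than a value given in this text, and the residual addition follows the description above.

```python
import torch
import torch.nn as nn

class OSME(nn.Module):
    """One-squeeze multi-excitation: a single global-average 'squeeze'
    followed by one excitation branch per task (mask, glasses, smoking,
    phone call), each yielding a task-specific 7x7 feature map."""

    def __init__(self, channels, num_tasks=4, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excitations = nn.ModuleList(
            nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )
            for _ in range(num_tasks)
        )

    def forward(self, x):                      # x: (B, C, 7, 7) backbone map
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)
        # one channel-reweighted map per task, added back to the input map
        return [x + x * e(s).view(b, c, 1, 1) for e in self.excitations]
```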
The enlarged image is input into the first-level attention model, and the attributes are judged again: whether the driver is smoking plus a smoking APN, and whether the driver is making a phone call plus a phone-call APN. The APN generated by the first-level attention model further crops and enlarges the image, which is input into the second-level attention model for another attribute judgment.
The first-level and second-level attention neural network models have the same network structure as the main network model and output attribute identification results as described above. Thus the smoking and phone-call tasks each obtain three attribute identification results, and the model with the highest test accuracy is finally selected as the final model (the selection is made during training) to output a unique attribute identification result.
Smoking and phone-call behavior are the hardest to judge, so the attention mechanism is further applied to refine their results; however, these two tasks are not fixed, and the attention mechanism can be applied to whichever tasks require it.
The numbers of output categories for the mask task, the glasses task (infrared-blocking glasses, ordinary glasses, no glasses), the smoking task and the phone-call task are 2, 3, 2 and 2, respectively.
In the above technical solution, optionally, after the multitask network model is trained, the method further includes:
selecting the network architecture and model with the best attribute analysis results according to the test accuracy;
and deleting the networks and parameters that do not contribute to the final attribute identification output, reducing the size of the network model and the forward-pass time.
Fig. 3 is a schematic diagram of the training process of the multitask network model provided in an embodiment of the present application. As shown in Fig. 3, a driving-scene face recognition neural network analyzes the coding features. If the confidence is above the set threshold, the face-frame coordinates closest to the previous result are given; if it is below the threshold, the network directly outputs that no driver is present and the process ends. When the confidence is above the threshold, the face frame closest to the previous result is cropped, and after the cropped image is input into the main network, the mask, glasses, smoking and phone-call attributes can be identified.
Fig. 4 is a schematic diagram of the attention proposal sub-network provided in an embodiment of the application. As shown in Fig. 4, assuming the attention region of interest is square, the APN outputs x_t, y_t and l_t, where x_t and y_t are the abscissa and ordinate of the region's center point and l_t is the side length of the square; the classification branch outputs the probability p1 through a softmax function.
The original face image is cropped and enlarged according to the APN output, generating the local region face_scale2 as input to a new sub-network; this is the first-level attention model.
The new sub-network has the same structural design as the main network and outputs a classification network and an attention proposal sub-network APN; the probability p2 is output through a softmax function, and this APN, identical in form to the previous one, learns the attention-region information from the extracted features. This is the second-level attention model.
According to the output of the second-level attention model, the local region is again cropped and zoomed, generating face_scale3, which enters a sub-network of the same structure as the main network; the classification network finally outputs the probability p3.
Through these three scales of sub-networks, the initial face is turned into multi-scale fine-grained regions.
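A compact sketch of the APN head and the crop-and-zoom step, assuming PyTorch; the hidden width of 256 is a placeholder, the outputs are normalized to [0, 1], and the hard integer cropping below is a simplification (a differentiable soft mask would be needed to train the APN end to end).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APN(nn.Module):
    """Attention proposal sub-network: two fully-connected layers in series
    regressing the square attention region (x_t, y_t, l_t)."""

    def __init__(self, in_features, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),
            nn.Sigmoid(),              # normalized (x_t, y_t, l_t)
        )

    def forward(self, feat):           # feat: (B, in_features) pooled vector
        return self.fc(feat)

def crop_and_zoom(img, xyl, out_size=112):
    """Crop each square attention region and rescale it as the next scale."""
    b, _, h, w = img.shape
    xs, ys = xyl[:, 0] * w, xyl[:, 1] * h
    half = xyl[:, 2] * min(h, w) / 2.0
    crops = []
    for i in range(b):
        x1 = int(max(xs[i] - half[i], 0))
        y1 = int(max(ys[i] - half[i], 0))
        x2 = int(min(xs[i] + half[i], w))
        y2 = int(min(ys[i] + half[i], h))
        region = img[i:i + 1, :, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
        crops.append(F.interpolate(region, size=(out_size, out_size),
                                   mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)     # face_scale2 / face_scale3 input
```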
The glasses and mask subtasks, by contrast, directly use a softmax function, each output element representing the probability that the corresponding attribute is present.
The loss weights are adjusted according to the attribute identification results of the tasks to complete the training of the multitask network model.
On the basis of this technical scheme, the attention mechanism can optionally be applied to different tasks as needed, so the fine-grained classification loss and the Lrank loss generated by the attention proposal sub-network are likewise optional.
The fine-grained classification loss, like other classification losses, may be softmax or softmax cross-entropy, etc.; focal loss is used here.
The Lrank loss function is mainly used to make the classification of the second-level attention image better than that of the first-level attention image, driving the attention region toward a more discriminative area at each level.
Specifically, the multitask driver attribute identification task is trained with focal loss using the labeling information of the driver attributes (mask state, glasses state, smoking state and phone-call state).
Since almost all attributes suffer from an imbalanced overall sample proportion, the focal loss function is used to reduce the weight occupied by easy negative samples. The focal loss formula is as follows:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)

where α_t is the balance factor used to balance the uneven proportion of positive and negative samples.
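A sketch of a binary focal loss consistent with the formula above; the α = 0.25 and γ = 2 defaults come from the original focal-loss paper, since the patent does not state its values.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one binary task.

    logits:  (B,) raw scores for the positive class of an attribute task.
    targets: (B,) float 0/1 labels.
    """
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)          # prob of true class
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    return (-alpha_t * (1.0 - pt).pow(gamma)
            * torch.log(pt.clamp(min=1e-8))).mean()
```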
For the smoking and phone-call models, besides the classification focal loss generated by attribute classification, there is also the inter-scale classification loss Lrank generated by fine-grained feature classification:

Lrank = max(0, p_t^(s) - p_t^(s+1) + margin)

where p_t^(s) denotes the probability that the classification output of the scale-s attention model assigns to the correct category; the loss requires each finer attention scale to be more confident on the true class than the preceding one.
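A one-line sketch of this inter-scale ranking loss; the 0.05 margin is an assumption borrowed from the fine-grained attention literature, not a value stated in the text.

```python
import torch

def rank_loss(p_coarse, p_fine, margin=0.05):
    """Require the finer scale to be at least `margin` more confident on the
    correct class than the coarser scale (per-sample hinge, then mean)."""
    return torch.clamp(p_coarse - p_fine + margin, min=0).mean()
```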
Through experimentation, appropriate weights are set for the loss of each task. The overall loss is given by the following formula:

loss = W1 · loss_fl + W2 · loss_fl + W3 · (loss_fl + Lrank) + W4 · (loss_fl + Lrank) + W5 · Lnp

where W1, W2, W3, W4 and W5 are the loss weights of the respective tasks, manually adjusted through experiments so that each task's loss lies on the same order of magnitude.
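Putting the pieces together, a hedged sketch of the training objective; the default weights are placeholders to be hand-tuned as the text describes.

```python
def total_loss(fl_mask, fl_glasses, fl_smoke, fl_call,
               lrank_smoke, lrank_call, l_np,
               w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum matching the formula above: focal losses for the four
    attribute tasks, inter-scale rank losses for smoking and phone call,
    and the metric (N-pair) loss over the attention feature maps."""
    w1, w2, w3, w4, w5 = w
    return (w1 * fl_mask
            + w2 * fl_glasses
            + w3 * (fl_smoke + lrank_smoke)
            + w4 * (fl_call + lrank_call)
            + w5 * l_np)
```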
The multitask network model provided by this scheme can perform multitask attribute identification of the driver's state simultaneously, and uses multiple granularities to identify the smoking and phone-call states, improving the accuracy of model training.
Fig. 5 is a schematic structural diagram of a multitask attribute identifying device for a driving scenario according to an embodiment of the present application. As shown in fig. 5, the multitask attribute identifying device of the driving scenario includes:
an image data acquisition module 510 for acquiring image data of a driver in a driving scene;
the identification module 520 is configured to input the image data into a pre-trained multi-task network model for attribute identification; the attribute identification task of the multitask network model comprises the following steps: mask identification, glasses identification, smoking identification and calling identification;
And a multitask attribute identification result determining module 530, configured to determine a multitask attribute identification result of the driver according to an attribute identification output result of the multitask network model.
The above product can execute the method provided by any embodiment of the present application, and possesses the functional modules and beneficial effects corresponding to the method.
Embodiments of the present application also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for multitask attribute identification of a driving scenario, the method comprising:
acquiring image data of a driver in a driving scene;
inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
and determining the multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., hard disk) or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the Internet). The second computer system may provide the program instructions to the computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the multitask attribute identification operation of the driving scenario described above, and may also perform related operations in the multitask attribute identification method of the driving scenario provided in any embodiment of the present application.
The embodiment of the application provides equipment, and the multitask attribute recognition device of the driving scene provided by the embodiment of the application can be integrated into the equipment. Fig. 6 is a schematic structural diagram of an apparatus provided in an embodiment of the present application. As shown in fig. 6, the present embodiment provides an apparatus 600, comprising: one or more processors 620; the storage device 610 is configured to store one or more programs, and when the one or more programs are executed by the one or more processors 620, the one or more processors 620 are enabled to implement the method for multi-task attribute recognition of a driving scenario provided in an embodiment of the present application, the method includes:
acquiring image data of a driver in a driving scene;
inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
And determining the multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
Of course, those skilled in the art can understand that the processor 620 also implements the technical solution of the multitask attribute identification method for the driving scenario provided in any embodiment of the present application.
The apparatus 600 shown in fig. 6 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present application.
As shown in fig. 6, the apparatus 600 includes a processor 620, a storage device 610, an input device 630, and an output device 640; the number of the processors 620 in the device may be one or more, and one processor 620 is taken as an example in fig. 6; the processor 620, the storage 610, the input 630, and the output 640 of the apparatus may be connected by a bus or other means, such as the bus 650 in fig. 6.
The storage device 610 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the multitask attribute identification method for driving scenarios in the embodiments of the present application.
The storage device 610 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. In addition, the storage 610 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 610 may further include memory located remotely from the processor 620, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 630 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the apparatus. The output device 640 may include a display screen, speakers, etc.
The equipment provided by the embodiment of the application can adopt the multitask network model to identify multiple driver characteristics simultaneously and output a result that reflects the driver's overall driving state, thereby improving the accuracy of driver-state identification.
The multitask attribute identification device, storage medium and equipment for driving scenarios provided in the above embodiments can execute the multitask attribute identification method for driving scenarios provided in any embodiment of the application, and possess the corresponding functional modules and beneficial effects for executing that method. Technical details not described in detail above may be found in the multitask attribute identification method for driving scenarios provided in any embodiment of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (15)

1. A multitask attribute identification method for a driving scene is characterized by comprising the following steps:
acquiring image data of a driver in a driving scene;
inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
and determining the multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
2. The method of claim 1, wherein inputting the image data into the trained multitask network model for attribute identification comprises:
preprocessing the image data;
inputting the preprocessed image data into the pre-trained multitask network model for attribute identification, to obtain the model's attribute identification results on whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call;
caching the attribute identification results of the target tasks;
and after caching the attribute identification results for a specific duration, judging from the ratio of positive judgments in the cached results whether the target-task attribute exists; and if so, determining the target-task attribute of the sample image.
3. The method of claim 2, wherein pre-processing the image data comprises:
acquiring a sample image, and determining the coding characteristics of a human face in the sample image;
determining a face frame of a driver by using a face recognition model according to the coding features;
and adjusting and cropping the image according to the driver's face frame to obtain a cropped face image, and finally performing normalization.
4. The method of claim 1, wherein the training of the multitask network model comprises:
inputting the preprocessed image data into the multitask main network to obtain a feature map, using an OSME module to extract separate feature maps for the different attribute identification tasks, and inputting these feature maps into their respective attribute identification modules, namely a module for whether a mask is worn, a module for whether glasses are worn, a module for whether the driver is smoking, and a module for whether the driver is making a phone call;
the outputs of the whole network being the feature map of each task and the attribute identification result of each task, the per-task outputs being whether a mask is worn, whether glasses are worn, whether the driver is smoking and whether the driver is making a phone call;
comparing the attribute identification results with the labeling results and reducing the differences through the corresponding loss functions, inputting the feature map of each task into its corresponding loss function, and optimizing the network parameters;
and adjusting the weights of the loss functions to optimize the training results of the multitask network model and reduce the training time.
5. The method of claim 4, wherein prior to the training of the multitask network model, the method further comprises:
labeling multitask attributes to sample image data;
determining an initial architecture of the multitask network model;
and determining a loss function.
6. The method according to claim 4, wherein the attribute identification tasks of whether a mask is worn, whether the driver is smoking and whether the driver is making a phone call each have two output categories;
and the output categories of the glasses attribute identification task comprise: wearing infrared-blocking glasses, wearing ordinary glasses, and not wearing glasses.
7. The method according to claim 4, wherein the mask and glasses attribute identification modules are each used, respectively, for attribute identification and for outputting the results of whether a mask is worn and whether glasses are worn.
8. The method of claim 4, wherein, for the smoking and phone-call attribute identification modules, each identification module comprises a main-body identification module and an attention module;
the main-body identification module is used to output the attribute identification results of whether the driver is smoking and whether the driver is making a phone call;
and the attention module comprises an attention proposal sub-network and an attention network, and is used for attribute identification and for refining the attribute identification result of the main network.
9. The method of claim 8, wherein the attention proposal sub-network mainly comprises two fully-connected layers connected in series, which output the center-point coordinates and the side length of the attention region so that the original input image can be cropped and enlarged.
10. The method according to claim 8, wherein the attention network analyzes the cropped and enlarged image and outputs new results of whether the driver is smoking and whether the driver is making a phone call, or crops and enlarges the image again as required, performs network analysis and outputs an identification result.
11. The method according to claim 10, wherein the classification loss and the Lrank loss are calculated from the fine-grained features generated by the attention network, so as to constrain the attribute analysis results of the main-body network and the attention network.
12. The method of claim 4, wherein after the multitask network model training, the method further comprises:
selecting the network architecture and model with the best attribute analysis results according to the test accuracy;
and deleting the networks and parameters that do not participate in the final attribute identification output, reducing the size of the network model and the forward-pass time.
13. An apparatus for multitask attribute recognition of a driving scenario, the apparatus comprising:
the image data acquisition module is used for acquiring image data of a driver from a driving scene video;
the identification module is used for inputting the image data into a pre-trained multitask network model for attribute identification, where the attribute identification tasks of the multitask network model comprise: mask identification, glasses identification, smoking identification and phone-call identification;
and the multitask attribute identification result determining module is used for determining the multitask attribute identification result of the driver according to the attribute identification output result of the multitask network model.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for multitask attribute recognition of a driving scenario according to any one of claims 1-12.
15. An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of multitask attribute recognition of a driving scenario according to any one of claims 1-12 when executing the computer program.
CN202010662803.9A 2020-07-10 2020-07-10 Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene Pending CN111860253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662803.9A CN111860253A (en) 2020-07-10 2020-07-10 Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662803.9A CN111860253A (en) 2020-07-10 2020-07-10 Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene

Publications (1)

Publication Number Publication Date
CN111860253A 2020-10-30

Family

ID=73153220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662803.9A Pending CN111860253A (en) 2020-07-10 2020-07-10 Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene

Country Status (1)

Country Link
CN (1) CN111860253A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926427A (en) * 2021-02-18 2021-06-08 浙江智慧视频安防创新中心有限公司 Target user dressing attribute identification method and device
CN113065605A (en) * 2021-04-16 2021-07-02 平安国际智慧城市科技股份有限公司 Honeysuckle recognition model training method and device, computer equipment and medium
CN113806054A (en) * 2021-09-27 2021-12-17 北京市商汤科技开发有限公司 Task processing method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020134858A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Facial attribute recognition method and apparatus, electronic device, and storage medium


Similar Documents

Publication Publication Date Title
CN110399929B (en) Fundus image classification method, fundus image classification apparatus, and computer-readable storage medium
CN109255352B (en) Target detection method, device and system
CN109697416B (en) Video data processing method and related device
CN111860253A (en) Multitask attribute identification method, multitask attribute identification device, multitask attribute identification medium and multitask attribute identification equipment for driving scene
CN111191616A (en) Face shielding detection method, device, equipment and storage medium
CN110458165B (en) Natural scene text detection method introducing attention mechanism
CN111126258A (en) Image recognition method and related device
WO2021016873A1 (en) Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium
CN111027481B (en) Behavior analysis method and device based on human body key point detection
JP7166784B2 (en) Information processing device, information processing method and program
CN111597884A (en) Facial action unit identification method and device, electronic equipment and storage medium
CN112307886A (en) Pedestrian re-identification method and device
CN113807276A (en) Smoking behavior identification method based on optimized YOLOv4 model
CN110879982A (en) Crowd counting system and method
CN111814569A (en) Method and system for detecting human face shielding area
CN111292334B (en) Panoramic image segmentation method and device and electronic equipment
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN111259823A (en) Pornographic image identification method based on convolutional neural network
CN112446322A (en) Eyeball feature detection method, device, equipment and computer-readable storage medium
CN112597909A (en) Method and equipment for evaluating quality of face picture
CN112560584A (en) Face detection method and device, storage medium and terminal
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
CN114529890A (en) State detection method and device, electronic equipment and storage medium
CN113837257A (en) Target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 101, No. 1, East Ring 3rd Street, Jitiagang Village, Huangjiang Town, Dongguan City, Guangdong Province, 523750

Applicant after: Guangdong Zhengyang Sensor Technology Co.,Ltd.

Address before: Jitigang village, Huangjiang Town, Dongguan City, Guangdong Province

Applicant before: DONGGUAN ZHENGYANG ELECTRONIC MECHANICAL Co.,Ltd.