CN111723596B - Gaze area detection and neural network training method, device and equipment - Google Patents


Info

Publication number
CN111723596B
CN111723596B
Authority
CN
China
Prior art keywords
image
area
face image
category
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910204566.9A
Other languages
Chinese (zh)
Other versions
CN111723596A (en)
Inventor
黄诗尧
王飞
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201910204566.9A
Priority to PCT/CN2019/129893
Priority to KR1020217022190A
Priority to JP2021540840A
Publication of CN111723596A
Application granted
Publication of CN111723596B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/197 Matching; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of this specification provide a gaze area detection method, a neural network training method, and corresponding apparatuses and devices. The method for training a neural network for gaze area detection includes: inputting at least a face image serving as a training sample into a neural network, where the face image carries gaze area category annotation information for the face it contains, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing a specified spatial region in advance; extracting features from the input face image via the neural network and determining gaze area category prediction information of the face image according to the extracted features; determining the difference between the obtained gaze area category prediction information and the gaze area category annotation information of the corresponding image; and adjusting network parameters of the neural network based on the difference.

Description

Gaze area detection and neural network training method, device and equipment
Technical Field
The present disclosure relates to computer vision, and in particular, to a method, apparatus, and device for gaze area detection and neural network training.
Background
With the rapid development of artificial intelligence and the automobile industry, applying artificial intelligence technology to mass-produced vehicles has become one of the most promising market directions. Among the artificial intelligence products currently in urgent demand in the vehicle market, one is used to monitor the driver's state while driving, for example whether the driver is distracted, so that the driver can be reminded in time when distraction occurs, reducing the risk of accidents.
Disclosure of Invention
In view of this, it is an object of one or more embodiments of the present disclosure to provide a method, apparatus and device for gaze area detection and training of neural networks.
In a first aspect, there is provided a training method of a neural network for gaze area detection, the method comprising:
inputting at least a face image serving as a training sample into a neural network, where the face image carries gaze area category annotation information corresponding to the face in the image, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing a specified spatial region in advance;
performing feature extraction on the input face image via the neural network, and determining gaze area category prediction information of the face image according to the extracted features;
determining a difference between the obtained gaze area category prediction information and the gaze area category annotation information of the corresponding image; and
adjusting network parameters of the neural network based on the difference.
In connection with any one of the embodiments provided in the present disclosure, before inputting at least the face image serving as the training sample into the neural network, the method further includes: cropping at least one eye region from the face image to obtain at least one eye image. The inputting of at least the face image serving as the training sample into the neural network includes: simultaneously inputting the face image and the at least one eye image of the face image into the neural network.
In combination with any one of the embodiments provided in the present disclosure, the simultaneously inputting the face image and the at least one eye image of the face image into the neural network includes: adjusting the face image and each of the at least one eye image of the face image to the same predetermined size, and simultaneously inputting the resized images into the neural network. The performing feature extraction on the input face image via the neural network and determining gaze area category prediction information of the face image according to the extracted features includes: simultaneously extracting features from each input image via the neural network, and determining gaze area category prediction information of the face image according to the extracted features.
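The embodiments above do not fix a concrete input layout. One plausible preprocessing sketch, assuming the resized face and eye crops are stacked along the channel dimension before entering a single network (an assumption for illustration, not something the disclosure mandates), is:

```python
import cv2
import numpy as np

def build_joint_input(face_img, eye_imgs, size=(224, 224)):
    """Resize the face crop and each eye crop to one predetermined size
    and stack them channel-wise into a single network input.
    `size` is an illustrative choice; the disclosure only requires that
    all crops share the same predetermined size."""
    crops = [face_img] + list(eye_imgs)
    resized = [cv2.resize(img, size) for img in crops]
    # e.g. 1 face crop + 2 eye crops, each HxWx3 -> one HxWx9 input array
    return np.concatenate(resized, axis=-1)
```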
In combination with any one of the embodiments provided in the present disclosure, the simultaneously inputting the face image and the at least one eye image of the face image into the neural network includes: inputting the face image and the at least one eye image into different feature extraction branches included in the neural network, respectively, where the face image and the eye image input into the neural network differ in size. The performing feature extraction on the input face image via the neural network and determining gaze area category prediction information of the face image according to the extracted features includes: extracting, through each feature extraction branch, the features of the face image or eye image input into that branch; fusing the features of the face image and the features of the eye image extracted by the respective branches to obtain fused features; and determining the gaze area category prediction information of the face image according to the fused features.
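A minimal sketch of such a two-branch arrangement follows. The layer sizes and the concatenation-based fusion are illustrative assumptions; the disclosure only requires separate feature extraction branches followed by feature fusion and classification.

```python
import torch
import torch.nn as nn

class TwoBranchGazeNet(nn.Module):
    """Illustrative two-branch network: one branch for the face crop,
    one for an eye crop, with concatenation as the fusion step."""
    def __init__(self, num_regions=10):
        super().__init__()
        self.face_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.eye_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(32 + 32, num_regions)

    def forward(self, face, eye):
        # fuse the per-branch features, then predict one score per gaze area
        fused = torch.cat([self.face_branch(face), self.eye_branch(eye)], dim=1)
        return self.classifier(fused)
```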
In combination with any one of the embodiments provided in the present disclosure, the performing feature extraction on the input face image via the neural network and determining gaze area category prediction information of the face image according to the extracted features includes: performing a dot product operation between the features extracted from the face image and the weight of each category to obtain an intermediate vector, where the category weights correspond to the multiple classes of defined gaze areas respectively, and the number of dimensions of the intermediate vector equals the number of classes of defined gaze areas; when the dot product between the extracted features and the category weight corresponding to the gaze area category annotation information of the face image is computed, adjusting the cosine of the angle between the feature vector and that category weight so as to increase the inter-category distance and reduce the intra-category distance; and determining the gaze area category prediction information of the face image according to the intermediate vector.
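This resembles large-margin classification layers; since the disclosure does not fix the exact form of the adjustment, the sketch below applies an additive cosine margin on the labelled class purely as an illustration. The margin and scale values are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginClassifier(nn.Module):
    """Sketch of the class-weight dot product with a margin applied only
    to the ground-truth class, so that inter-class distances grow and
    intra-class distances shrink during training."""
    def __init__(self, feat_dim, num_regions, margin=0.2, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_regions, feat_dim))
        self.margin, self.scale = margin, scale

    def forward(self, feats, labels=None):
        # cosine between the extracted features and every category weight
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        if labels is not None:                   # training: adjust the labelled class only
            onehot = F.one_hot(labels, cos.size(1)).float()
            cos = cos - self.margin * onehot     # adjusted cosine for the annotated category
        return self.scale * cos                  # intermediate vector, one score per gaze area
```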
In connection with any one of the embodiments provided in the present disclosure, the specified spatial region includes a spatial region of a vehicle.
In combination with any one of the embodiments provided in the present disclosure, the face image is determined based on an image acquired of the driving area within the spatial region of the vehicle, and the multiple classes of defined gaze areas obtained by dividing the specified spatial region include two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
In a second aspect, there is provided a gaze area detection method, the method comprising:
cropping a face region from an image acquired in a specified spatial region to obtain a face image;
inputting the face image into a neural network, where the neural network has been trained in advance on a face image set that includes gaze area category annotation information, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing the specified spatial region in advance; and
extracting features from the input face image via the neural network, and determining the gaze area detection category corresponding to the face image according to the extracted features.
In combination with any one of the embodiments provided in the present disclosure, after the cropping of the face region from the image acquired in the specified spatial region to obtain the face image, the method further includes: cropping at least one eye region from the face image to obtain at least one eye image. The inputting of the face image into the neural network includes: simultaneously inputting the face image and the at least one eye image of the face image into the neural network, where the neural network has been trained in advance using a face image set that includes gaze area category annotation information together with eye images cropped from each face image in the set. The extracting of features from the input face image via the neural network includes: extracting features from the input face image and the at least one eye image via the neural network.
In combination with any one of the embodiments provided in the present disclosure, the simultaneously inputting the face image and the at least one eye image into the neural network includes: adjusting the face image and each of the at least one eye image to the same predetermined size, and simultaneously inputting the resized images into the neural network. The extracting of features from the input face image and the at least one eye image via the neural network and determining the gaze area detection category corresponding to the face image according to the extracted features includes: simultaneously extracting features from the input face image and the at least one eye image via the neural network, and determining the gaze area detection category corresponding to the face image according to the extracted features.
In combination with any one of the embodiments provided in the present disclosure, the simultaneously inputting the face image and the at least one eye image into the neural network includes: inputting the face image and the at least one eye image into different feature extraction branches included in the neural network, respectively, where the face image and the eye image input into the neural network differ in size.
The extracting of features from the input face image and the at least one eye image via the neural network and determining the gaze area detection category corresponding to the face image according to the extracted features includes: extracting, through each feature extraction branch, the features of the face image or eye image input into that branch; fusing the features of the face image and the features of the eye image extracted by the respective branches to obtain fused features; and determining the gaze area detection category corresponding to the face image according to the fused features.
In combination with any one of the embodiments provided in the present disclosure, before the cropping of the face region from the image acquired in the specified spatial region, the method further includes: acquiring images captured by a plurality of cameras deployed in the specified spatial region, where the images are captured by the cameras from different angles of a specific sub-region within the specified spatial region; and determining, according to an image quality evaluation index, the image with the highest image quality score among the images captured by the cameras at the same moment, and using it as the image to be cropped.
In combination with any one of the embodiments provided in the present disclosure, before the cropping of the face region from the image acquired in the specified spatial region, the method further includes: acquiring images captured by a plurality of cameras deployed in the specified spatial region, where the images are captured by the cameras from different angles of a specific sub-region within the specified spatial region.
The determining of the gaze area detection category corresponding to the face image according to the extracted features includes: performing gaze area detection, using any gaze area detection method in this specification, on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze area detection categories; and selecting, from the plurality of gaze area detection categories, the gaze area detection category corresponding to the image with the highest image quality score determined according to the image quality evaluation index as the gaze area detection category at that moment.
In combination with any one of the embodiments provided in the present disclosure, the image quality evaluation index includes at least one of: whether the image contains an eye image, the clarity of the eye region in the image, the degree of occlusion of the eye region in the image, and the open or closed state of the eyes in the image.
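A minimal sketch of this quality-based selection follows. The indicator values are assumed to come from upstream detectors that are not shown, and the weights are purely illustrative.

```python
def image_quality_score(has_eyes, eye_clarity, eye_occlusion, eyes_open,
                        weights=(1.0, 1.0, 1.0, 0.5)):
    """Combine the listed quality indicators into a single score.
    `eye_clarity` and `eye_occlusion` are assumed to lie in [0, 1]."""
    if not has_eyes:
        return 0.0                         # no eye image at all
    w_base, w_clarity, w_occ, w_open = weights
    return (w_base
            + w_clarity * eye_clarity      # clarity of the eye region
            + w_open * (1.0 if eyes_open else 0.0)
            - w_occ * eye_occlusion)       # occlusion of the eye region

def select_best_frame(frames_at_t):
    """Among the images captured by the cameras at the same moment,
    pick the one with the highest quality score. `frames_at_t` is a
    list of (image, score) pairs."""
    return max(frames_at_t, key=lambda pair: pair[1])[0]
```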
In combination with any one of the embodiments provided in the present disclosure, before the cropping of the face region image and the at least one eye region image from the image acquired in the specified spatial region, the method further includes: acquiring images captured by a plurality of cameras deployed in the specified spatial region, where the images are captured by the cameras from different angles of a specific sub-region within the specified spatial region.
The determining of the gaze area detection category corresponding to the face image according to the extracted features includes: performing gaze area detection, using any method in this specification, on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze area detection categories; and determining the majority result among the plurality of gaze area detection categories as the gaze area detection category at that moment.
In connection with any one of the embodiments provided in the present disclosure, the specified spatial region includes a spatial region of a vehicle.
In combination with any one of the embodiments provided in the present disclosure, the image acquired in the specified spatial region includes an image acquired of the driving area within the spatial region of the vehicle, and the multiple classes of defined gaze areas obtained by dividing the specified spatial region include two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
In combination with any one of the embodiments provided in the present disclosure, after the determining of the gaze area detection category corresponding to the face image according to the extracted features, the method further includes: determining an attention monitoring result for the person corresponding to the face image according to the gaze area category detection result; and outputting the attention monitoring result and/or outputting distraction prompt information according to the attention monitoring result.
In combination with any one of the embodiments provided in the present disclosure, after the determining of the gaze area detection category corresponding to the face image according to the extracted features, the method further includes: determining a control instruction corresponding to the gaze area category detection result according to that detection result; and controlling an electronic device to execute the operation corresponding to the control instruction.
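As a hedged illustration of the control-instruction embodiment, the mapping below is entirely hypothetical; the disclosure does not prescribe which gaze areas trigger which instructions, nor any particular device interface.

```python
# Hypothetical mapping from detected gaze area categories to device commands.
GAZE_TO_COMMAND = {
    "center_console": "wake_infotainment",
    "interior_mirror": "show_rear_camera",
}

def dispatch_control(gaze_category, device):
    """Look up the control instruction for the detected gaze area category
    and ask the electronic device to execute it. `device.execute` is an
    assumed interface used only for illustration."""
    command = GAZE_TO_COMMAND.get(gaze_category)
    if command is not None:
        device.execute(command)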
In a third aspect, there is provided a training device for a neural network for gaze area detection, the device comprising:
a sample input module, configured to input at least a face image serving as a training sample into a neural network, where the face image carries gaze area category annotation information corresponding to the face in the image, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing a specified spatial region in advance;
a category prediction module, configured to perform feature extraction on the input face image via the neural network and determine gaze area category prediction information of the face image according to the extracted features;
a difference determining module, configured to determine a difference between the obtained gaze area category prediction information and the gaze area category annotation information of the corresponding image; and
a network adjustment module, configured to adjust network parameters of the neural network based on the difference.
In combination with any one of the embodiments provided in the present disclosure, the sample processing module is specifically configured to: before inputting at least the face image serving as the training sample into the neural network, crop at least one eye region from the face image to obtain at least one eye image; and simultaneously input the face image and the at least one eye image of the face image into the neural network.
In combination with any one of the embodiments provided in the present disclosure, when configured to simultaneously input the face image and the at least one eye image of the face image into the neural network, the sample processing module is specifically configured to: adjust the face image and each of the at least one eye image of the face image to the same predetermined size, and simultaneously input the resized images into the neural network. The category prediction module is specifically configured to: simultaneously extract features from each of the input face image and the at least one eye image via the neural network, and determine gaze area category prediction information of the face image according to the extracted features.
In combination with any one of the embodiments provided in the present disclosure, when configured to simultaneously input the face image and the at least one eye image of the face image into the neural network, the sample processing module is specifically configured to: input the face image and the at least one eye image into different feature extraction branches included in the neural network, respectively, where the face image and the eye image input into the neural network differ in size.
The category prediction module is specifically configured to: extract, through each feature extraction branch, the features of the face image or eye image input into that branch; fuse the features of the face image and the features of the eye image extracted by the respective branches to obtain fused features; and determine the gaze area category prediction information of the face image according to the fused features.
In combination with any one of the embodiments provided in the present disclosure, when configured to determine gaze area category prediction information of the face image according to the extracted features, the category prediction module is configured to: perform a dot product operation between the features extracted from the face image and the weight of each category to obtain an intermediate vector, where the category weights correspond to the multiple classes of defined gaze areas respectively, and the number of dimensions of the intermediate vector equals the number of classes of defined gaze areas; when the dot product between the extracted features and the category weight corresponding to the gaze area category annotation information of the face image is computed, adjust the cosine of the angle between the feature vector and that category weight so as to increase the inter-category distance and reduce the intra-category distance; and determine the gaze area category prediction information of the face image according to the intermediate vector.
In connection with any one of the embodiments provided in the present disclosure, the specified spatial region includes a spatial region of a vehicle.
In combination with any one of the embodiments provided in the present disclosure, the face image is determined based on an image acquired of the driving area within the spatial region of the vehicle, and the multiple classes of defined gaze areas obtained by dividing the specified spatial region include two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
In a fourth aspect, there is provided a gaze area detection apparatus, the apparatus comprising:
an image acquisition module, configured to crop a face region from an image acquired in a specified spatial region to obtain a face image;
an image input module, configured to input the face image into a neural network, where the neural network has been trained in advance on a face image set that includes gaze area category annotation information, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing the specified spatial region in advance; and
a category prediction module, configured to extract features from the input face image via the neural network and determine the gaze area detection category corresponding to the face image according to the extracted features.
In combination with any one of the embodiments provided in the present disclosure, the image acquisition module is further configured to: after cropping the face region from the image acquired in the specified spatial region to obtain the face image, crop at least one eye region from the face image to obtain at least one eye image. The image input module is specifically configured to: simultaneously input the face image and the at least one eye image of the face image into the neural network, where the neural network has been trained in advance using a face image set that includes gaze area category annotation information together with eye images cropped from each face image in the set. When extracting features from the input face image via the neural network, the category prediction module is configured to: extract features from the input face image and the at least one eye image via the neural network.
In combination with any one of the embodiments provided in the present disclosure, when configured to simultaneously input the face image and the at least one eye image into the neural network, the image input module is specifically configured to: adjust the face image and each of the at least one eye image to the same predetermined size, and simultaneously input the resized images into the neural network. The category prediction module is specifically configured to: simultaneously extract features from the input face image and the at least one eye image via the neural network, and determine the gaze area detection category corresponding to the face image according to the extracted features.
In combination with any one of the embodiments provided in the present disclosure, when configured to simultaneously input the face image and the at least one eye image into the neural network, the image input module is specifically configured to: input the face image and the at least one eye image into different feature extraction branches included in the neural network, respectively, where the face image and the eye image input into the neural network differ in size. The category prediction module is specifically configured to: extract, through each feature extraction branch, the features of the face image or eye image input into that branch; fuse the features of the face image and the features of the eye image extracted by the respective branches to obtain fused features; and determine the gaze area detection category corresponding to the face image according to the fused features.
In combination with any one of the embodiments provided in the present disclosure, the image acquisition module is further configured to: before cropping the face region from the image acquired in the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, where each image is captured by the cameras from a different angle of a specific sub-region within the specified spatial region; and determine, according to an image quality evaluation index, the image with the highest image quality score among the images captured by the cameras at the same moment, and use it as the image to be cropped.
In combination with any one of the embodiments provided in the present disclosure, the image acquisition module is further configured to: before cropping the face region from the image acquired in the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, where each image is captured by the cameras from a different angle of a specific sub-region within the specified spatial region.
The category prediction module is specifically configured to: perform gaze area detection, in any manner described in this specification, on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze area detection categories; and select, from the plurality of gaze area detection categories, the gaze area detection category corresponding to the image with the highest image quality score determined according to the image quality evaluation index as the gaze area detection category at that moment.
In combination with any one of the embodiments provided in the present disclosure, the image quality evaluation index includes at least one of: whether the image contains an eye image, the clarity of the eye region in the image, the degree of occlusion of the eye region in the image, and the open or closed state of the eyes in the image.
In combination with any one of the embodiments provided in the present disclosure, the image acquisition module is further configured to: before cropping the face region from the image acquired in the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, where each image is captured by the cameras from a different angle of a specific sub-region within the specified spatial region.
The category prediction module is specifically configured to: perform gaze area detection, in any manner described in this specification, on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze area detection categories; and determine the majority result among the plurality of gaze area detection categories as the gaze area detection category at that moment.
In connection with any one of the embodiments provided in the present disclosure, the specified spatial region includes a spatial region of a vehicle.
In combination with any one of the embodiments provided in the present disclosure, the image acquired by the image acquisition module in the specified spatial region includes an image acquired of the driving area within the spatial region of the vehicle, and the multiple classes of defined gaze areas obtained by dividing the specified spatial region include two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
In connection with any one of the embodiments provided in the present disclosure, the apparatus further includes a first-class application module configured to: after the category prediction module determines the gaze area detection category corresponding to the face image according to the extracted features, determine an attention monitoring result for the person corresponding to the face image according to the gaze area category detection result; and output the attention monitoring result and/or output distraction prompt information according to the attention monitoring result.
In connection with any one of the embodiments provided in the present disclosure, the apparatus further includes a second-class application module configured to: after the category prediction module determines the gaze area detection category corresponding to the face image according to the extracted features, determine a control instruction corresponding to the gaze area category detection result according to that detection result; and control an electronic device to execute the operation corresponding to the control instruction.
In a fifth aspect, there is provided a training device for a neural network for gaze area detection, the device comprising a memory configured to store computer instructions executable on a processor, the processor implementing, when executing the computer instructions, the training method for a neural network for gaze area detection described in any part of this specification.
In a sixth aspect, there is provided a gaze area detection device, the device comprising a memory configured to store computer instructions executable on a processor, the processor implementing, when executing the computer instructions, the gaze area detection method described in any part of this specification.
In a seventh aspect, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the training method for a neural network for gaze area detection described in any part of this specification and/or the gaze area detection method described in any part of this specification.
With the gaze area detection method, the neural network training method, and the apparatuses and devices of the embodiments of this specification, the neural network is trained on face images serving as training samples together with their pre-annotated gaze area category annotation information, so that the gaze area corresponding to a face image can be predicted directly by the neural network.
Drawings
In order to more clearly illustrate one or more embodiments of this specification or the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and that other drawings may be obtained from these drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart of a training method of a neural network for gaze area detection according to at least one embodiment of the present disclosure;
Fig. 2 is an example of a plurality of gaze areas provided by at least one embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a neural network structure according to at least one embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a neural network according to at least one embodiment of the present disclosure;
Fig. 5 is a schematic diagram of another neural network according to at least one embodiment of the present disclosure;
Fig. 6 is a flowchart of the neural network training corresponding to Fig. 5;
Fig. 7 is a schematic diagram of eye image acquisition provided in at least one embodiment of the present disclosure;
Fig. 8 is a further neural network training process provided in at least one embodiment of the present disclosure;
Fig. 9 is a schematic diagram of the neural network structure corresponding to Fig. 8;
Fig. 10 is a flowchart of a gaze area detection method provided in at least one embodiment of the present disclosure;
Fig. 11 is a schematic diagram of a neural network application scenario provided in at least one embodiment of the present disclosure;
Fig. 12 is an alternative output of gaze area detection provided by at least one embodiment of the present disclosure;
Fig. 13 is a training device for a neural network for gaze area detection according to at least one embodiment of the present disclosure;
Fig. 14 is a gaze area detection device according to at least one embodiment of the present disclosure;
Fig. 15 is a gaze area detection device according to at least one embodiment of the present disclosure;
Fig. 16 is a training device for a neural network for gaze area detection provided in at least one embodiment of the present disclosure;
Fig. 17 is a gaze area detection device according to at least one embodiment of the present disclosure.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions in one or more embodiments of this specification are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of this specification without inventive effort shall fall within the scope of protection of the present disclosure.
At least one embodiment of the present disclosure provides a training method for a neural network for gaze area detection. Fig. 1 shows the flow of the training method, which may include the following processing:
In step 100, at least a face image serving as a training sample is input into a neural network.
In this step, the face image may be an image collected in a specific gaze area detection scene, and there are many such scenes: for example, a person's gaze area may be detected so that an intelligent device can be controlled automatically on the person's behalf; the preference or intention of a person may be obtained by detecting the person's gaze area; or a driver's driving concentration may be determined by detecting the driver's gaze area. Other gaze area detection application scenarios are not described in detail here. In these different scenes, a face image of the target person can be acquired, the face image containing the target person's face.
The neural network may take various forms, for example a convolutional neural network or a deep neural network. This embodiment does not limit the specific network structure.
In addition, the face image input to the neural network carries gaze area category annotation information corresponding to the face in the image, and the annotated gaze area category is one of multiple classes of defined gaze areas obtained by dividing a specified spatial region in advance. For example, in the gaze area detection scenarios listed above, a specific spatial region may be determined in advance, and gaze area detection on a face image means detecting where, within that spatial region, the area gazed at by the face in the image lies; different positions may carry different meanings. For example, different gaze positions may represent different levels of driving attention of a driver, or different preferences or intentions of the detected target person.
To distinguish these different meanings, the specified spatial region may be subdivided into a plurality of different sub-regions, each of which may be referred to as a gaze area. The gaze areas may be distinguished by different identifiers, for example gaze area A and gaze area B, or gaze area 5 and gaze area 6; this embodiment does not limit the manner in which different gaze areas are labelled. The identifiers A, B, 5, 6 and so on listed above may be referred to as gaze area categories. Defining gaze area categories facilitates training of the neural network, since the pre-annotated category can serve as the label for the network's category prediction. In this step, the face image input to the neural network may carry the gaze area category annotation information corresponding to the image, that is, the category of the gaze area at which the face in the image is actually gazing.
In step 102, feature extraction is performed on the input face image via the neural network, and gaze area category prediction information of the face image is determined according to the extracted features.
In this step, the neural network performs feature extraction on the input face image, and the extracted features include multiple image features of the face image. Predicted gaze area category prediction information may then be output based on these extracted features. The gaze area category prediction information output here may be one of the predefined gaze area categories, which may be represented by letters or numbers; for example, if the prediction after feature extraction is "5", the predicted gaze area is gaze area 5. Of course, this embodiment does not limit the manner in which gaze area categories are represented.
In step 104, the difference between the obtained gaze area category prediction information and the gaze area category annotation information of the corresponding image is determined.
For example, the difference between the gaze area category prediction information and the gaze area category annotation information of the corresponding image may be determined by a loss function. This embodiment does not limit the specific form of the loss function.
In step 106, network parameters of the neural network are adjusted based on the difference. For example, the network parameters of the neural network may be adjusted by gradient back-propagation.
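As a concrete illustration of steps 102 to 106, the sketch below uses cross-entropy as the loss function and a standard optimizer for back-propagation; both are common choices rather than requirements of this embodiment, and `network` and `optimizer` are assumed to have been constructed beforehand.

```python
import torch
import torch.nn as nn

def train_step(network, optimizer, face_images, region_labels):
    """One iteration: predict gaze area categories, measure the difference
    against the annotation information, and back-propagate to adjust the
    network parameters."""
    logits = network(face_images)                               # step 102: per-region scores
    loss = nn.functional.cross_entropy(logits, region_labels)   # step 104: difference
    optimizer.zero_grad()
    loss.backward()                                             # step 106: gradient back-propagation
    optimizer.step()
    return loss.item()
```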
With the training method for a neural network for gaze area detection of this embodiment, the neural network is trained on face images serving as training samples together with their pre-annotated gaze area category annotation information, so that the gaze area corresponding to a face image can be predicted directly by the neural network. In addition, compared with conventional gaze direction detection, detecting the gaze area of a face image with a neural network targets an area rather than a narrow line of sight: even if the driver's line of sight deviates or changes slightly, the detection result is not affected. The approach is simpler to implement, improves the fault tolerance of the detection, and makes the detection result more widely applicable, since the gaze area category result can be used in a variety of scenarios.
In the following, the training method of the neural network for gaze area detection is described in more detail, taking the monitoring of a vehicle driver's attention as an example. In this scenario, the face image input to the neural network is determined based on an image acquired of the driving area within the spatial region of the vehicle; for example, an image of the driving area may be acquired and the face region in that image cropped to obtain the face image of the driver. In the driver attention monitoring scenario, the predefined gaze areas are the areas that the driver may look at while driving.
It will be appreciated that other scenarios can apply the same training method; only the face image input to the neural network varies with the application scenario. In addition, the specified spatial region in which the gaze areas are located differs between scenarios: it may be a spatial region of a vehicle or a non-vehicle spatial region, for example the device region of some intelligent device. Even within a vehicle, in a scenario other than driver attention monitoring, the gaze areas may be spatial regions of the vehicle other than those illustrated in Fig. 2.
To reduce traffic accidents and improve driving safety in driver attention monitoring applications, one possible measure is to monitor the area the driver is looking at, in order to determine whether the driver is distracted. The gaze areas here are the multiple classes of defined gaze areas obtained by dividing a specified spatial region in advance, where the specified spatial region may be the region of the vehicle that the driver may look at while driving. This spatial region can be determined according to the vehicle structure and divided into a plurality of gaze areas. Gaze area categories may then be defined by identifying these gaze areas with different symbols; for example, the category of a certain gaze area may be defined as B.
Referring to Fig. 2, Fig. 2 illustrates several gaze areas of the driver. For example, the gaze areas include, but are not limited to, any of the following: a left front windshield 21, a right front windshield 22, an instrument panel 23, a left rearview mirror 24, a right rearview mirror 25, and so on. These are only examples: in practical implementations the number of gaze areas may be increased or decreased and their extents may be adjusted, depending on actual use requirements. For example, in addition to the areas above, Fig. 2 may include any of the following: an interior rearview mirror region 26, a center console region 27, a sun visor region 28, a gear lever region 29, a region 30 below the steering wheel, as well as the front passenger seat region, the glove box region in front of the front passenger seat, and the like.
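For reference, the division above can be captured as a simple category table. The letter codes are illustrative only: the first three follow the numbering example given later in this description (gear lever "A", front passenger seat "B", right front windshield "C"), and the remaining assignments are placeholders.

```python
from enum import Enum

class GazeRegion(Enum):
    """Illustrative category codes for the gaze areas of Fig. 2."""
    GEAR_LEVER = "A"
    FRONT_PASSENGER_SEAT = "B"
    RIGHT_FRONT_WINDSHIELD = "C"
    LEFT_FRONT_WINDSHIELD = "D"
    INSTRUMENT_PANEL = "E"
    LEFT_REARVIEW_MIRROR = "F"
    RIGHT_REARVIEW_MIRROR = "G"
    INTERIOR_REARVIEW_MIRROR = "H"
    CENTER_CONSOLE = "I"
    SUN_VISOR = "J"
```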
When judging whether the driver is distracted based on the detection of the driver's gaze area, the following approach may be taken: during normal driving, the driver will typically look mainly at the front windshield 21; if it is detected that the driver's gaze area has remained on the instrument panel 23 for a period of time, it may be determined that the driver is distracted. It can be seen that monitoring changes in the driver's gaze area is very important for ensuring driving safety.
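A minimal sketch of such a rule follows, assuming the per-frame gaze area categories of the last few seconds are available; the window length and the chosen off-road region are illustrative thresholds only.

```python
def is_distracted(recent_regions, window=30, off_road_region="INSTRUMENT_PANEL"):
    """Flag distraction when the detected gaze area has stayed on an
    off-road region (here the instrument panel, as in the example above)
    for a whole window of consecutive detections."""
    if len(recent_regions) < window:
        return False                       # not enough history yet
    return all(r == off_road_region for r in recent_regions[-window:])
```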
On this basis, an end-to-end neural network for gaze area detection may be provided. The neural network may be used to detect the gaze area of a driver in a vehicle: its input may be a face image of the driver captured by a camera, and its output may directly be the driver's gaze area. For example, if the driver is looking at the right front windshield 22, the gaze area output by the neural network may be indicated by an area code, e.g. the letter "B" for the right front windshield 22. Such an end-to-end network can determine the driver's gaze area quickly.
In the following, the training of the neural network for driver gaze area detection and the practical application of that network are described separately.
Training of neural networks for driver gaze areas
[Preparation of samples]:
Before training the neural network, a sample set may first be prepared. The sample set may include training samples for training the neural network and test samples for testing it.
For example, samples may be collected as follows.
Each gaze area to be recognized is determined in advance:
For example, ten gaze areas may be determined. These ten gaze areas include the gaze areas shown in Fig. 2, as well as several other areas such as the gear lever, the front passenger seat and the center console. That is, the purpose of training the neural network is to enable it to recognize automatically which of the ten areas the gaze area corresponding to an input image belongs to.
The ten gaze areas may be numbered. For example, the gear lever may be numbered "A", the front passenger seat "B", the right front windshield "C", and so on; this facilitates subsequent training and testing of the neural network. In the following description, identifiers such as "A" and "B" are referred to as the "categories" of the gaze areas.
After the gaze areas and their category representations have been determined, a captured person may be asked to sit in the driver's position of the vehicle and gaze at the ten gaze areas in turn. Each time the captured person gazes at one of the areas, a camera installed in the vehicle captures a face image of the person at the driver's position. Each gaze area may correspond to a plurality of acquired face images.
A correspondence may then be established between the category of each gaze area and the face images acquired for that area. The "category" in each correspondence serves as the gaze area category annotation information of the face image, i.e. each face image is an image acquired while the driver was gazing at the gaze area corresponding to that category annotation. Finally, the large number of collected samples may be divided into a training set and a test set. The training samples in the training set are used to train the neural network, and the test samples in the test set are used to test it. Each training sample may include: a face image of the driver and the gaze area category annotation information corresponding to the face in that image.
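A minimal sketch of this labelling and splitting step follows, assuming the face crops for each gaze area are stored in per-category folders; the directory layout and the 9:1 split ratio are illustrative assumptions.

```python
import random
from pathlib import Path

def build_sample_set(root, train_ratio=0.9, seed=0):
    """Pair every face crop with its gaze area category annotation and
    split the samples into a training set and a test set."""
    samples = []
    for category_dir in Path(root).iterdir():            # e.g. root/A, root/B, ...
        if category_dir.is_dir():
            for img_path in category_dir.glob("*.jpg"):
                samples.append((img_path, category_dir.name))  # (image, category label)
    random.Random(seed).shuffle(samples)
    split = int(len(samples) * train_ratio)
    return samples[:split], samples[split:]               # training set, test set
```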
[Determining the neural network structure]:
In at least one embodiment of this specification, a neural network for detecting the driver's gaze area may be trained, for example a convolutional neural network (Convolutional Neural Network, CNN) or a deep neural network. The specific structure of the neural network is not limited; in an optional implementation, the neural network may include network units such as convolutional layers, pooling layers, nonlinear (ReLU) layers and fully connected layers, stacked in a certain manner.
Fig. 3 illustrates a CNN structure to which at least one embodiment of the present disclosure may be applied. It should be noted that Fig. 3 is merely an example of a CNN structure, which is not limited in practical implementations.
As shown in Fig. 3, the CNN 300 may extract features from the input image 302 through a feature extraction layer 301. The feature extraction layer 301 may include, for example, a plurality of convolutional layers and pooling layers connected alternately: the convolutional layers extract different features of the image through multiple convolution kernels, yielding multiple feature maps, and the pooling layers perform local averaging and down-sampling on the feature maps output by the convolutional layers, reducing their resolution. As the number of convolutional and pooling layers increases, the number of feature maps grows and their resolution decreases.
The features in the feature maps finally extracted by the feature extraction layer 301 are flattened into a feature vector 304, which serves as the input vector of the fully connected layer 305. The fully connected layer 305 may include a plurality of hidden layers. Since the purpose of the CNN is to identify which gaze area the input image 302 corresponds to, a classification vector 307 is output by a classifier at the end of the fully connected layer 305; the classification vector 307 contains the probabilities that the input image belongs to each gaze area. The fully connected layer 305 converts the feature vector 304, through the hidden layers mentioned above, into the input vector 306 of the classifier; the number of elements of the input vector 306 equals the number of elements of the classification vector 307, which is the number of gaze areas to be recognized.
Before training the CNN, some network hyperparameters may be set, for example the number of convolutional and pooling layers in the feature extraction layer 301, the number of convolution kernels per layer and the size of the kernels. Parameters such as the values of the convolution kernels and the weights of the fully connected layers are learned through iterative training of the CNN. The training itself may follow conventional practice and is not described in detail here.
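The description of Fig. 3 maps onto a stack of alternating convolution and pooling blocks followed by fully connected layers and a classifier. The sketch below is one plausible instantiation; the layer counts, channel widths and the ten-region output are illustrative choices, not values fixed by the disclosure.

```python
import torch.nn as nn

def build_gaze_cnn(num_regions=10):
    """Feature extraction layer (alternating conv + pooling), flattening,
    a hidden fully connected layer, and a classifier whose output size
    equals the number of gaze areas to be recognized."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(4),
        nn.Flatten(),                           # feature vector 304
        nn.Linear(64 * 4 * 4, 128), nn.ReLU(),  # hidden layer of the FC block 305
        nn.Linear(128, num_regions),            # classifier scores; a softmax over these
    )                                           # gives the classification vector 307
```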
The neural network training may begin based on the preparation of training samples and the initialization of the CNN network structure. A neural network of the available driver gaze area may be trained as follows:
[ training neural network one ]
The embodiment can train an end-to-end neural network for detecting the gaze area of the driver.
For example, referring to the example of fig. 4, the structure of the neural network (for example, CNN) may be as shown in fig. 3, and the input of the CNN network may be a face image in the training sample.
The face image may be an entire face image of the driver collected by a camera installed in the vehicle, and the entire face image may be an image with a larger shooting range, for example, may include other parts other than the shoulder, the neck, and the like of the face. Then the whole face image can be cut through face detection, so that a face image which basically only comprises the face of the driver is obtained.
The input face image can extract image characteristics through a neural network, and output category prediction information of a gazing area corresponding to the face image according to the image characteristics, namely, the face image is obtained by shooting and collecting when predicting which category of gazing area the driver gazes. The gazing area corresponding to the face image is one of a plurality of gazing areas preset by the riders of the vehicles according to the vehicle structure, and the category is used as the identification of the gazing area.
For example, after processing by the convolution layers, pooling layers, and fully connected layer in the CNN, a classification vector may be output, where the classification vector may include the probabilities that the input image belongs to each gaze region. As shown in fig. 4, it is assumed that "A", "B", "C" … "J" are the categories of ten gazing areas (letters are used as an example in this embodiment; other category representations, such as numbers or category names, may be used directly as the category prediction result in practical implementations), "0.2" indicates that the probability that the input image belongs to gazing area A is 20%, and "0.4" indicates that the probability that the input image belongs to gazing area J is 40%. Assuming that the probability corresponding to J is the highest, "J" is the gazing area category prediction information output by the CNN for the input face image. Assuming further that the category of the gazing area to which the face image truly belongs is C, there is clearly a difference between the category prediction information (J) and the category labeling information (C). The loss value of the loss function can be derived from the difference between the gaze area prediction value and the label value.
During training of the neural network, the training sample can be divided into a plurality of image subsets (batch), one image subset is sequentially input to the neural network in each iteration training, and network parameters of the neural network, such as the weight of a full-connection layer, the value of a convolution kernel and the like, are adjusted by combining the loss values of the prediction results of all samples in the training sample included in the image subset and returning the loss values back to the neural network. After the iterative training is completed, the next image subset can be input into the neural network to perform the next iterative training. The different image subsets comprise training samples that are at least partially different. When a predetermined training end condition is reached, a training-completed CNN network may be obtained as a neural network for detecting the driver's gaze area. The predetermined training ending condition may be, for example, that the loss value is reduced to a certain threshold value, or that a predetermined number of iterations of the neural network is reached.
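A minimal training-loop sketch of the batch-wise procedure just described, assuming the hypothetical GazeAreaCNN sketch above, randomly generated stand-in data, cross-entropy as the loss over the classification vector, and SGD as the optimizer; the batch size, learning rate, and end conditions are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: face images resized to 224x224 with gaze-area labels 0..9.
images = torch.randn(256, 3, 224, 224)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)  # image subsets (batches)

model = GazeAreaCNN(num_gaze_areas=10)            # sketch defined after fig. 3 above
criterion = nn.CrossEntropyLoss()                 # loss from category prediction vs. category label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss_threshold, max_epochs = 0.05, 100            # assumed training end conditions
for epoch in range(max_epochs):
    for batch_images, batch_labels in loader:     # one image subset per iteration
        logits = model(batch_images)
        loss = criterion(logits, batch_labels)
        optimizer.zero_grad()
        loss.backward()                           # return the loss to the network
        optimizer.step()                          # adjust network parameters (conv kernels, FC weights)
    if loss.item() < loss_threshold:              # rough end condition on the last batch's loss
        break
```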
The neural network obtained by training in this embodiment takes the face image of the driver as its input, and its output may be a gaze area predicted value obtained from the classification vector of the classifier, for example, indicating that the driver is gazing at the gaze area of category D. The neural network can quickly identify the driver's gazing area, making it convenient to judge whether the driver is distracted according to the gazing area.
[ training neural network two ]
In order to improve accuracy of gaze region detection, in this embodiment, input to the neural network is improved.
Referring to the example of fig. 5, the inputs to the neural network may include: a face image and an eye image. The face image can be obtained by cropping the face out of a picture of the driver, and the eye image can be obtained by cropping the face image. For example, key points of the face may be detected in the face image, where the face key points may include eye key points, nose key points, eyebrow key points, and the like, and the eyes on the face in the image may be cropped according to the detected key points, so as to obtain an eye image of the driver that includes the driver's eyes.
It should be noted that the eye image may include: at least one of the left eye image and the right eye image. For example, the input of the neural network may include "face image+left eye image", or may also include "face image+right eye image", or may also include "face image+left eye image+right eye image", taking the simultaneous input of face image and left and right eye images as an example in fig. 5. By inputting the face image and the eye image into the neural network for training, the features of the face and the eyes can be learned at the same time, and the diversity and the characterization capability of the features are increased, so that the trained neural network can obtain more accurate detection results of the gazing region category.
Fig. 6 is a neural network training flow diagram corresponding to fig. 5, as shown in fig. 6, which may include:
in step 600, face keypoints in the face image are detected.
For example, key points of the face, such as eye key points, may be detected.
In step 602, the face image is clipped according to the face key points, so as to obtain an eye image including eyes in the image.
For example, the eye image includes the eyes of the driver. The eye images cut out in this step may include a left eye image and a right eye image of the driver. Referring to fig. 7, fig. 7 illustrates a left eye image 72 and a right eye image 73 that are clipped from a face image 71.
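The cropping of steps 600 and 602 could be sketched as follows, assuming the eye key points are already available as (x, y) coordinates from some face key point detector; the key point format and the margin factor are assumptions made only for illustration.

```python
import numpy as np

def crop_eyes(face_image, eye_keypoints, margin=0.3):
    """Crop left/right eye patches from a face image given eye key points.

    face_image: HxWx3 array; eye_keypoints: dict with 'left' and 'right' lists of (x, y) points.
    The key point format and the margin value are illustrative assumptions.
    """
    h, w = face_image.shape[:2]
    crops = {}
    for side, points in eye_keypoints.items():
        pts = np.asarray(points, dtype=np.float32)
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        # Expand the tight key point box by a margin so the whole eye region is kept.
        dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
        x0, x1 = int(max(0, x0 - dx)), int(min(w, x1 + dx))
        y0, y1 = int(max(0, y0 - dy)), int(min(h, y1 + dy))
        crops[side] = face_image[y0:y1, x0:x1]
    return crops  # {'left': left_eye_image, 'right': right_eye_image}
```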
In step 604, the face image and the eye image are adjusted to the same predetermined size. The step is to adjust the face image, the left eye image and the right eye image of the driver to the same size.
In step 606, the resized face image and the eye image are simultaneously input to the same feature extraction layer of the same neural network.
In step 608, the feature extraction layer of the neural network extracts features in the face image and the eye image at the same time, so as to obtain an extracted feature vector, where the feature vector includes: features in the face image and the eye image.
For example, the feature extraction layer of CNN may learn features of a face and left and right eyes at the same time, and extract feature vectors including both face image features and eye image features. Illustratively, the CNN may extract a plurality of Feature maps through a plurality of convolution layers, pooling layers, and the like, where the plurality of Feature maps include facial image features and eye image features, and obtain the Feature vector according to the plurality of Feature maps.
In step 610, based on the feature vector, gaze region category prediction information of the driver is determined.
For example, the feature vector may be converted into another intermediate vector by a fully connected layer in the CNN, the number of dimensions of which is the same as the number of categories of the gaze area. And, according to the intermediate vector, the probability that the face image of the driver belongs to each category of the gazing area is calculated through a classification algorithm, and the category corresponding to the maximum probability is used as the category prediction information. The intermediate vector may be, for example, the input vector 306 of the classifier.
In step 612, network parameters of the neural network are adjusted based on the differences between the category prediction information and the corresponding category annotation information for the gaze region.
The category labeling information indicates which category, among the multiple categories of defined gazing areas, the face in the image is gazing at. For example, the loss function value for a training sample may be calculated based on the difference between the category prediction information and the category labeling information, and the network parameters of the CNN may be adjusted based on the individual loss function values of a set of training samples.
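One possible reading of "inputting the resized face image and eye images to the same feature extraction layer simultaneously" (steps 604 to 608) is to stack the images along the channel dimension; the sketch below follows that assumption and reuses the hypothetical GazeAreaCNN above with a widened input.

```python
import torch
import torch.nn.functional as F

def build_joint_input(face_image, left_eye, right_eye, size=(224, 224)):
    """Resize the face image and both eye images to the same predetermined size and
    stack them along the channel dimension so a single feature extraction layer sees all of them.
    Channel stacking is one possible reading of 'input simultaneously'; it is an assumption here.
    Inputs are CxHxW tensors."""
    resized = [F.interpolate(img.unsqueeze(0), size=size, mode='bilinear', align_corners=False)
               for img in (face_image, left_eye, right_eye)]
    return torch.cat(resized, dim=1)   # shape (1, 3*C, H, W)

# joint = build_joint_input(face, left, right)
# logits = GazeAreaCNN(num_gaze_areas=10, in_channels=9)(joint)   # 9 channels for three RGB images
```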
By taking both the face image and the eye image as inputs, the neural network trained in this embodiment learns features of the face and the eyes at the same time. The eye features are the part most relevant to attention detection, and combining the face image with the eye image enhances the ability of the extracted features to characterize attention, thereby improving the accuracy with which the neural network detects the gazing area category; the network is also simple to implement.
[ training neural network three ]
The present embodiment trains a neural network of another structure. Fig. 8 illustrates another neural network training process that may be described in connection with the neural network structure illustrated in fig. 9.
In step 800, face keypoints in the face image are detected.
For example, the key points of the face, such as the eye key points, may be detected by face key point detection.
In step 802, the face image is cropped according to the face key points (e.g., eye key points) to obtain an eye image including eyes of the person in the image.
For example, the cropped eye image may comprise a left eye image and/or a right eye image of the human eye in the image.
In step 804, the face image, the left eye image, and the right eye image are respectively input into different feature extraction branches included in the neural network.
In this embodiment, the sizes of the face image and the eye images need not be adjusted as in fig. 6; instead, the cropped images may keep their sizes when input into the neural network, that is, the face image and the eye images input into the neural network have different sizes, and the face image and the eye images of different sizes are respectively input into different feature extraction branches of the neural network. In the structure illustrated in fig. 9, the face image, the left eye image, and the right eye image may be input into three feature extraction branches, respectively, where the left eye image and the right eye image are the same size and the face image is larger than both. For example, each of the three feature extraction branches may include multiple convolution layers, pooling layers, etc. for extracting image features, and the structures of the three feature extraction branches may be the same or different, e.g., they may include different numbers of convolution layers or different numbers of convolution kernels.
In step 806, a feature extraction branch of the neural network extracts features in the face image to obtain an extracted face feature vector; and simultaneously, extracting the features in the eye image by other feature extraction branches of the neural network to obtain an extracted eye feature vector.
For example, the three feature extraction branches may learn features in the respective images, where the feature extraction branch 1 may extract a face feature vector 91 from a face image, the feature extraction branch 2 may extract a left eye feature vector 92 from a left eye image, the feature extraction branch 3 may extract a right eye feature vector 93 from a right eye image, and the left eye feature vector 92 and the right eye feature vector 93 may be referred to as eye feature vectors.
In step 808, the face feature vector and the eye feature vector are fused, so as to obtain a fused feature vector, i.e. a fused feature. For example, in fig. 9, the face feature vector 91, the left eye feature vector 92, and the right eye feature vector 93 may be fused to obtain a fused feature vector 94. The feature fusion may be to splice and combine multiple vectors together, and the order of the combination is not limited.
In step 810, based on the fusion feature vector, gaze region category prediction information of the driver is obtained.
For example, the fusion feature vector may be converted into another intermediate vector by a fully connected layer in the CNN, the number of dimensions of which is the same as the number of categories of the gaze area. And, according to the intermediate vector, the probability that the face image of the driver belongs to each category of the gazing area is calculated through a classification algorithm, and the category corresponding to the maximum probability is used as the category prediction information.
In step 812, network parameters of the neural network are adjusted based on the differences between the category prediction information and the corresponding category annotation information for the gaze region.
The category labeling information identifies the gazing area the driver was gazing at when the face image, i.e. the face image processed in step 800, was collected. For example, the loss function of a training sample may be calculated based on the difference between the category prediction information and the category labeling information, and the network parameters of the neural network may be adjusted based on the individual loss functions of a set of training samples.
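A rough sketch of the three-branch structure of fig. 9, with concatenation as the fusion operation; the branch depths, feature dimensions, and the use of adaptive pooling (which lets the branches accept inputs of different sizes) are assumptions for illustration only, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class ThreeBranchGazeNet(nn.Module):
    """Sketch of the fig. 9 structure: separate feature extraction branches for the face,
    left eye and right eye images, followed by feature fusion (concatenation) and a classifier."""
    def __init__(self, num_gaze_areas=10):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),   # tolerates inputs of different sizes
                nn.Linear(32 * 4 * 4, out_dim),
            )
        self.face_branch = branch(128)       # feature extraction branch 1 -> face feature vector 91
        self.left_branch = branch(64)        # feature extraction branch 2 -> left eye feature vector 92
        self.right_branch = branch(64)       # feature extraction branch 3 -> right eye feature vector 93
        self.head = nn.Sequential(           # classifier over the fused feature vector 94
            nn.Linear(128 + 64 + 64, 128), nn.ReLU(),
            nn.Linear(128, num_gaze_areas),
        )

    def forward(self, face, left_eye, right_eye):
        fused = torch.cat([self.face_branch(face),
                           self.left_branch(left_eye),
                           self.right_branch(right_eye)], dim=1)   # splice the vectors together
        return self.head(fused)
```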
The neural network obtained through training in this embodiment extracts the features of the face image and the eye image respectively through different feature extraction branches, so that the images can keep their original sizes and the quality loss caused by image resizing can be reduced or even avoided. The features of the face and the eyes can therefore be extracted more accurately, the ability of the features to characterize attention is enhanced by fusing the face features and the eye features, and the category detection of the gazing area based on the fused features is more accurate.
In the training method of the neural network for gaze area detection according to any of the embodiments of the present disclosure, the neural network may distinguish feature vectors belonging to different categories of gaze areas within the feature space by a classification algorithm. However, feature vectors extracted from training data belonging to different gazing areas may lie very close to each other in the feature space, and in actual use a feature vector extracted from an input image may end up farther from the center of its true gazing area than from the center of an adjacent gazing area, thereby causing erroneous judgment. Therefore, in order to improve the quality of the feature vectors extracted by the network, a dot product operation may be performed between the image features extracted by the neural network (for example, the feature vector including the face image features and the eye image features) and each category weight respectively to obtain an intermediate vector; the category weights respectively correspond to the categories of the gazing area, and the number of dimensions of the intermediate vector is the same as the number of categories of the gazing area. When the dot product between the image features and the category weight corresponding to the category labeling information of the face image is computed, the cosine value of the angle between the image features and that category weight is adjusted so as to increase the inter-class distance and reduce the intra-class distance.
For example, a large margin softmax algorithm may be used to improve the quality of the feature vectors extracted by the network, enhance the compactness of the extracted features, and improve the accuracy of the final gaze region classification, as shown in the following formula (1):

$$
L_i = -\log \frac{e^{\|W_{y_i}\|\,\|X_i\|\,\psi(\theta_{y_i})}}{e^{\|W_{y_i}\|\,\|X_i\|\,\psi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\|W_j\|\,\|X_i\|\cos\theta_j}} \tag{1}
$$

In this algorithm, $L_i$ represents the loss value of the loss function for the $i$-th training sample, $\theta_{y_i}$ is the angle between $W_{y_i}$ and $X_i$, $W_j$ is the category weight corresponding to gaze region category $j$, $X_i$ is the image feature extracted by the CNN from the feature maps, and $y_i$ is the labeled gaze region category of sample $i$. The dot products $W_j^{\top} X_i = \|W_j\|\,\|X_i\|\cos\theta_j$ constitute the intermediate vector mentioned above; when $j = y_i$, i.e. for the dot product with the category weight corresponding to the category labeling information of the face image, the cosine term is replaced by the margin function $\psi(\theta_{y_i})$ of the large margin softmax (typically $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [k\pi/m, (k+1)\pi/m]$, with margin parameter $m$), which increases the inter-class distance and reduces the intra-class distance.
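The margin-based adjustment could be sketched as follows, assuming the standard large margin softmax formulation; the margin value m = 4 and the tensor shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def large_margin_logits(features, weights, labels, m=4):
    """Adjust the target-class entry of the intermediate vector as in the large margin softmax.

    features: (N, D) image features X_i; weights: (C, D) category weights W_j; labels: (N,) y_i."""
    logits = features @ weights.t()                              # W_j . X_i, the intermediate vector entries
    w_norm = weights.norm(dim=1).unsqueeze(0)                    # ||W_j||, shape (1, C)
    x_norm = features.norm(dim=1).unsqueeze(1)                   # ||X_i||, shape (N, 1)
    cos_theta = (logits / (w_norm * x_norm + 1e-8)).clamp(-1.0, 1.0)
    theta = torch.acos(cos_theta)
    k = torch.floor(m * theta / math.pi)
    psi = (-1.0) ** k * torch.cos(m * theta) - 2.0 * k           # psi(theta): margin on the target class
    target = F.one_hot(labels, num_classes=weights.shape[0]).bool()
    # For j = y_i use ||W_yi|| ||X_i|| psi(theta_yi); keep the plain dot product for the other classes.
    return torch.where(target, w_norm * x_norm * psi, logits)

# loss = F.cross_entropy(large_margin_logits(x, w, y), y)   # matches formula (1) up to the softmax/log
```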
In the above description, taking the driver attention monitoring scenario as an example, the training method of the neural network for gaze area detection is described in detail, and two possible neural network structures are listed. In other scenes than the driver's attention monitor scene, the neural network used in the other scenes may be trained in the same manner as long as the face images acquired in the respective scenes and the gaze areas predefined in the respective scenes are employed.
In the following description, how to apply the trained neural network for gaze area detection will be described. Of course, the neural network used for detecting the gaze area may be trained by a method other than the training method described in the present specification. Fig. 10 illustrates a gaze region detection method, which may include, as shown in fig. 10:
in step 1000, a face region in an image acquired in a specified spatial region is intercepted to obtain a face image. For example, the image acquired in the specified spatial region may be a larger-range image that includes a face, and the face region may be cropped from this image to obtain the face image.
In step 1002, the face image is input into a neural network, where the neural network performs training by using face image sets including gaze region category labeling information in advance, and the labeled gaze region category belongs to one of multiple types of defined gaze regions obtained by dividing the specified spatial region in advance.
The neural network of this embodiment may be a neural network obtained by using the training method shown in fig. 1, and the face image obtained in step 1000 is input into the neural network.
In step 1004, feature extraction is performed on the input face image through the neural network, and a gaze area detection category corresponding to the face image is determined according to the extracted features.
In this step, a gaze region corresponding to a face in the face image may be predicted by the neural network, and the predicted gaze region may be referred to as a gaze region detection class. Also, the gaze area detection category may be represented in various manners, such as letters, numbers, names, etc., and the present embodiment is not limited thereto.
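A minimal inference sketch for steps 1000 to 1004, assuming a trained model with the interface of the earlier sketches and a hypothetical mapping from category indices to area names; both are assumptions made only for illustration.

```python
import torch

GAZE_AREA_NAMES = {0: "left front windshield", 1: "right front windshield", 2: "right rearview mirror"}
# ... remaining categories omitted; the mapping above is an illustrative assumption.

def detect_gaze_area(model, face_image):
    """Run the trained network on a cropped face image (CxHxW tensor) and return the detected category."""
    model.eval()
    with torch.no_grad():
        logits = model(face_image.unsqueeze(0))          # (1, num_gaze_areas)
        probs = torch.softmax(logits, dim=1)             # classification vector: per-area probabilities
        category = int(probs.argmax(dim=1))              # category with the highest probability
    return category, GAZE_AREA_NAMES.get(category, str(category))
```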
According to this gazing area detection method, the gazing area detection category corresponding to the face image can be predicted and output directly by the pre-trained neural network. Compared with traditional gaze-line detection, a gazing area is wider than a single line of sight, so the detection result is not affected even if the driver's line of sight deviates or changes slightly. The method is simpler to implement, improves detection fault tolerance, and the gazing area category detection result can be applied more widely across various scenarios.
In the following, the driver attention monitoring scenario is taken as an example to describe how the neural network trained in this scenario is applied; it will be understood that neural networks trained for other scenarios may be applied in the same way.
Application of neural network in driver's gaze area
In applying the neural network to detect the region of gaze of the driver, the neural network used may be any of the trained neural network structures described above.
Referring to the neural network application scenario illustrated in fig. 11, a camera 1102 may be installed in a vehicle 1101 on which a driver is riding, the camera 1102 may collect an image including a face of the driver, the image 1103 may be transmitted to an image processing device 1104 in the vehicle, and a pre-trained neural network may be stored in the image processing device 1104. Referring to fig. 11, the image 1103 may be preprocessed prior to inputting into the neural network. For example, the face may be detected through a neural network to obtain a face image 1105, and the face image 1105 may be cropped to obtain a left-eye image 1106 and a right-eye image 1107. The face image 1105, the left eye image 1106, and the right eye image 1107 may be simultaneously input to a pre-trained neural network 1108, and output a detection category of a gaze area of a driver in the vehicle through the neural network 1108, for example, the driver is gazing at a gaze area of category B, the area of category B being a right rearview mirror. When the face image, the left eye image 1106, and the right eye image 1107 are simultaneously input to the neural network 1108, the face image, the left eye image 1106, and the right eye image 1107 may be adjusted to the same predetermined size and then input to the same network, or may be respectively input to different feature extraction branches in different sizes, which is described in detail above, and will not be described in detail.
Referring to fig. 12, fig. 12 illustrates an alternative output example of a gaze region category detection result. The category identifier of the gaze area detection category predicted and output by the neural network is "5", and "5" may correspond to the center console in the vehicle. In a practical implementation, a camera is disposed in the vehicle in which the driver in fig. 12 rides; the camera may acquire an image of the driver similar to that shown in fig. 12, and the face region may be cropped from this image to obtain a face image of the driver, for example, the area shown in box 1201 in fig. 12 may be the area corresponding to the face image. The face image can be input into the neural network in the image processing device 1104, and the neural network outputs the predicted gazing area detection category "5". The method has good real-time performance, and the gazing area of the driver can be detected quickly and accurately.
In addition, the driver generally adopts different head poses for different gazing areas. If only the image of a single camera is used, no matter where the camera is installed in the vehicle, the rotation of the driver's head may make one eye or even both eyes invisible, which affects the judgment of the final gazing area. Moreover, for a driver wearing glasses, it is also common that a camera at a certain angle captures lens reflections that leave the eye region partially or completely occluded. In order to solve the above problems, a plurality of cameras may be installed at different positions.
For example, multiple cameras 1102 may be mounted within a driver's occupied vehicle 1101, the multiple cameras 1102 may acquire images of the same driver from different angles, which may be images acquired for a driving area in a spatial area of the vehicle. Moreover, the acquisition time of a plurality of cameras can be synchronized, or the acquisition time of each frame of image can be recorded, so that a plurality of images acquired by different cameras at the same time for the same driver can be acquired during subsequent processing.
It will be appreciated that in other scenarios than driver attentiveness monitoring scenarios, multiple cameras may be deployed within a specified spatial region of a respective scenario to capture images within a particular sub-region within the specified spatial region. For example, in a smart device controlled scenario, the particular sub-region may be one region in which the target person controlling the smart device is located. By acquiring an image of the specific sub-area, an image comprising the face of the person can be obtained and the gaze area of the person detected accordingly.
Still taking the driver attention monitoring scene as an example, after a plurality of images of the same driver acquired by a plurality of cameras at the same moment are acquired, the gazing area of the driver can be comprehensively determined according to the plurality of images, and various modes are also available. Three ways are illustrated below, but the implementation is not limited thereto:
Mode one: the image with the highest image quality score in the images acquired by the cameras at the same time can be respectively determined according to the image quality evaluation indexes, and the image with the highest quality score is intercepted, for example, a face area in the image is intercepted to obtain a face image.
The image quality evaluation index may include at least one of: whether the image includes an eye image, the definition of the eye area in the image, the occlusion condition of the eye area in the image, and the opening and closing condition of the eyes in the image. For example, an image that includes a clear eye image, in which the eye area is not occluded and the eyes are fully open, may be taken as the image with the highest image quality score; this image is cropped and then input into the neural network.
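Mode one could be sketched as follows, assuming a hypothetical quality_score callable that combines the image quality evaluation indices listed above and a hypothetical crop_face helper; both names are assumptions, not part of the disclosure.

```python
def pick_best_image(images, quality_score):
    """Mode one sketch: from the images captured by multiple cameras at the same moment,
    keep the one with the highest image quality score."""
    return max(images, key=quality_score)

# face_image = crop_face(pick_best_image(frames_at_t, quality_score))   # crop_face is hypothetical
```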
Mode two: the gaze area detection method described in any embodiment of the present disclosure may be used to detect gaze areas for each image of different angles acquired by each camera at the same time, so as to obtain multiple gaze area detection categories. And selecting a gaze region detection category corresponding to the image with the highest image quality score determined according to the image quality evaluation index from the plurality of gaze region detection categories as the gaze region detection category at the moment. The image quality evaluation index is as described in the first mode.
Mode three: the gaze area detection method described in any embodiment of the present disclosure may be used to detect gaze areas for each image of different angles acquired by each camera at the same time, so as to obtain multiple gaze area detection categories. And selecting a majority of the plurality of gaze region detection categories from the plurality of gaze region detection categories as the gaze region detection category at the time. For example, among the categories corresponding to the six face images, five categories that are all identified as gaze areas are C, and then C can be used as the final identified category.
After the recognition of the gaze area of the driver, further applications may be made in accordance with the gaze area. For example, the attention monitoring result of the person corresponding to the face image may be determined according to the gazing area category detection result. For example, the gaze area category detection result may be a gaze area detection category within a preset period of time. For example, the gaze area category detection result may be "the gaze area of the driver is always the area B in the period of the preset length", and if the area B is the front windshield, it is indicated that the driver is more attentive to driving. If the area B is a glove box area in front of the co-driver, it is indicated that the driver is likely to be distracted and inattentive.
After the attention monitoring result is obtained, it may be output; for example, "driving is focused" may be displayed in a certain display area in the vehicle. Alternatively, distraction prompt information can be output according to the attention monitoring result, for example, a distraction warning prompting the driver to pay attention can be shown on a display screen. Of course, in a specific display, at least one of the attention monitoring result and the distraction prompt information may be displayed.
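A possible sketch of deriving an attention monitoring result from the gaze area detections collected over a preset period; the set of "attentive" categories and the ratio threshold are assumptions, not values from the disclosure.

```python
ATTENTIVE_AREAS = {"A", "B"}   # assumed categories corresponding to areas watched during normal driving

def monitor_attention(gaze_history, min_attentive_ratio=0.7):
    """Derive an attention monitoring result from gaze area categories over a preset period
    and decide whether to output a distraction prompt."""
    attentive = sum(1 for category in gaze_history if category in ATTENTIVE_AREAS)
    focused = attentive / max(1, len(gaze_history)) >= min_attentive_ratio
    return ("driving is focused", None) if focused else ("distracted", "please pay attention to the road")
```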
Determining the attention monitoring result of the person or outputting the distraction prompt information according to the gazing area category detection result is of great help for monitoring the driver's attention: situations in which the driver is not paying attention can be effectively detected and prompted in time, thereby reducing the occurrence of accidents.
In the above description, taking the driver attention monitoring scene as an example, the detection of the gaze area may have multiple uses. The following examples several possible applications, but are not limited thereto:
for example, vehicle-to-machine interaction control based on gaze area detection may be performed. Some electronic devices, such as a multimedia player, can be arranged in the vehicle, and the multimedia player can be automatically controlled to start a playing function by detecting the gazing area of a passenger in the vehicle.
For example, a face image of a passenger is obtained by shooting through a camera disposed in a vehicle, and a detection result of a gazing area category is detected through a pre-trained neural network, for example, the detection result may be [ in a period of time T, the gazing area of the passenger is always an area where a "gazing on" option on a certain multimedia player in the vehicle is located ], if it is determined that the passenger wants to start the multimedia player, a corresponding control instruction is output, and the multimedia player is controlled to start playing.
For another example, in addition to vehicle-related applications, various application scenarios such as game control, smart home device control, advertisement push, and the like may be included. Taking intelligent home control as an example, face images of target control persons can be acquired, and detection results of gazing region categories are detected through a pre-trained neural network, for example, the detection results can be [ in a period of time T, the gazing region of the control person is always the region where a gazing opening option on an intelligent air conditioner is located ], if the control person is determined to start the intelligent air conditioner, a corresponding control instruction is output, and the air conditioner is controlled to be opened.
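A sketch of the gaze-triggered control logic described in the two examples above, assuming detections arrive at a fixed frame rate and a hypothetical send_instruction callback to the controlled device; the instruction name is also an assumption.

```python
def gaze_control(gaze_history, target_category, hold_seconds, fps, send_instruction):
    """If the detected gaze area stays on the region of a device control (e.g. a 'start playing'
    option) for a period of length T = hold_seconds, emit the corresponding control instruction."""
    needed = int(hold_seconds * fps)
    if len(gaze_history) >= needed and all(c == target_category for c in gaze_history[-needed:]):
        send_instruction("start_playing")   # assumed instruction name for the controlled device
```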
Fig. 13 provides a training apparatus for a neural network for gaze area detection, as shown in fig. 13, which may include: a sample input module 1301, a category prediction module 1302, a variance determination module 1303, and a network adjustment module 1304.
The sample input module 1301 is configured to input at least a face image serving as a training sample into a neural network, where the face image includes gazing area category labeling information corresponding to a face in the face image, and a gazing area category labeled belongs to one of multiple types of defined gazing areas obtained by dividing a specified spatial area in advance;
the category prediction module 1302 is configured to perform feature extraction on the input face image via the neural network, and determine gazing area category prediction information of the face image according to the extracted features;
the difference determining module 1303 is configured to determine a difference between the obtained gaze area category prediction information and gaze area category labeling information of the corresponding image;
a network adjustment module 1304 for adjusting a network parameter of the neural network based on the difference.
In another embodiment, the sample input module 1301 is specifically configured to: before at least inputting a face image serving as a training sample into a neural network, cutting at least one eye region in the face image to obtain at least one eye image; and simultaneously inputting the face image and at least one eye image of the face image into the neural network.
In another embodiment, the sample input module 1301, when configured to input the face image and at least one eye image of the face image into the neural network at the same time, is specifically configured to: adjusting the face image and each of the at least one eye image of the face image to the same predetermined size; simultaneously inputting the images with the adjusted sizes into the neural network;
the category prediction module 1302 is specifically configured to: and simultaneously extracting the characteristics of each image in the input face image and at least one eye image through the neural network, and determining the gazing area category prediction information of the face image according to the extracted characteristics.
In another embodiment, the sample input module 1301, when configured to input the face image and at least one eye image of the face image into the neural network at the same time, is specifically configured to: correspondingly inputting the face image and the at least one eye image into different feature extraction branches included in the neural network, wherein the face image and the eye image which are input into the neural network are different in size;
the category prediction module 1302 is specifically configured to: extracting the features of the face image or the eye image input into each feature extraction branch through each feature extraction branch respectively; fusing the features of the face image and the features of the eye image extracted by each feature extraction branch to obtain fused features; and determining the gazing area category prediction information of the face image according to the fusion characteristics.
In another embodiment, the category prediction module 1302, when configured to determine gaze area category prediction information of the face image based on the extracted features, includes: respectively carrying out dot product operation on the characteristics extracted from the face image and the weights of all the categories to obtain an intermediate vector; the category weights respectively correspond to the multi-category defined gazing areas; the number of dimensions of the intermediate vector is the same as the number of categories of the multi-category defined gazing area; when the extracted characteristics are calculated by category weight dot products corresponding to the gazing area category label information of the face image, the cosine value of the vector included angle between the characteristics and the category weights is adjusted so as to increase the inter-category distance and reduce the intra-category distance; and determining the gazing area category prediction information of the face image according to the intermediate vector.
In another embodiment, the specified spatial region includes: the spatial region of the vehicle.
In another embodiment, the face image is determined based on images acquired for a driving region in a spatial region of the vehicle; the multi-class defined gazing area obtained by dividing the specified space area comprises the following two classes or more than two classes: a left front windshield area, a right front windshield area, an instrument panel area, an in-vehicle rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a light shielding plate area, a gear lever area, a steering wheel lower area, a copilot area, and a glove box area in front of copilot.
Fig. 14 provides a gaze area detection apparatus, as shown in fig. 14, which may include: an image acquisition module 1401, an image input module 1402, and a category prediction module 1403.
An image acquisition module 1401, configured to intercept a face area in an image acquired in a specified spatial area, to obtain a face image;
the image input module 1402 is configured to input the face image into a neural network, where the neural network performs training by using face image sets including gaze region category labeling information in advance, where a labeled gaze region category belongs to one of multiple types of defined gaze regions obtained by dividing the specified spatial region in advance;
the category prediction module 1403 is configured to perform feature extraction on the input face image via the neural network, and determine a gaze area detection category corresponding to the face image according to the extracted features.
In one embodiment, the image acquisition module 1401 is further configured to: cutting at least one eye region in the face image after cutting the face region in the image acquired in the appointed space region to obtain the face image, so as to obtain at least one eye image;
the image input module 1402 is specifically configured to: simultaneously inputting the face image and the at least one eye image of the face image into the neural network, wherein the neural network is completed by adopting a face image set comprising gazing area category labeling information in advance and eye image training intercepted based on each face image in the face image set;
The category prediction module 1403, when configured to perform feature extraction on the input face image via the neural network, includes: and extracting the characteristics of the input face image and at least one eye image through the neural network.
In one embodiment, the image input module 1402, when configured to input the face image and the at least one eye image into the neural network simultaneously, specifically includes: adjusting each image in the face image and the at least one eye image to the same preset size, and inputting each image with the adjusted size into the neural network at the same time;
the category prediction module 1403 is specifically configured to: simultaneously extracting the characteristics of the input face image and at least one eye image through the neural network, and determining the gazing region detection category corresponding to the face image according to the extracted characteristics.
In one embodiment, the image input module 1402, when configured to input the face image and the at least one eye image into the neural network simultaneously, specifically includes: correspondingly inputting the face image and the at least one eye image into different feature extraction branches included in the neural network, wherein the face image and the eye image which are input into the neural network are different in size;
The category prediction module 1403 is specifically configured to: extracting the features of the face image or the eye image input into each feature extraction branch through each feature extraction branch respectively; fusing the features of the face image and the features of the eye image extracted by each feature extraction branch to obtain fused features; and determining the detection category of the gazing area corresponding to the face image according to the fusion characteristic.
In one embodiment, the image acquisition module 1401 is further configured to: before capturing a face region in an image acquired in a specified space region, acquiring images acquired by a plurality of cameras deployed in the specified space region, wherein each image is an image in a specific sub-region in the specified space region acquired by the plurality of cameras from different angles; and respectively determining the image with the highest image quality score in the images respectively acquired by the cameras at the same time according to the image quality evaluation index, and taking the image as the image to be subjected to the interception processing.
In one embodiment, the image acquisition module 1401 is further configured to: before capturing a face region in an image acquired in a specified space region, acquiring images acquired by a plurality of cameras deployed in the specified space region, wherein each image is an image in a specific sub-region in the specified space region acquired by the plurality of cameras from different angles;
The category prediction module 1403 is specifically configured to: detecting the gazing area in the manner described in any one of the foregoing for each image of different angles acquired by each camera at the same moment to obtain a plurality of gazing area detection categories; and selecting, from the plurality of gaze region detection categories, the gaze region detection category corresponding to the image with the highest image quality score determined according to the image quality evaluation index as the gaze region detection category at that moment.
In one embodiment, the image quality evaluation index includes at least one of: whether an eye image, the definition of an eye area in the image, the shielding condition of the eye area in the image and the opening and closing condition of the eye area in the image are included in the image.
In one embodiment, the image acquisition module 1401 is further configured to: before capturing a face region in an image acquired in a specified space region, acquiring images acquired by a plurality of cameras deployed in the specified space region, wherein each image is an image in a specific sub-region in the specified space region acquired by the plurality of cameras from different angles;
The category prediction module 1403 is specifically configured to: detecting the gazing area in the manner described in any one of the foregoing for each image of different angles acquired by each camera at the same moment to obtain a plurality of gazing area detection categories; and determining the category that accounts for the majority of the plurality of gaze region detection categories as the gaze region detection category at that moment.
In one embodiment, the specified spatial region includes: the spatial region of the vehicle.
In one embodiment, the image acquired by the image acquisition module in a specified spatial region includes: an image acquired for a driving area in a spatial area of the vehicle; the multi-class defined gazing area obtained by dividing the specified space area comprises the following two classes or more than two classes: a left front windshield area, a right front windshield area, an instrument panel area, an in-vehicle rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a light shielding plate area, a gear lever area, a steering wheel lower area, a copilot area, and a glove box area in front of copilot.
In one embodiment, as shown in fig. 15, the apparatus may further include at least one of the following modules:
A first class application module 1404 for: after the category prediction module determines the gazing area detection category corresponding to the face image according to the extracted characteristics, determining the attention monitoring result of the person corresponding to the face image according to the gazing area category detection result; and outputting the attention monitoring result, and/or outputting distraction prompt information according to the attention monitoring result.
A second category application module 1405 for: after the category prediction module determines a gazing area detection category corresponding to the face image according to the extracted features, determining a control instruction corresponding to the gazing area category detection result according to the gazing area category detection result; and controlling the electronic equipment to execute the operation corresponding to the control instruction.
Fig. 16 is a training apparatus for a neural network for gaze area detection provided in at least one embodiment of the present disclosure, as shown in fig. 16, the apparatus may include a memory 1601, and a processor 1602, where the memory 1601 is configured to store computer instructions executable on the processor, and the processor 1602 is configured to implement the training method for the neural network for gaze area detection according to any embodiment of the present disclosure when the computer instructions are executed.
Fig. 17 is a gaze area detection apparatus provided in at least one embodiment of the present disclosure, as shown in fig. 17, the apparatus may include a memory 1701, and a processor 1702, where the memory 1701 is configured to store computer instructions executable on the processor, and the processor 1702 is configured to implement a gaze area detection method according to any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure further provides a computer readable storage medium, on which a computer program is stored, where the program when executed by a processor implements a training method for a neural network for gaze area detection as described in any of the present disclosure, and/or implements a gaze area detection method as described in any of the present disclosure.
One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The embodiments of the present specification also provide a computer-readable storage medium, on which a computer program may be stored, which when executed by a processor, implements the steps of the method for detecting a driver's gaze area described in any of the embodiments of the present specification, and/or implements the steps of the method for training a neural network of a driver's gaze area described in any of the embodiments of the present specification. Wherein the term "and/or" means at least one of the two, e.g., "a and/or B" includes three schemes: A. b, and "a and B".
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for data processing apparatus embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description is merely of preferred embodiments and is not intended to limit the present disclosure to the particular embodiments described.

Claims (29)

1. A training method of a neural network for gaze area detection, the method comprising:
inputting face images serving as training samples into a neural network at least, wherein the face images comprise gazing area category marking information corresponding to faces in the face images, and the marked gazing area categories belong to one of multiple types of defined gazing areas obtained by dividing specified space areas in advance;
Extracting features of the input face image through the neural network, and determining gazing area category prediction information of the face image according to the extracted features;
determining the difference between the obtained gazing area category prediction information and gazing area category labeling information of the corresponding image;
adjusting network parameters of the neural network based on the differences;
before the at least inputting the face image as the training sample into the neural network, the method further comprises: cutting at least one eye region in the face image to obtain at least one eye image;
the inputting of at least the face image as the training sample into the neural network comprises: simultaneously inputting the face image and the at least one eye image of the face image into the neural network;
the simultaneously inputting the face image and the at least one eye image of the face image into the neural network includes: correspondingly inputting the face image and the at least one eye image into different feature extraction branches included in the neural network, wherein the face image and the eye image which are input into the neural network are different in size;
The feature extraction is performed on the input face image through the neural network, and the gazing area category prediction information of the face image is determined according to the extracted features, including: extracting the features of the face image or the eye image input into each feature extraction branch through each feature extraction branch respectively; fusing the features of the face image and the features of the eye image extracted by each feature extraction branch to obtain fused features; and determining the gazing area category prediction information of the face image according to the fusion characteristics.
2. The method according to claim 1, wherein the feature extraction of the input face image via the neural network and the determination of gaze region category prediction information of the face image based on the extracted features, comprises:
respectively carrying out dot product operation on the characteristics extracted from the face image and the weights of all the categories to obtain an intermediate vector; the category weights respectively correspond to the multi-category defined gazing areas; the number of dimensions of the intermediate vector is the same as the number of categories of the multi-category defined gazing area; when the extracted characteristics are calculated by category weight dot products corresponding to the gazing area category label information of the face image, the cosine value of the vector included angle between the characteristics and the category weights is adjusted so as to increase the inter-category distance and reduce the intra-category distance;
And determining the gazing area category prediction information of the face image according to the intermediate vector.
3. A method according to claim 1 or 2, characterized in that,
the specified spatial region includes: the spatial region of the vehicle.
4. A method according to claim 3,
the face image is determined based on an image acquired for a driving region in a spatial region of the vehicle;
the multi-class defined gazing area obtained by dividing the specified space area comprises the following two classes or more than two classes: a left front windshield area, a right front windshield area, an instrument panel area, an in-vehicle rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a light shielding plate area, a gear lever area, a steering wheel lower area, a copilot area, and a glove box area in front of copilot.
5. A method of gaze region detection, the method comprising:
cropping a face region from an image acquired within a specified spatial region to obtain a face image;
inputting the face image into a neural network, wherein the neural network has been trained in advance with a face image set that includes gaze region category labeling information, and the labeled gaze region category belongs to one of multiple classes of defined gaze regions obtained by dividing the specified spatial region in advance;
performing feature extraction on the input face image through the neural network, and determining a gaze region detection category corresponding to the face image according to the extracted features;
wherein after the cropping a face region from the image acquired within the specified spatial region to obtain the face image, the method further comprises: cropping at least one eye region from the face image to obtain at least one eye image;
the inputting the face image into the neural network comprises: inputting the face image and the at least one eye image of the face image into the neural network simultaneously, wherein the neural network has been trained in advance with the face image set that includes gaze region category labeling information and with eye images cropped from the face images in the face image set;
the performing feature extraction on the input face image through the neural network comprises: performing feature extraction on the input face image and the at least one eye image through the neural network;
the inputting the face image and the at least one eye image into the neural network simultaneously comprises: inputting the face image and the at least one eye image into different feature extraction branches comprised in the neural network respectively, wherein the face image and the eye image input into the neural network differ in size;
and the performing feature extraction on the input face image and the at least one eye image through the neural network and determining the gaze region detection category corresponding to the face image according to the extracted features comprises: extracting, by each feature extraction branch, features of the face image or the eye image input into that branch; fusing the features of the face image and the features of the eye image extracted by the respective feature extraction branches to obtain fused features; and determining the gaze region detection category corresponding to the face image according to the fused features.
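A hedged sketch of the inference flow in claim 5, reusing the GazeRegionNet sketch given after claim 1; the crop boxes would come from a face and eye detector (not specified here), and the 224x224 and 64x64 branch input sizes are assumptions.

```python
import torch

def detect_gaze_region(frame: torch.Tensor, face_box, eye_box, net) -> int:
    """frame: (3, H, W) image tensor; face_box/eye_box: (top, left, height, width).
    Sketch of claim 5's inference flow; the box sources and input sizes are
    assumptions, not fixed by the claims."""
    def crop_resize(box, size):
        t, l, h, w = box
        patch = frame[:, t:t + h, l:l + w].unsqueeze(0)
        return torch.nn.functional.interpolate(
            patch, size=size, mode="bilinear", align_corners=False)

    face = crop_resize(face_box, (224, 224))   # face branch input
    eye = crop_resize(eye_box, (64, 64))       # eye branch input (different size)
    with torch.no_grad():
        scores = net(face, eye)
    return int(scores.argmax(dim=1))           # index of the detected gaze region
```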
6. The method according to claim 5, wherein before the cropping a face region from the image acquired within the specified spatial region, the method further comprises:
acquiring images captured by a plurality of cameras deployed in the specified spatial region, the images being images of a specific sub-region within the specified spatial region captured by the cameras from different angles;
and determining, according to an image quality evaluation index, the image with the highest image quality score among the images captured by the cameras at the same moment as the image to be subjected to the cropping.
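A minimal sketch of the frame selection in claim 6, assuming a caller-supplied quality_score callable that maps a frame to a scalar; one possible way to build such a score from the factors of claim 8 is sketched after that claim.

```python
def select_best_frame(frames, quality_score):
    """frames: images captured by the deployed cameras at the same moment.
    Returns the frame with the highest quality score, as in claim 6; the scoring
    callable itself is supplied by the caller."""
    return max(frames, key=quality_score)
```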
7. The method according to claim 5 or 6, wherein before the cropping a face region from the image acquired within the specified spatial region, the method further comprises: acquiring images captured by a plurality of cameras deployed in the specified spatial region, the images being images of a specific sub-region within the specified spatial region captured by the cameras from different angles;
the determining a gaze region detection category corresponding to the face image according to the extracted features comprises: performing gaze region detection according to the method of claim 5 on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze region detection categories; and selecting, from the plurality of gaze region detection categories, the gaze region detection category corresponding to the image with the highest image quality score determined according to an image quality evaluation index as the gaze region detection category at that moment.
8. The method according to claim 6 or 7, wherein the image quality evaluation index comprises at least one of: whether the image includes an eye region, the clarity of the eye region in the image, the degree of occlusion of the eye region in the image, and the open or closed state of the eyes in the image.
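For illustration only, one way to combine the claim 8 factors into a single score; how each factor is measured (for example, Laplacian variance as a clarity proxy) and the equal weighting are assumptions, not part of the patent.

```python
def quality_score(has_eye: bool, clarity: float, occlusion: float, openness: float) -> float:
    """Combine the claim 8 factors into one score in [0, 1]. The inputs are assumed
    to be precomputed measurements, each normalized to [0, 1]; the equal weighting
    is an illustrative choice."""
    if not has_eye:                      # no eye region makes the frame unusable
        return 0.0
    return (clarity                      # sharper eye region scores higher
            + (1.0 - occlusion)          # less occlusion scores higher
            + openness) / 3.0            # open eyes score higher
```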
9. The method according to claim 5, wherein before the cropping the face region image and the at least one eye region image from the image acquired within the specified spatial region, the method further comprises: acquiring images captured by a plurality of cameras deployed in the specified spatial region, the images being images of a specific sub-region within the specified spatial region captured by the cameras from different angles;
the determining a gaze region detection category corresponding to the face image according to the extracted features comprises: performing gaze region detection according to the method of claim 5 on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze region detection categories; and determining the majority category among the plurality of gaze region detection categories as the gaze region detection category at that moment.
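A minimal sketch of the majority decision in claim 9; the tie-breaking behavior is not specified by the claim and is arbitrary here.

```python
from collections import Counter

def majority_gaze_region(categories):
    """categories: gaze region detection categories obtained from the different
    camera views at the same moment. Returns the most frequent category; ties
    are broken arbitrarily, since the claim does not fix a tie-breaking rule."""
    return Counter(categories).most_common(1)[0][0]

# e.g. majority_gaze_region([3, 3, 7]) -> 3
```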
10. The method according to any one of claims 5 to 9, wherein
the specified spatial region comprises: a spatial region of a vehicle.
11. The method according to claim 10, wherein
the image acquired within the specified spatial region comprises: an image acquired of the driving region in the spatial region of the vehicle;
the multiple classes of defined gaze regions obtained by dividing the specified spatial region in advance comprise two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear shift lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
12. The method according to any one of claims 5 to 11, wherein after the determining a gaze region detection category corresponding to the face image according to the extracted features, the method further comprises:
determining an attention monitoring result of the person corresponding to the face image according to the gaze region category detection result;
and outputting the attention monitoring result and/or outputting distraction prompt information according to the attention monitoring result.
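A hedged sketch of how a stream of per-frame detection categories could be turned into the attention monitoring result of claim 12; the sliding window, the set of regions counted as attentive, and the distraction threshold are all assumptions.

```python
from collections import deque

# Regions treated as "eyes on the road"; an illustrative assumption, e.g. windshield
# and mirror classes from the enum sketched after claim 4.
ATTENTIVE_REGIONS = {0, 1, 2, 3, 5, 6}

class AttentionMonitor:
    """Flags distraction when the recent fraction of non-attentive gaze regions
    exceeds a threshold. Window length and threshold are illustrative values."""
    def __init__(self, window: int = 30, threshold: float = 0.6):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, region: int) -> str:
        self.history.append(region)
        off_road = sum(r not in ATTENTIVE_REGIONS for r in self.history)
        distracted = off_road / len(self.history) > self.threshold
        return "distracted" if distracted else "attentive"
```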
13. The method according to any one of claims 5 to 11, wherein after the determining a gaze region detection category corresponding to the face image according to the extracted features, the method further comprises:
determining a control instruction corresponding to the gaze region category detection result according to the gaze region category detection result;
and controlling an electronic device to execute an operation corresponding to the control instruction.
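A hedged sketch of the gaze-driven control in claim 13; both the region-to-instruction table and the device.execute interface are hypothetical placeholders that the patent does not prescribe.

```python
# Hypothetical region -> instruction table; the mapping is an assumption.
CONTROL_MAP = {
    3: "highlight_rearview_mirror_display",
    4: "wake_center_console_screen",
    2: "brighten_instrument_panel",
}

def dispatch_control(region: int, device) -> None:
    """Send the instruction (if any) mapped to the detected gaze region to an
    electronic device; device.execute is a placeholder interface."""
    instruction = CONTROL_MAP.get(region)
    if instruction is not None:
        device.execute(instruction)
```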
14. A training device for a neural network for gaze region detection, the device comprising:
a sample input module, configured to input at least a face image serving as a training sample into a neural network, wherein the face image has gaze region category labeling information corresponding to the face in the face image, and the labeled gaze region category belongs to one of multiple classes of defined gaze regions obtained by dividing a specified spatial region in advance;
a category prediction module, configured to perform feature extraction on the input face image through the neural network and determine gaze region category prediction information of the face image according to the extracted features;
a difference determination module, configured to determine the difference between the obtained gaze region category prediction information and the gaze region category labeling information of the corresponding image;
a network adjustment module, configured to adjust network parameters of the neural network based on the difference;
and a sample processing module, configured to: before the at least the face image serving as the training sample is input into the neural network, crop at least one eye region from the face image to obtain at least one eye image, and input the face image and the at least one eye image of the face image into the neural network simultaneously;
wherein the sample processing module, when inputting the face image and the at least one eye image of the face image into the neural network simultaneously, is specifically configured to: input the face image and the at least one eye image into different feature extraction branches comprised in the neural network respectively, wherein the face image and the eye image input into the neural network differ in size;
and the category prediction module is specifically configured to: extract, by each feature extraction branch, features of the face image or the eye image input into that branch; fuse the features of the face image and the features of the eye image extracted by the respective feature extraction branches to obtain fused features; and determine the gaze region category prediction information of the face image according to the fused features.
15. The apparatus according to claim 14, wherein
the category prediction module, when determining the gaze region category prediction information of the face image according to the extracted features, is configured to: perform a dot product between the features extracted from the face image and each of the category weights to obtain an intermediate vector, wherein the category weights correspond one-to-one to the multiple classes of defined gaze regions, and the dimension of the intermediate vector equals the number of classes of defined gaze regions; when the dot product between the extracted features and the category weight corresponding to the gaze region category labeling information of the face image is computed, adjust the cosine of the angle between the features and that category weight so as to increase the inter-class distance and reduce the intra-class distance; and determine the gaze region category prediction information of the face image according to the intermediate vector.
16. The apparatus according to claim 14 or 15, wherein the specified spatial region comprises: a spatial region of a vehicle.
17. The apparatus according to claim 16, wherein the face image is determined based on an image acquired of the driving region in the spatial region of the vehicle; and the multiple classes of defined gaze regions obtained by dividing the specified spatial region in advance comprise two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear shift lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
18. A gaze region detection apparatus, the apparatus comprising:
an image acquisition module, configured to crop a face region from an image acquired within a specified spatial region to obtain a face image;
an image input module, configured to input the face image into a neural network, wherein the neural network has been trained in advance with a face image set that includes gaze region category labeling information, and the labeled gaze region category belongs to one of multiple classes of defined gaze regions obtained by dividing the specified spatial region in advance;
a category prediction module, configured to perform feature extraction on the input face image through the neural network and determine a gaze region detection category corresponding to the face image according to the extracted features;
wherein the image acquisition module is further configured to: after cropping the face region from the image acquired within the specified spatial region to obtain the face image, crop at least one eye region from the face image to obtain at least one eye image;
the image input module is specifically configured to: input the face image and the at least one eye image of the face image into the neural network simultaneously, wherein the neural network has been trained in advance with the face image set that includes gaze region category labeling information and with eye images cropped from the face images in the face image set;
the category prediction module, when performing feature extraction on the input face image through the neural network, is configured to: perform feature extraction on the input face image and the at least one eye image through the neural network;
the image input module, when inputting the face image and the at least one eye image into the neural network simultaneously, is specifically configured to: input the face image and the at least one eye image into different feature extraction branches comprised in the neural network respectively, wherein the face image and the eye image input into the neural network differ in size;
and the category prediction module is specifically configured to: extract, by each feature extraction branch, features of the face image or the eye image input into that branch; fuse the features of the face image and the features of the eye image extracted by the respective feature extraction branches to obtain fused features; and determine the gaze region detection category corresponding to the face image according to the fused features.
19. The apparatus according to claim 18, wherein
the image acquisition module is further configured to: before cropping the face region from the image acquired within the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, each image being an image of a specific sub-region within the specified spatial region captured by the cameras from different angles; and determine, according to an image quality evaluation index, the image with the highest image quality score among the images captured by the cameras at the same moment as the image to be subjected to the cropping.
20. The apparatus according to claim 18, wherein
the image acquisition module is further configured to: before cropping the face region from the image acquired within the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, each image being an image of a specific sub-region within the specified spatial region captured by the cameras from different angles;
and the category prediction module is specifically configured to: perform gaze region detection according to the method of claim 5 on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze region detection categories; and select, from the plurality of gaze region detection categories, the gaze region detection category corresponding to the image with the highest image quality score determined according to the image quality evaluation index as the gaze region detection category at that moment.
21. The apparatus according to claim 19 or 20, wherein the image quality evaluation index comprises at least one of: whether the image includes an eye region, the clarity of the eye region in the image, the degree of occlusion of the eye region in the image, and the open or closed state of the eyes in the image.
22. The apparatus according to claim 18, wherein
the image acquisition module is further configured to: before cropping the face region from the image acquired within the specified spatial region, acquire images captured by a plurality of cameras deployed in the specified spatial region, each image being an image of a specific sub-region within the specified spatial region captured by the cameras from different angles;
and the category prediction module is specifically configured to: perform gaze region detection according to the method of claim 5 on each of the images captured from different angles by the cameras at the same moment to obtain a plurality of gaze region detection categories; and determine the majority category among the plurality of gaze region detection categories as the gaze region detection category at that moment.
23. The apparatus according to any one of claims 18 to 22, wherein the specified spatial region comprises: a spatial region of a vehicle.
24. The apparatus according to claim 23, wherein
the image acquired within the specified spatial region from which the image acquisition module crops the face region comprises: an image acquired of the driving region in the spatial region of the vehicle; and the multiple classes of defined gaze regions obtained by dividing the specified spatial region in advance comprise two or more of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a sun visor area, a gear shift lever area, an area below the steering wheel, a front passenger seat area, and a glove box area in front of the front passenger seat.
25. The apparatus according to any one of claims 18 to 24, further comprising:
a first-class application module configured to: after the category prediction module determines the gaze region detection category corresponding to the face image according to the extracted features, determine an attention monitoring result of the person corresponding to the face image according to the gaze region category detection result; and output the attention monitoring result and/or output distraction prompt information according to the attention monitoring result.
26. The apparatus according to any one of claims 18 to 24, further comprising:
a second-class application module configured to: after the category prediction module determines the gaze region detection category corresponding to the face image according to the extracted features, determine a control instruction corresponding to the gaze region category detection result according to the gaze region category detection result; and control an electronic device to execute an operation corresponding to the control instruction.
27. A training device for a neural network for gaze region detection, the device comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any one of claims 1 to 4 when executing the computer instructions.
28. A gaze region detection device, the device comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any one of claims 5 to 13 when executing the computer instructions.
29. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 4 and/or the method of any one of claims 5 to 13.
CN201910204566.9A 2019-03-18 2019-03-18 Gaze area detection and neural network training method, device and equipment Active CN111723596B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201910204566.9A CN111723596B (en) 2019-03-18 2019-03-18 Gaze area detection and neural network training method, device and equipment
PCT/CN2019/129893 WO2020186883A1 (en) 2019-03-18 2019-12-30 Methods, devices and apparatuses for gaze area detection and neural network training
KR1020217022190A KR20210102413A (en) 2019-03-18 2019-12-30 Gaze area detection method and neural network training method, apparatus and device
JP2021540840A JP7252348B2 (en) 2019-03-18 2019-12-30 Gaze area detection method and neural network training method, apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910204566.9A CN111723596B (en) 2019-03-18 2019-03-18 Gaze area detection and neural network training method, device and equipment

Publications (2)

Publication Number Publication Date
CN111723596A CN111723596A (en) 2020-09-29
CN111723596B true CN111723596B (en) 2024-03-22

Family

ID=72518968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910204566.9A Active CN111723596B (en) 2019-03-18 2019-03-18 Gaze area detection and neural network training method, device and equipment

Country Status (4)

Country Link
JP (1) JP7252348B2 (en)
KR (1) KR20210102413A (en)
CN (1) CN111723596B (en)
WO (1) WO2020186883A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113842B2 (en) * 2018-12-24 2021-09-07 Samsung Electronics Co., Ltd. Method and apparatus with gaze estimation
CN112656431A (en) * 2020-12-15 2021-04-16 中国科学院深圳先进技术研究院 Electroencephalogram-based attention recognition method and device, terminal equipment and storage medium
CN112541436B (en) * 2020-12-15 2024-05-07 平安科技(深圳)有限公司 Concentration analysis method and device, electronic equipment and computer storage medium
CN112560783A (en) * 2020-12-25 2021-03-26 京东数字科技控股股份有限公司 Methods, apparatus, systems, media and products for assessing a state of interest
CN113065997B (en) * 2021-02-27 2023-11-17 华为技术有限公司 Image processing method, neural network training method and related equipment
CN113052064B (en) * 2021-03-23 2024-04-02 北京思图场景数据科技服务有限公司 Attention detection method based on face orientation, facial expression and pupil tracking
CN113283340B (en) * 2021-05-25 2022-06-14 复旦大学 Method, device and system for detecting vaccination condition based on ocular surface characteristics
CN113391699B (en) * 2021-06-10 2022-06-21 昆明理工大学 Eye potential interaction model method based on dynamic eye movement index
CN113900519A (en) * 2021-09-30 2022-01-07 Oppo广东移动通信有限公司 Method and device for acquiring fixation point and electronic equipment
KR20230054982A (en) * 2021-10-18 2023-04-25 삼성전자주식회사 Electronic apparatus and control method thereof
CN116048244B (en) * 2022-07-29 2023-10-20 荣耀终端有限公司 Gaze point estimation method and related equipment
CN116030512B (en) * 2022-08-04 2023-10-31 荣耀终端有限公司 Gaze point detection method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370580A1 (en) * 2017-03-14 2019-12-05 Omron Corporation Driver monitoring apparatus, driver monitoring method, learning apparatus, and learning method
CN107697069B (en) * 2017-10-31 2020-07-28 上海汽车集团股份有限公司 Intelligent control method for fatigue driving of automobile driver
CN109446892B (en) * 2018-09-14 2023-03-24 杭州宇泛智能科技有限公司 Human eye attention positioning method and system based on deep neural network
CN109460780A (en) * 2018-10-17 2019-03-12 深兰科技(上海)有限公司 Safe driving of vehicle detection method, device and the storage medium of artificial neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407935A (en) * 2016-09-21 2017-02-15 俞大海 Psychological test method based on face images and eye movement fixation information
CN108229284A (en) * 2017-05-26 2018-06-29 北京市商汤科技开发有限公司 Eye-controlling focus and training method and device, system, electronic equipment and storage medium
CN107590482A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 information generating method and device
CN109002753A (en) * 2018-06-01 2018-12-14 上海大学 One kind being based on the cascade large scene monitoring image method for detecting human face of convolutional neural networks
CN108985181A (en) * 2018-06-22 2018-12-11 华中科技大学 A kind of end-to-end face mask method based on detection segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gong Ping. "Calibration-free driver gaze region estimation based on a BP neural network". CNKI Master's Thesis. 2018, pp. 1-66. *

Also Published As

Publication number Publication date
WO2020186883A1 (en) 2020-09-24
JP7252348B2 (en) 2023-04-04
JP2022517121A (en) 2022-03-04
KR20210102413A (en) 2021-08-19
CN111723596A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723596B (en) Gaze area detection and neural network training method, device and equipment
Martin et al. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles
US20230154207A1 (en) Driver fatigue detection method and system based on combining a pseudo-3d convolutional neural network and an attention mechanism
US20210357670A1 (en) Driver Attention Detection Method
CN110167823B (en) System and method for driver monitoring
US9881221B2 (en) Method and system for estimating gaze direction of vehicle drivers
US9180887B2 (en) Driver identification based on face data
CN110678873A (en) Attention detection method based on cascade neural network, computer device and computer readable storage medium
US20220058407A1 (en) Neural Network For Head Pose And Gaze Estimation Using Photorealistic Synthetic Data
WO2008084020A1 (en) Warning a vehicle operator of unsafe operation behavior based on a 3d captured image stream
Paone et al. Baseline face detection, head pose estimation, and coarse direction detection for facial data in the SHRP2 naturalistic driving study
Jha et al. Probabilistic estimation of the driver's gaze from head orientation and position
CN115690750A (en) Driver distraction detection method and device
CN103942527A (en) Method for determining eye-off-the-road condition by using road classifier
Shirpour et al. A probabilistic model for visual driver gaze approximation from head pose estimation
Mizuno et al. Detecting driver's visual attention area by using vehicle-mounted device
CN116012822B (en) Fatigue driving identification method and device and electronic equipment
CN112348718B (en) Intelligent auxiliary driving guiding method, intelligent auxiliary driving guiding device and computer storage medium
Zhang et al. Monitoring of dangerous driving behavior of drivers based on deep learning.
JP2021009503A (en) Personal data acquisition system, personal data acquisition method, face sensing parameter adjustment method for image processing device and computer program
Jia An Analysis of Driver Cognitive Distraction
WO2020261820A1 (en) Image processing device, monitoring device, control system, image processing method, and program
CN113378972B (en) License plate recognition method and system under complex scene
Epple et al. How Do Drivers Observe Surrounding Vehicles in Real-World Traffic? Estimating the Drivers Primary Observed Traffic Objects
Fiani et al. Keeping Eyes on the Road: Understanding Driver Attention and Its Role in Safe Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant