CN113221835A - Scene classification method, device, equipment and storage medium for face-check video - Google Patents

Scene classification method, device, equipment and storage medium for face-check video

Info

Publication number
CN113221835A
Authority
CN
China
Prior art keywords
image
video
scene
portrait
face
Prior art date
Legal status
Granted
Application number
CN202110610391.9A
Other languages
Chinese (zh)
Other versions
CN113221835B (en)
Inventor
潘浩
庄伯金
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110610391.9A priority Critical patent/CN113221835B/en
Publication of CN113221835A publication Critical patent/CN113221835A/en
Application granted granted Critical
Publication of CN113221835B publication Critical patent/CN113221835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a scene classification method, device, equipment and storage medium for face-examination videos. At least one video image is obtained from the face-examination video; the at least one video image is processed to obtain an identification image for scene classification; image features are extracted from the identification image to obtain a feature map of the identification image; the identification image is processed to obtain a portrait suppression map of the identification image, which is itself a feature map whose average value over the portrait area of the identification image is not more than a and whose average value over the non-portrait area is more than a, where 0 < a < 1; the feature map of the identification image is multiplied by the portrait suppression map to obtain a scene feature map; and the scene feature map is classified by a scene classification model to obtain the scene type of the face-examination video, which improves the classification efficiency of face-examination videos.

Description

Scene classification method, device, equipment and storage medium for face-check video
Technical Field
The present application relates to the field of classification algorithm technologies, and in particular, to a scene classification method, apparatus, device, and storage medium for a review video.
Background
Scene classification is an important topic in the field of visual applications: it determines the environmental category of an input image or video, such as a classroom, an airport, or a library. Depending on the type of input data, the task can be divided into image scene classification and video scene classification. For image scene classification, relatively high recognition accuracy can be achieved with mature two-dimensional (2D) convolutional neural networks (CNNs); for video tasks, methods such as three-dimensional (3D) convolutional neural networks, two-stream convolutional neural networks, and recurrent neural networks can be used.
In general scene classification, scene information occupies most of the image, which makes it easy for a convolutional neural network to perform well. In a face-examination video, however, the human figure occupies most of the frame and the scene information only a small part. A face-examination video is a face-to-face review video generated during credit approval, interviews, or similar scenarios. To classify the scene using only the information in such a video, a three-dimensional convolutional neural network is usually used at present for video feature extraction and classification, but the result is easily influenced by the human figures in the video and the workload of extracting video features is large, so face-examination videos cannot be classified efficiently and the classification efficiency is low.
Disclosure of Invention
The embodiment of the application provides a scene classification method, a scene classification device, scene classification equipment and a storage medium for face examination videos, and aims to solve the problem that the face examination video classification efficiency is low in the prior art.
In a first aspect, an embodiment of the present application provides a scene classification method for a review video, including: acquiring at least one video image from a face-examination video; processing the at least one video image to obtain an identification image for scene classification; carrying out image feature extraction on the identification image to obtain a feature map of the identification image; performing image feature processing on the identification image, controlling the average feature value of a portrait area in the identification image to be (0, a), and controlling the average feature value of a non-portrait area in the identification image to be [ a, 1), so as to obtain a portrait suppression map of the identification image; multiplying the feature map of the identification image with the portrait suppression map to obtain a scene feature map; and classifying the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video.
In an optional implementation manner, the obtaining at least one video image from the review video includes: capturing a plurality of video images from the face-examination video according to a video capture period; the video capture period is determined according to the duration of the face-examination video, and the number of the video images satisfies: (N + 1) × W > Y, where N is the number of video images, W is the video capture period, Y is the duration of the face-examination video, and N is an integer greater than 1.
In an optional implementation manner, the processing the at least one video image to obtain an identification image for scene classification includes: and splicing the N video images in a tiled mode to obtain the identification image for scene classification, wherein the N video images have no overlapped pixels in the identification image, and the interval between the adjacent video images is not more than 1 pixel.
In an optional implementation manner, the obtaining the portrait suppression map of the recognition image by performing image feature processing on the recognition image, controlling an average feature value of a portrait area in the recognition image to be (0, a), and controlling an average feature value of a non-portrait area in the recognition image to be [a, 1), includes: inputting the identification image into a portrait inhibitor to obtain a portrait inhibition map; the portrait suppressor is a trained two-dimensional convolutional neural network, and is used for recognizing a portrait area and a non-portrait area of the recognition image, converting the portrait area of the recognition image into a characteristic value b, and converting the non-portrait area of the recognition image into a characteristic value c, wherein 0 ≤ b < a ≤ 1.
In an optional implementation manner, before acquiring at least one video image from the review video, the method further includes: training the portrait inhibitor by using a face-examination image in a database and a portrait mask corresponding to the face-examination image, wherein in the portrait mask the characteristic value corresponding to the portrait area of the face-examination image is b and the characteristic value corresponding to the non-portrait area is c; and adjusting the loss function of the portrait inhibitor until the similarity between the characteristic graph obtained by inputting the face-examination image into the portrait inhibitor and the portrait mask is greater than a threshold value.
In an optional implementation manner, the scene classification model includes a feature extractor, a pooling layer, and a full-connection layer, and the feature extractor is a two-dimensional convolutional neural network for extracting scene features.
In an optional implementation manner, the classifying the scene feature map by a scene classification model to obtain a scene type of the review video includes: extracting, by the feature extractor, the scene feature from the scene feature map; sampling and compressing the scene features through the pooling layer to obtain dimension-reduced scene features; inputting the dimension reduction scene features to the full-connection layer to obtain a prediction vector of the scene features; and obtaining the scene type of the face-up video according to the preset corresponding relation between the prediction vector and the scene type.
In a second aspect, an embodiment of the present application provides a scene classification device for a review video, which includes: the acquisition unit is used for acquiring at least one video image from the face examination video; the first processing unit is used for processing the at least one video image to obtain an identification image for scene classification; the extraction unit is used for carrying out image feature extraction on the identification image to obtain a feature map of the identification image; a second processing unit, configured to perform image feature processing on the identification image, control an average feature value of a portrait area in the identification image to be (0, a), and control an average feature value of a non-portrait area in the identification image to be [ a, 1), so as to obtain a portrait suppression map of the identification image; the multiplication unit is used for multiplying the feature map of the identification image with the portrait suppression map to obtain a scene feature map; and the classification unit is used for classifying the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video.
In an optional implementation manner, the obtaining unit is specifically configured to capture a plurality of video images from the review video according to a video capture period; the video capture period is determined according to the duration of the face-examination video, and the number of the video images satisfies: (N + 1) × W > Y, where N is the number of video images, W is the video capture period, Y is the duration of the face-examination video, and N is an integer greater than 1.
In an optional implementation manner, the first processing unit is specifically configured to tile the N video images to obtain the identification image for scene classification, where the N video images do not have overlapping pixels in the identification image, and an interval between adjacent video images is not more than 1 pixel.
In an optional implementation manner, the second processing unit is specifically configured to input the identification image to a portrait inhibitor, so as to obtain the portrait inhibition map; the portrait suppressor is a trained two-dimensional convolutional neural network, and is used for recognizing a portrait area and a non-portrait area of the recognition image, converting the portrait area of the recognition image into a characteristic value b, and converting the non-portrait area of the recognition image into a characteristic value c, wherein 0 ≤ b < a ≤ 1.
In an optional implementation, the apparatus further includes: a training unit, used for training the portrait inhibitor by using a face-examination image in a database and a portrait mask corresponding to the face-examination image, wherein in the portrait mask the characteristic value corresponding to the portrait area of the face-examination image is b and the characteristic value corresponding to the non-portrait area is c; and for adjusting the loss function of the portrait inhibitor until the similarity between the characteristic graph obtained by inputting the face-examination image into the portrait inhibitor and the portrait mask is greater than a threshold value.
In an optional implementation manner, the scene classification model includes a feature extractor, a pooling layer, and a full-connection layer, and the feature extractor is a two-dimensional convolutional neural network for extracting scene features.
In an optional implementation manner, the classification unit is specifically configured to: extracting, by the feature extractor, the scene feature from the scene feature map; sampling and compressing the scene features through the pooling layer to obtain dimension-reduced scene features; inputting the dimension reduction scene features to the full-connection layer to obtain a prediction vector of the scene features; and obtaining the scene type of the face-up video according to the preset corresponding relation between the prediction vector and the scene type.
In a third aspect, an embodiment of the present application further provides an apparatus, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the scene classification method for a review video according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the scene classification method for a review video according to the first aspect.
The embodiment of the application provides a scene classification method, a device, equipment and a storage medium of a face examination video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a scene classification method for a review video in an embodiment of the present application;
fig. 2 is a schematic flowchart of a scene classification method for a review video according to an embodiment of the present application;
fig. 3A is a schematic flowchart of another scene classification method for a review video according to an embodiment of the present disclosure;
fig. 3B is a schematic view illustrating a scene classification process of a review video according to an embodiment of the present application;
fig. 4 is a schematic block diagram of a scene classification device for a review video provided in an embodiment of the present application;
fig. 5 is a schematic block diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a scene classification method for a review video according to an embodiment of the present application; fig. 2 is a schematic flowchart of a scene classification method for a review video according to an embodiment of the present application, where the scene classification method for a review video is applied to a server, and the method is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S106.
S101, acquiring at least one video image from a face-examination video.
The scene classification device acquires at least one video image from the face-up video.
It can be understood that the face-examination video is a video that requires scene classification. Compared with other videos, a face-examination video has a much larger proportion of portrait area, which makes the scene in it harder to identify. On the other hand, the scene in a face-examination video is unlikely to change, so the scene has no correlation with the time sequence. The scene classification device obtains at least one video image from the face-examination video and classifies the face-examination video through these video images, which helps simplify the complexity of classifying face-examination videos.
Optionally, the scene classification device obtains at least one video image from the face-examination video, which specifically includes: the scene classification device captures a plurality of video images from the face-examination video according to a video capture period; the video capture period is determined according to the duration of the face-examination video, and the number of the video images satisfies: (N + 1) × W > Y, where N is the number of video images, W is the video capture period, Y is the duration of the face-examination video, and N is an integer greater than 1. In this implementation, the scene classification device classifies the face-examination video through two or more video images, which helps acquire more scene features.
Specifically, for example, the scene classification device determines the value of N, determines the video capture period according to the duration of the face-examination video, and then captures the video images from the face-examination video at that period. For example, if N is 4 and the duration of the face-examination video is 20 minutes, the scene classification device determines that the video capture period is greater than 4 minutes and less than or equal to 5 minutes. This implementation is a simple and convenient way to acquire a plurality of video images.
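As a concrete illustration of this sampling step, the following sketch reads a face-examination video with OpenCV and captures N frames at a fixed period W = Y / N, which satisfies the condition (N + 1) × W > Y. OpenCV and the helper name sample_frames are assumptions added for illustration; the patent does not prescribe any particular library.

```python
import cv2  # assumption: OpenCV is used to decode the video; the patent names no library


def sample_frames(video_path, n=4):
    """Capture n video images from a face-examination video at a fixed period W = Y / n,
    so that (n + 1) * W > Y, where Y is the video duration."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    duration = frame_count / fps           # Y, in seconds
    period = duration / n                  # W, the video capture period

    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_MSEC, i * period * 1000)  # jump to 0, W, 2W, ...
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```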
S102, processing at least one video image to obtain an identification image for scene classification.
The scene classification device processes at least one video image to obtain an identification image for scene classification. It should be noted that, in the case of only one video image, the scene classification apparatus performs preprocessing, such as noise reduction, on the video image, so as to more effectively extract the image features of the video image. And under the condition that at least two video images exist, the scene classification device splices the at least two video images, or splices and denoises the at least two video images to obtain the identification image.
Optionally, under the condition that the scene classification device intercepts N video images from the face-up video, the scene classification device splices the N video images in a tiled manner to obtain the identification image. In the identification image, the N video images have no overlapped pixels, and the interval between the N video images does not exceed 1 pixel, so that the N video images have no overlapping and no gap. The method can keep the image characteristics of the video images in the splicing process, reduces the increase of redundant characteristics of the video images in the splicing process, and is beneficial to improving the accuracy of scene classification.
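A minimal sketch of the splicing step, assuming N = 4 frames of identical size arranged in a 2 × 2 grid with no overlapping pixels and no gap (the grid layout is an illustrative choice; the patent only requires that the frames do not overlap and that adjacent frames are at most 1 pixel apart):

```python
import numpy as np


def tile_frames(frames):
    """Splice four equally sized H x W x C video images into one 2 x 2
    identification image with no overlap and no gap between adjacent images."""
    assert len(frames) == 4
    top = np.concatenate([frames[0], frames[1]], axis=1)      # left-right, side by side
    bottom = np.concatenate([frames[2], frames[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)               # top-bottom, stacked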
And S103, carrying out image feature extraction on the identification image to obtain a feature map of the identification image.
And the scene classification device extracts the image characteristics of the identification image to obtain a characteristic image of the identification image. The image feature extraction is used for converting the image into data, performing feature mapping on pixels in the image, and representing the brightness of each area in the original image by using pixel values from 0 to 1 to obtain a two-dimensional pixel matrix, which is also called a feature map. Here, the pixel value "1" represents white, and the pixel value "0" represents black. The feature map of the recognition image includes features of the portrait area and features of the non-portrait area.
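Taken literally, this step maps the identification image to a two-dimensional matrix of values in [0, 1] in which 1 is white and 0 is black. A minimal sketch of such a mapping is given below; in practice the feature map could equally be produced by a convolutional backbone, which the patent leaves open.

```python
import numpy as np


def brightness_feature_map(identification_image):
    """Map an H x W x 3 uint8 identification image to an H x W matrix in [0, 1],
    where 1 represents white and 0 represents black."""
    gray = identification_image.astype(np.float32).mean(axis=2)  # average the colour channels
    return gray / 255.0
```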
S104, performing image feature processing on the recognition image, controlling the average feature value of the portrait area in the recognition image to be (0, a), and controlling the average feature value of the non-portrait area in the recognition image to be [ a, 1), so as to obtain a portrait suppression map of the recognition image.
The scene classification device performs image feature processing on the recognition image, controls the average feature value of the portrait area in the recognition image to be (0, a), and controls the average feature value of the non-portrait area in the recognition image to be [ a, 1), thereby obtaining a portrait suppression map of the recognition image.
Specifically, the scene classification device inputs the recognition image into a portrait inhibitor to obtain a portrait inhibition map. The portrait inhibitor is a trained two-dimensional convolutional neural network that recognizes the portrait area and the non-portrait area of the recognition image, converts the portrait area of the recognition image into a characteristic value b, and converts the non-portrait area of the recognition image into a characteristic value c, where 0 ≤ b < a ≤ 1.
In an alternative implementation, b is 0 and c is 1. This implementation helps remove the portrait features in the identification image through the portrait suppression map while retaining the non-portrait features of the identification image. It can be understood that multiplying a feature value by 1 leaves the original image feature unchanged, while multiplying it by 0 sets it to 0; the feature values of the portrait area are therefore all converted to 0, which eliminates the portrait features.
And S105, multiplying the feature map of the identification image with the portrait suppression map to obtain a scene feature map.
The scene classification device multiplies the feature map of the identification image by the portrait suppression map to obtain a scene feature map. It can be understood that, in the scene feature map obtained by this multiplication, the feature values in the portrait area are reduced by a larger proportion than those in the non-portrait area. If the scene feature map were converted into a color image, the brightness of the portrait area would be reduced compared with the identification image. Because the main noise for scene classification in the identification image is the portrait features, the non-portrait features in the scene feature map are highlighted; these non-portrait features include the scene features, while the portrait features are attenuated, which facilitates the classification of the face-examination video.
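A sketch of this suppression step for the b = 0, c = 1 case: the portrait suppression map acts as a per-pixel mask that is multiplied element-wise with the feature map, zeroing the portrait area and leaving the non-portrait (scene) area untouched. In the method the suppression map is predicted by the trained portrait suppressor; the rectangular stand-in region below is only there so the example is self-contained.

```python
import numpy as np

# Feature map of the identification image (values in [0, 1]); the 64 x 64 size is illustrative.
feature_map = np.random.rand(64, 64)

# Illustrative portrait suppression map: 0 inside a stand-in portrait region, 1 elsewhere.
# In the method this map is produced by the trained portrait suppressor, not by a box.
suppression_map = np.ones_like(feature_map)
suppression_map[16:48, 20:44] = 0.0

# Element-wise multiplication: portrait features are removed, scene features are kept.
scene_feature_map = feature_map * suppression_map
```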
And S106, classifying the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video.
The scene classification device classifies the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video, and the scene classification model is used for predicting the scene type of the scene characteristic graph. The scene feature map is obtained by inhibiting the portrait features, and the scene feature map is predicted by using the scene classification model, so that the accuracy of prediction is improved.
The scene classification model comprises a feature extractor, a pooling layer and a full-connection layer, wherein the feature extractor is a two-dimensional convolution neural network. The implementation method does not need to use a three-dimensional convolutional neural network, and the complexity of video classification is reduced.
Specifically, the scene classification device performs feature extraction on the scene feature map through the feature extractor to obtain the scene features. Because the scene feature map comprises all features of the non-portrait area and portrait features which cannot be completely removed, the scene classification device extracts the scene features from the scene feature map through the feature extractor, and the accuracy of scene classification is further improved. And the scene classification device performs sampling compression on the scene features through the pooling layer to obtain the dimension-reduced scene features. The pooling layer is used for simplifying the calculation complexity and reducing the parameters and the calculation amount of the full connection layer. And the scene classification device inputs the dimensionality reduction scene features to the full-connection layer to obtain the prediction vector of the scene features. The prediction vectors correspond to preset scene types, each of which represents a scene type, for example, "1" represents an airport and "8" represents a library. And finally, the scene classification device obtains the scene type of the face-up video according to the preset corresponding relation between the prediction vector and the scene type.
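A minimal sketch of such a scene classification model in PyTorch follows. The framework, the layer widths, and the number of scene types are illustrative assumptions; the patent only specifies a two-dimensional convolutional feature extractor, a pooling layer, and a full-connection layer that outputs a prediction vector mapped to a preset scene type.

```python
import torch
import torch.nn as nn


class SceneClassifier(nn.Module):
    """2D convolutional feature extractor + pooling layer + full-connection layer."""

    def __init__(self, in_channels=1, num_scene_types=10):
        super().__init__()
        self.feature_extractor = nn.Sequential(               # extracts scene features
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)                    # sampling/compression (dimension reduction)
        self.fc = nn.Linear(64, num_scene_types)               # full-connection layer

    def forward(self, scene_feature_map):
        x = self.feature_extractor(scene_feature_map)
        x = self.pool(x).flatten(1)
        return self.fc(x)                                      # prediction vector


# The predicted scene type is the index of the largest entry of the prediction vector,
# looked up in a preset index-to-scene-type table (the table contents are illustrative).
model = SceneClassifier()
scene_feature_map = torch.rand(1, 1, 64, 64)
scene_index = model(scene_feature_map).argmax(dim=1).item()
```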
Optionally, after the scene classification device obtains the scene type of the review video, the scene type of the review video is output, where the scene classification device may output the scene type of the review video in a text manner, or in an image or voice manner, which is not limited herein.
In this method, video images are captured from the face-examination video and used to obtain the scene feature map of the video, and the face-examination video is classified by scene according to the scene feature map, thereby improving the scene classification efficiency of face-examination videos.
Fig. 3A is a schematic flowchart of a scene classification method for a review video according to an embodiment of the present application, where the scene classification method for a review video is applied to a server, and the method is executed by application software installed in the server.
As shown in fig. 3A, the method includes steps S201 to S208.
S201, training the portrait inhibitor by using the face examination image in the database and the portrait mask corresponding to the face examination image, wherein the characteristic value of the portrait area in the face examination image corresponding to the portrait mask is b, and the characteristic value of the non-portrait area in the face examination image corresponding to the portrait mask is c.
The scene classification device trains the portrait inhibitor by using the face-examination images in the database and the portrait masks corresponding to the face-examination images, wherein the value of the portrait mask over the portrait area of the face-examination image is b and its value over the non-portrait area is c. The portrait mask is itself a feature map and is the ideal output result of the portrait inhibitor.
It is understood that the face-examination images and the portrait masks in the database are the training materials for the portrait inhibitor. The face-examination image in the database is the input data, and the portrait mask is the output result the scene classification device expects to achieve. The greater the number of face-examination images and corresponding portrait masks, the easier it is to train the portrait inhibitor.
Optionally, b is 0 and c is 1. This implementation helps train the portrait inhibitor to remove the portrait features in the recognition image while retaining its non-portrait features. It can be understood that multiplying a feature value by 1 leaves the original image feature unchanged, while multiplying it by 0 sets it to 0; the feature values of the portrait area are therefore all converted to 0, which eliminates the portrait features.
S202, adjusting the loss function of the portrait inhibitor until the similarity between the feature map obtained by inputting the face-examination image into the portrait inhibitor and the portrait mask is greater than a threshold value.
The scene classification device adjusts the loss function of the portrait inhibitor until the similarity between the feature map obtained by inputting the face-examination image into the portrait inhibitor and the portrait mask is greater than the threshold value. The loss function is used to measure the degree of inconsistency between the output result of the portrait inhibitor and the portrait mask; it is an important parameter of the neural network and is related to the accuracy of the data output by the portrait inhibitor. The scene classification device feeds the training materials into the portrait inhibitor multiple times and adjusts the loss function of the portrait inhibitor until the similarity between the output result and the portrait mask is greater than the threshold value. Specifically, the scene classification device takes the difference between the output result and the portrait mask at each position, adds up the absolute values of these differences, and determines that the similarity between the output result and the portrait mask is greater than the threshold value if the sum is smaller than a preset error value. The similarity may be the reciprocal of the sum, and the threshold value may be adjusted according to the actual situation, which is not limited herein.
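A sketch of this training criterion as described above: an L1-style loss over the per-position differences between the suppressor output and the portrait mask, with the similarity taken as the reciprocal of the summed absolute differences and compared against the threshold. The PyTorch framework, the Adam optimizer, and the bounded epoch loop are assumptions added for illustration.

```python
import torch
import torch.nn as nn


def mask_similarity(output, portrait_mask):
    """Similarity used as the stopping criterion: the reciprocal of the sum of the
    absolute differences between the suppressor output and the portrait mask."""
    total = (output - portrait_mask).abs().sum()
    return 1.0 / (total + 1e-8)             # small epsilon avoids division by zero


def train_suppressor(suppressor, images, masks, threshold, max_epochs=100, lr=1e-3):
    """Adjust the portrait suppressor until its outputs reach the similarity threshold."""
    optimizer = torch.optim.Adam(suppressor.parameters(), lr=lr)
    criterion = nn.L1Loss(reduction="sum")   # matches the summed-absolute-difference measure
    for _ in range(max_epochs):
        worst_similarity = float("inf")
        for image, mask in zip(images, masks):
            optimizer.zero_grad()
            output = suppressor(image)
            loss = criterion(output, mask)
            loss.backward()
            optimizer.step()
            worst_similarity = min(worst_similarity,
                                   mask_similarity(output.detach(), mask).item())
        if worst_similarity > threshold:     # similarity greater than the threshold: stop
            break
    return suppressor
```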
Fig. 3B is a schematic view of a scene classification process of a face-examination video according to the present application, in which a portrait mask is shown, and the first image is an identification image spliced by the scene classification device when N is 4. In Fig. 3B, b is 0 and c is 1. The portrait mask is the output data of the portrait suppressor in an ideal state.
It will be appreciated that the portrait area in the output data does not coincide exactly with the portrait area in the input data. Even when b is 0, the scene classification device cannot completely remove the portrait features by multiplying the portrait suppression map output by the portrait suppressor with the feature map. The scene classification device can therefore further extract scene features through the feature extractor in the scene classification model.
S203, acquiring at least one video image from the face-examination video.
The scene classification device acquires at least one video image from the face-up video.
S204, processing at least one video image to obtain an identification image for scene classification.
The scene classification device processes at least one video image to obtain an identification image for scene classification. It should be noted that, in the case of only one video image, the scene classification apparatus performs preprocessing, such as noise reduction, on the video image, so as to more effectively extract the image features of the video image. And under the condition that at least two video images exist, the scene classification device splices the at least two video images, or splices and denoises the at least two video images to obtain the identification image.
And S205, carrying out image feature extraction on the identification image to obtain a feature map of the identification image.
And the scene classification device extracts the image characteristics of the identification image to obtain a characteristic image of the identification image. The image feature extraction is used for converting the image into data, performing feature mapping on pixels in the image, and representing the brightness of each area in the original image by using pixel values from 0 to 1 to obtain a two-dimensional pixel matrix, which is also called a feature map. Here, the pixel value "1" represents white, and the pixel value "0" represents black. The feature map of the recognition image includes features of the portrait area and features of the non-portrait area.
S206, carrying out image feature processing on the recognition image, controlling the average feature value of the portrait area in the recognition image to be (0, a), and controlling the average feature value of the non-portrait area in the recognition image to be [ a, 1), so as to obtain a portrait suppression map of the recognition image.
The scene classification device performs image feature processing on the recognition image, controls the average feature value of the portrait area in the recognition image to be (0, a), and controls the average feature value of the non-portrait area in the recognition image to be [ a, 1), thereby obtaining a portrait suppression map of the recognition image.
And S207, multiplying the feature map of the identification image with the portrait suppression map to obtain a scene feature map.
The scene classification device multiplies the feature map of the identification image by the portrait suppression map to obtain a scene feature map. It can be understood that, in the scene feature map obtained by this multiplication, the feature values in the portrait area are reduced by a larger proportion than those in the non-portrait area. If the scene feature map and the identification image were converted into color images, the brightness of the portrait area in the scene feature map would be reduced. The non-portrait features in the scene feature map are highlighted; these include the scene features, while the portrait features are attenuated, which facilitates the classification of the face-examination video.
And S208, classifying the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video.
The scene classification device classifies the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video, and the scene classification model is used for predicting the scene type of the scene characteristic graph. The scene feature map is obtained by inhibiting the portrait features, and the scene feature map is predicted by using the scene classification model, so that the accuracy of prediction is improved.
In this method, video images are captured from the face-examination video and used to obtain the scene feature map of the video, and the face-examination video is classified by scene according to the scene feature map, thereby improving the scene classification efficiency of face-examination videos.
The embodiment of the application also provides a scene classification device of the review video, which is used for executing any embodiment of the scene classification method of the review video. Specifically, referring to fig. 4, fig. 4 is a schematic block diagram of a scene classification device for a review video according to an embodiment of the present application. The scene classification device for the review video can be configured in the server.
As shown in fig. 4, the scene classification apparatus for a review video includes: an obtaining unit 401, configured to obtain at least one video image from a face-up video; a first processing unit 402, configured to process at least one video image to obtain an identification image for scene classification; an extracting unit 403, configured to perform image feature extraction on the identification image to obtain a feature map of the identification image; a second processing unit 404, configured to perform image feature processing on the recognition image, control the average feature value of the portrait area in the recognition image to be (0, a), and control the average feature value of the non-portrait area in the recognition image to be [ a, 1), so as to obtain a portrait suppression map of the recognition image; a multiplying unit 405, configured to multiply the feature map of the identification image with the portrait suppression map to obtain a scene feature map; and the classifying unit 406 is configured to classify the scene feature map through a scene classification model to obtain a scene type of the review video.
In an optional implementation manner, the obtaining unit 401 is specifically configured to capture a plurality of video images from the face-examination video according to a video capture period; the video capture period is determined according to the duration of the face-examination video, and the number of the video images satisfies: (N + 1) × W > Y, where N is the number of video images, W is the video capture period, Y is the duration of the face-examination video, and N is an integer greater than 1.
In an alternative implementation manner, the first processing unit 402 is specifically configured to tile N video images to obtain an identification image for scene classification, where the N video images do not have overlapping pixels in the identification image, and an interval between adjacent video images is not more than 1 pixel.
In an alternative implementation manner, the second processing unit 404 is specifically configured to input the identification image to the portrait inhibitor, so as to obtain a portrait inhibition map; the portrait suppressor is a trained two-dimensional convolutional neural network and is used for recognizing a portrait area and a non-portrait area of the recognition image, converting the portrait area of the recognition image into a characteristic value b and converting the non-portrait area of the recognition image into a characteristic value c, wherein 0 ≤ b < a ≤ 1.
In an optional implementation, the apparatus further includes: a training unit 407, configured to train the portrait inhibitor using the face-examination image in the database and a portrait mask corresponding to the face-examination image, where the feature value of the portrait mask corresponding to the portrait area of the face-examination image is b, and the feature value corresponding to the non-portrait area is c; and to adjust the loss function of the portrait inhibitor until the similarity between the characteristic graph obtained by inputting the face-examination image into the portrait inhibitor and the portrait mask is greater than a threshold value.
In an optional implementation manner, the scene classification model includes a feature extractor, a pooling layer, and a full-connection layer, and the feature extractor is a two-dimensional convolutional neural network for extracting scene features.
In an alternative implementation manner, the classifying unit 406 is specifically configured to: extracting scene features from the scene feature map through a feature extractor; sampling and compressing the scene characteristics through the pooling layer to obtain dimension reduction scene characteristics; inputting the dimension reduction scene characteristics to a full connection layer to obtain a prediction vector of the scene characteristics; and obtaining the scene type of the face-up video according to the preset corresponding relation between the prediction vector and the scene type.
With this device, video images are captured from the face-examination video and used to obtain the scene feature map of the video, and the face-examination video is classified by scene according to the scene feature map, thereby improving the scene classification efficiency of face-examination videos.
The above-described scene classification means for reviewing videos may be implemented in the form of a computer program that may be run on a device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of an apparatus provided in an embodiment of the present application. The device 500 is a server, which may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 5, the device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a scene classification method for face-view video.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a scene classification method for face-examination videos.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the subject application and does not constitute a limitation on the device 500 to which the subject application is applied, and that a particular device 500 may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the scene classification method for the review video disclosed in the embodiment of the present application.
Those skilled in the art will appreciate that the embodiment of the apparatus shown in fig. 5 does not constitute a limitation on the specific construction of the apparatus, and in other embodiments the apparatus may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are the same as those of the embodiment shown in fig. 5, and are not described herein again.
It should be understood that in the embodiments of the present application, the processor 502 may be a Central Processing Unit (CPU), and the processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the scene classification method for a review video disclosed in the embodiments of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media that can store program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A scene classification method of a face examination video is characterized by comprising the following steps:
acquiring at least one video image from a face-examination video;
processing the at least one video image to obtain an identification image for scene classification;
carrying out image feature extraction on the identification image to obtain a feature map of the identification image;
performing image feature processing on the identification image, controlling the average feature value of a portrait area in the identification image to be (0, a), and controlling the average feature value of a non-portrait area in the identification image to be [ a, 1), so as to obtain a portrait suppression map of the identification image;
multiplying the feature map of the identification image with the portrait suppression map to obtain a scene feature map;
and classifying the scene characteristic graph through a scene classification model to obtain the scene type of the face examination video.
2. The method of claim 1, wherein the obtaining at least one video image from the review video comprises:
intercepting a plurality of video images from the face-examination video according to a video intercepting period;
the video capture period is determined according to the duration of the face-examination video, and the number of the video images satisfies: (N + 1) × W > Y, where N is the number of video images, W is the video capture period, Y is the duration of the face-examination video, and N is an integer greater than 1.
3. The method of claim 2, wherein processing the at least one video image to obtain an identification image for scene classification comprises:
and splicing the N video images in a tiled mode to obtain the identification image for scene classification, wherein the N video images have no overlapped pixels in the identification image, and the interval between the adjacent video images does not exceed 1 pixel.
4. The method according to claim 2, wherein the obtaining of the portrait suppression map of the recognition image by performing image feature processing on the recognition image, controlling an average feature value of a portrait area in the recognition image to be (0, a), and controlling an average feature value of a non-portrait area in the recognition image to be [ a, 1), comprises:
inputting the identification image into a portrait inhibitor to obtain a portrait inhibition map;
the portrait suppressor is a trained two-dimensional convolutional neural network, and is used for recognizing a portrait area and a non-portrait area of the recognition image, converting the portrait area of the recognition image into a characteristic value b, and converting the non-portrait area of the recognition image into a characteristic value c, wherein 0 ≤ b < a ≤ 1.
5. The method of claim 4, wherein before the acquiring at least one video image from the face-examination video, the method further comprises:
training the portrait suppressor by using a face-examination image in a database and a portrait mask corresponding to the face-examination image, wherein in the portrait mask the feature value corresponding to the portrait area of the face-examination image is b and the feature value corresponding to the non-portrait area of the face-examination image is c;
and adjusting the portrait suppressor according to its loss function until the similarity between the portrait mask and the feature map obtained by inputting the face-examination image into the portrait suppressor is greater than a threshold value.
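
One way the training step of claim 5 could be realised is a pixel-wise regression of the suppressor output toward the portrait mask (value b on portrait pixels, c elsewhere), stopping once the similarity to the mask exceeds a threshold. The mean-squared-error loss, the Adam optimiser, and the similarity measure used below are assumptions; the claim does not fix any of them.

import torch
import torch.nn as nn

def train_suppressor(suppressor, loader, b=0.1, c=0.9,
                     similarity_threshold=0.95, lr=1e-3, max_epochs=50):
    """Fit the portrait suppressor to the portrait masks (claim 5, sketch only).

    `loader` is assumed to yield (image, mask) pairs in which the mask is 1 on
    portrait pixels and 0 elsewhere; it is rescaled to the target values
    b (portrait) and c (non-portrait) before computing the loss.
    """
    optimiser = torch.optim.Adam(suppressor.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(max_epochs):
        similarities = []
        for image, mask in loader:
            target = mask * b + (1.0 - mask) * c
            prediction = suppressor(image)
            loss = criterion(prediction, target)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            # Crude similarity proxy: fraction of pixels within 0.1 of the mask target.
            similarities.append(((prediction - target).abs() < 0.1).float().mean().item())
        if similarities and sum(similarities) / len(similarities) > similarity_threshold:
            break  # similarity between suppressor output and mask exceeds the threshold
    return suppressor
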
6. The method of claim 1, wherein the scene classification model comprises a feature extractor, a pooling layer and a fully connected layer, and the feature extractor is a two-dimensional convolutional neural network for extracting scene features.
7. The method of claim 6, wherein the classifying the scene feature map through a scene classification model to obtain the scene type of the face-examination video comprises:
extracting, by the feature extractor, scene features from the scene feature map;
sampling and compressing the scene features through the pooling layer to obtain dimension-reduced scene features;
inputting the dimension-reduced scene features into the fully connected layer to obtain a prediction vector of the scene features;
and obtaining the scene type of the face-examination video according to a preset correspondence between the prediction vector and the scene type.
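
A compact PyTorch sketch of the scene classification model described in claims 6 and 7: a two-dimensional convolutional feature extractor, a pooling layer that samples and compresses the scene features, and a fully connected layer that yields the prediction vector, followed by a preset index-to-scene-type lookup. The channel counts, the number of classes, and the example scene-type names are all assumptions.

import torch
import torch.nn as nn

# Hypothetical preset correspondence between prediction-vector index and scene type.
SCENE_TYPES = {0: "office", 1: "home", 2: "outdoor", 3: "vehicle"}

class SceneClassifier(nn.Module):
    """Feature extractor + pooling layer + fully connected layer (claim 6 sketch)."""

    def __init__(self, in_channels=3, num_classes=len(SCENE_TYPES)):
        # in_channels must match the channel count of the incoming scene feature map.
        super().__init__()
        self.feature_extractor = nn.Sequential(   # 2-D CNN for scene features
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)       # sample and compress the features
        self.fc = nn.Linear(64, num_classes)      # produce the prediction vector

    def forward(self, scene_feature_map):
        features = self.feature_extractor(scene_feature_map)
        reduced = self.pool(features).flatten(1)  # dimension-reduced scene features
        return self.fc(reduced)                   # prediction vector

def scene_type_from_prediction(prediction_vector):
    """Map a single prediction vector to its scene type via the preset correspondence."""
    return SCENE_TYPES[int(prediction_vector.argmax(dim=1).item())]
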
8. A scene classification apparatus for a face-examination video, comprising:
an acquisition unit, configured to acquire at least one video image from the face-examination video;
a first processing unit, configured to process the at least one video image to obtain an identification image for scene classification;
an extraction unit, configured to perform image feature extraction on the identification image to obtain a feature map of the identification image;
a second processing unit, configured to perform image feature processing on the identification image, control the average feature value of a portrait area in the identification image to fall within (0, a), and control the average feature value of a non-portrait area in the identification image to fall within [a, 1), so as to obtain a portrait suppression map of the identification image;
a multiplication unit, configured to multiply the feature map of the identification image by the portrait suppression map to obtain a scene feature map;
and a classification unit, configured to classify the scene feature map through a scene classification model to obtain the scene type of the face-examination video.
9. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the scene classification method for a face-examination video according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the scene classification method for a face-examination video according to any one of claims 1 to 7.
CN202110610391.9A 2021-06-01 2021-06-01 Scene classification method, device and equipment for face-check video and storage medium Active CN113221835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610391.9A CN113221835B (en) 2021-06-01 2021-06-01 Scene classification method, device and equipment for face-check video and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110610391.9A CN113221835B (en) 2021-06-01 2021-06-01 Scene classification method, device and equipment for face-check video and storage medium

Publications (2)

Publication Number Publication Date
CN113221835A (en) 2021-08-06
CN113221835B CN113221835B (en) 2023-06-20

Family

ID=77082423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610391.9A Active CN113221835B (en) Scene classification method, device and equipment for face-check video and storage medium

Country Status (1)

Country Link
CN (1) CN113221835B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120141019A1 (en) * 2010-12-07 2012-06-07 Sony Corporation Region description and modeling for image subscene recognition
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
US20200027200A1 (en) * 2018-07-20 2020-01-23 Stefano Celestini Systems, methods, and storage media for anonymizing at least one feature in a scene
CN111275107A (en) * 2020-01-20 2020-06-12 西安奥卡云数据科技有限公司 Multi-label scene image classification method and device based on transfer learning

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063612A (en) * 2022-05-27 2022-09-16 平安科技(深圳)有限公司 Fraud early warning method, device, equipment and storage medium based on face-check video

Also Published As

Publication number Publication date
CN113221835B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
US9305359B2 (en) Image processing method, image processing apparatus, and computer program product
Vitoria et al. Semantic image inpainting through improved wasserstein generative adversarial networks
Chen et al. Statistical skin color detection method without color transformation for real-time surveillance systems
Rubel et al. Efficiency of DCT-based denoising techniques applied to texture images
JP2020127194A (en) Computer system and program
CN112651953A (en) Image similarity calculation method and device, computer equipment and storage medium
CN113554739A (en) Relighting image generation method and device and electronic equipment
CN113221835A (en) Scene classification method, device, equipment and storage medium for face-check video
US11798227B2 (en) Image processing apparatus and image processing method
Ravi et al. Forensic analysis of linear and nonlinear image filtering using quantization noise
Sasi et al. Shadow detection and removal from real images: state of art
CN114863450B (en) Image processing method, device, electronic equipment and storage medium
CN111736988A (en) Heterogeneous acceleration method, equipment and device and computer readable storage medium
CN111105369A (en) Image processing method, image processing apparatus, electronic device, and readable storage medium
Suganya et al. Gradient flow-based deep residual networks for enhancing visibility of scenery images degraded by foggy weather conditions
Kahatapitiya et al. Context-aware automatic occlusion removal
CN113591838B (en) Target detection method, device, electronic equipment and storage medium
Chen et al. Nested error map generation network for no-reference image quality assessment
Zhu et al. Multi-feature fusion algorithm in VR panoramic image detail enhancement processing
Ju et al. Vrohi: Visibility recovery for outdoor hazy image in scattering media
KR20140138046A (en) Method and device for processing a picture
KR20230086996A (en) A computer program that performs the noise addition process
Spagnolo et al. Approximate bilateral filters for real-time and low-energy imaging applications on FPGAs
Nishikawa et al. Consideration on performance improvement of shadow and reflection removal based on gmm
Nguyen et al. An Image Forgery Detection Solution based on DCT Coefficient Analysis.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant