CN114429612A - Scene recognition method and device, computer equipment and storage medium


Info

Publication number
CN114429612A
Application number
CN202011094088.XA
Authority
CN (China)
Prior art keywords
scene, video stream, track detection, track, identified
Legal status
Pending
Other languages
Chinese (zh)
Inventor
朱朝
Current Assignee
SF Technology Co Ltd
Application filed by SF Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application relates to a scene recognition method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a video stream corresponding to a scene to be recognized; performing track detection on the video stream according to a trained target detection model to obtain a track detection result corresponding to the scene to be recognized; when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into a trained scene classification model to obtain a scene classification result corresponding to the video stream; and determining a scene recognition result according to the scene classification result corresponding to the video stream. With this method, a scene classification result for the scene to be recognized can be obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined from the scene classification result, improving recognition efficiency.

Description

Scene recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a scene recognition method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, violent sorting recognition technology has emerged in the logistics field. It is mainly used to accurately and quickly determine whether violent sorting behavior exists in application scenes such as warehouses, so that timely and accurate reminders and guidance can be given.
In the conventional technology, violent sorting recognition is performed mainly on scene images captured by monitoring cameras.
However, because violent sorting recognition must be run on the scene images captured by every monitoring camera, the conventional technology suffers from low recognition efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a scene recognition method, apparatus, computer device and storage medium capable of improving the efficiency of violent sorting recognition.
A method of scene recognition, the method comprising:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into the trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
In one embodiment, before performing trajectory detection on a video stream according to a trained target detection model to obtain a trajectory detection result corresponding to a scene to be recognized, the method further includes:
acquiring a sample fusion image carrying violent sorting track labels;
and training the initial target detection model according to the sample fusion image to obtain a trained target detection model.
In one embodiment, performing track detection on a video stream according to a trained target detection model, and obtaining a track detection result corresponding to a scene to be recognized includes:
carrying out image fusion on each video frame in the video stream to obtain a fusion image corresponding to a scene to be identified;
inputting the fusion image into a trained target detection model for track detection to obtain a track detection result to be screened;
and carrying out threshold value screening on the track detection result to be screened to obtain a track detection result corresponding to the scene to be identified.
In one embodiment, the threshold screening of the track detection result to be screened to obtain the track detection result corresponding to the scene to be identified includes:
screening the track detection boxes to be screened in the track detection result to be screened according to a preset detection-box threshold to obtain candidate track detection boxes corresponding to the scene to be identified;
performing IOU (Intersection over Union) threshold filtering on the candidate track detection boxes according to the confidences of the candidate track detection boxes to obtain a target track detection box corresponding to the scene to be identified;
and obtaining the track detection result corresponding to the scene to be identified according to the target track detection box.
In one embodiment, before randomly extracting a video frame from a video stream and inputting the video frame into a trained scene classification model to obtain a scene classification result corresponding to the video stream, the method further includes:
acquiring a classification sample image carrying a class label, wherein the class labels include glitched screen or black screen, contains an object, and contains no object;
and training the initial scene classification model according to the classification sample image to obtain a trained scene classification model.
In one embodiment, determining the scene recognition result according to the scene classification result corresponding to the video stream comprises:
and when the scene classification result corresponding to the video stream is that an object is contained, updating a preset statistical value and returning to the step of acquiring the video stream corresponding to the scene to be identified, until the preset statistical value equals a preset number threshold, whereupon the scene recognition result is determined to be scene compliance.
In one embodiment, the scene recognition method further includes:
periodically extracting a video frame to be identified corresponding to a scene to be identified;
inputting the video frame to be recognized into the trained scene classification model to obtain a scene classification result corresponding to the video frame to be recognized;
and when the scene classification result corresponding to the video frame to be identified is a glitched screen or a black screen, determining the scene recognition result to be scene non-compliance.
A scene recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video stream corresponding to a scene to be identified;
the track detection module is used for carrying out track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be identified;
the classification module is used for randomly extracting video frames from the video stream and inputting them into the trained scene classification model when the track detection result indicates that a track exists, so as to obtain a scene classification result corresponding to the video stream;
and the processing module is used for determining a scene recognition result according to the scene classification result corresponding to the video stream.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into the trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into the trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
According to the above scene recognition method and device, computer equipment and storage medium, the trained target detection model performs track detection on the video stream corresponding to the scene to be recognized; on the basis that the track detection result indicates that a track exists, the trained scene classification model further classifies the video frames randomly extracted from the video stream. The scene classification result of the scene to be recognized is thus obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined according to the scene classification result, improving recognition efficiency.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for scene recognition in one embodiment;
FIG. 2 is a diagram of a method for scene recognition in one embodiment;
FIG. 3 is a flow chart illustrating a scene recognition method according to another embodiment;
FIG. 4 is a block diagram showing the structure of a scene recognition apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a scene recognition method is provided. This embodiment is illustrated by applying the method to a server; it is to be understood that the method may also be applied to a terminal, or to a system including the terminal and the server, implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
step 102, obtaining a video stream corresponding to a scene to be identified.
The scene to be identified is the scene monitored by its corresponding monitoring camera; for example, it may be a warehouse. The video stream is the video data collected by the camera corresponding to the scene to be identified.
Specifically, the server may obtain the video stream corresponding to the scene to be identified directly from that camera. Alternatively, the camera collects the video stream in real time and periodically sends it to a preset database, from which the server can obtain the video stream.
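As a concrete illustration, such a stream can be read frame by frame with OpenCV; a minimal sketch follows, in which the RTSP URL and the six-frame burst size are illustrative assumptions rather than values fixed by this application.

```python
# Sketch: pull a short burst of consecutive frames from a monitoring camera.
import cv2

def grab_video_stream(rtsp_url: str, num_frames: int = 6) -> list:
    """Read consecutive frames from the camera stream."""
    cap = cv2.VideoCapture(rtsp_url)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break  # stream dropped; caller may retry or flag the camera
        frames.append(frame)
    cap.release()
    return frames

frames = grab_video_stream("rtsp://camera.example/stream")  # hypothetical URL
```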
And 104, performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized.
The target detection model is a model for detecting a target; in the present application, the target is an object motion track. The track detection result reflects whether a track meeting a preset requirement exists in the video stream. When such a track exists, the track detection result includes detection boxes and their confidences, and the result indicates that a track exists; when no such track exists, the track detection result is empty and indicates that no track exists. The preset requirement can be set as needed.
Specifically, the server fuses the video frames of the video stream in time order to obtain a fused image, and then performs track detection on the fused image with the trained target detection model. If a track exists in the fused image, the target detection model marks several detection boxes and their confidences on the fused image, and the detection boxes are then screened according to their confidences to obtain the track detection result.
Specifically, when performing image fusion, the server may compute an optical-flow-style trajectory image from the video frames; the fusion method is not limited in the present application. Taking the fusion of six consecutive frames as an example: the server obtains six consecutive video frames and converts them to grayscale; it then adds the first and second frames and divides by 2, adds the third and fourth frames and divides by 2, and adds the fifth and sixth frames and divides by 2; finally, the three results are used as the RGB three channels of the fused image.
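The six-frame fusion just described can be sketched with OpenCV and NumPy as follows; the pairwise averaging and channel assignment follow the text, while the function name is an assumption of this sketch.

```python
# Sketch of the six-frame fusion: grayscale conversion, pairwise averaging,
# and stacking the three averages as the channels of the fused image.
import cv2
import numpy as np

def fuse_six_frames(frames: list) -> np.ndarray:
    """frames: six consecutive BGR video frames of identical size."""
    assert len(frames) == 6
    # uint16 avoids overflow when adding two uint8 images before halving
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.uint16) for f in frames]
    channels = [((grays[i] + grays[i + 1]) // 2).astype(np.uint8) for i in (0, 2, 4)]
    return cv2.merge(channels)  # three pairwise averages as a 3-channel image
```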
Specifically, the target detection network may be a Yolov3 detection network, although the target detection network is not limited here. Taking the Yolov3 detection network as an example: it uses darknet-53 as the feature-extraction backbone, and then uses successive 3×3 and 1×1 convolutional layers together with several shortcut connections to predict, for each box, the four box coordinates, the box confidence, and the probability of each category (in this application, the category is mainly the track). Yolov3 predicts on three feature maps: the first is 13×13, the second 26×26, and the third 52×52, and each feature-map cell predicts 3 detection boxes. The approximate receptive fields of the three Yolov3 output layers are 85×85, 181×181, and 365×365, so the three feature maps can detect objects of different sizes; an object of a given size needs to be predicted on only one layer, which reduces the number of network parameters.
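As a quick sanity check of the geometry described above, the following snippet counts the candidate boxes produced by the three Yolov3 output layers (4 coordinates + 1 confidence + 1 class score per box, with the single class here being the track):

```python
# Back-of-the-envelope check of the Yolov3 output geometry described above.
num_classes = 1  # the only target category here is a motion track
per_box = 4 + 1 + num_classes                    # coords + confidence + class
total_boxes = sum(s * s * 3 for s in (13, 26, 52))
print(total_boxes)            # 10647 candidate boxes per fused image
print(total_boxes * per_box)  # raw prediction values before any filtering
```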
And 106, when the track detection result represents that a track exists, randomly extracting video frames from the video stream and inputting the video frames into the trained scene classification model to obtain a scene classification result corresponding to the video stream.
Specifically, when the track detection result indicates that a track exists, the server randomly extracts video frames from the video stream, crops them according to the detection box in the track detection result to obtain the image to be input, and inputs this image into the trained scene classification model to obtain the scene classification result corresponding to the video stream. The trained scene classification model performs scene recognition on the input image; the possible scene categories are contains an object and contains no object. Besides classifying whether an object is contained, the scene classification model can also classify whether a video frame shows a glitched screen or a black screen. Here, containing an object means containing an express item, a parcel, or the like.
Specifically, the scene classification model is not limited here; preferably, it may be an EfficientNet-d0 classification model. Taking the EfficientNet-d0 classification model as an example: when the track detection result indicates that a track exists, the server crops the randomly extracted video frame according to the detection box in the track detection result to obtain the image to be input, and then inputs that image into the trained EfficientNet-d0 classification model to obtain the scene classification result corresponding to the video stream. Compared with other existing convolutional neural networks on ImageNet, the EfficientNet-d0 classification model achieves higher accuracy and higher efficiency for the same input size, and reduces the parameter count and FLOPS by an order of magnitude.
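A minimal sketch of this classification step is given below; `scene_classifier` stands in for the trained EfficientNet-d0 (or similar) model, and the label strings are illustrative assumptions.

```python
# Sketch: crop the randomly drawn video frame to the detected track box and
# run the scene classifier on the crop.
import random

def classify_scene(video_frames, box, scene_classifier):
    """box: (x1, y1, x2, y2) from the track detection result."""
    frame = random.choice(video_frames)   # randomly extracted video frame
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]            # the image to be input
    return scene_classifier(crop)         # e.g. "contains_object", "no_object",
                                          # "glitched_screen", "black_screen"
```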
And 108, determining a scene recognition result according to the scene classification result corresponding to the video stream.
The scene recognition result is either scene compliance or scene non-compliance. Scene compliance means the scene to be recognized meets the requirements: it involves throwing and sorting, so violent sorting recognition needs to be performed on the scene images captured by the corresponding camera. Scene non-compliance means the scene to be identified does not meet the requirements: it involves no throwing or sorting, so violent sorting recognition of the scene images captured by the corresponding camera is unnecessary.
Specifically, when the scene classification result corresponding to the video stream is that no object is contained, the server may directly determine the scene recognition result to be scene non-compliance. When the result is that an object is contained, the server counts the result, returns to the step of acquiring the video stream corresponding to the scene to be identified, and judges again; when the counted number of contains-an-object classification results reaches a preset number threshold, the scene recognition result is determined to be scene compliance, and otherwise scene non-compliance.
According to the above scene recognition method, the trained target detection model performs track detection on the video stream corresponding to the scene to be recognized; on the basis that the track detection result indicates that a track exists, the trained scene classification model further classifies the video frames randomly extracted from the video stream. The scene classification result of the scene to be recognized is thus obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined according to the scene classification result, improving recognition efficiency.
In one embodiment, before performing trajectory detection on a video stream according to a trained target detection model to obtain a trajectory detection result corresponding to a scene to be recognized, the method further includes:
acquiring a sample fusion image carrying violent sorting track labels;
and training the initial target detection model according to the sample fusion image to obtain a trained target detection model.
The violent sorting track label marks, on the sample fusion image, the track of an object that was violently sorted. The sample fusion image is obtained by fusing consecutive video frames of a sample video stream; the sample video stream is video data used to determine whether a violent sorting operation exists. The sample video streams, sample fusion images, and the like can be stored in a preset database in advance.
Specifically, the server may obtain sample fusion images carrying violent sorting track labels from the preset database, and perform supervised training of the initial target detection model with the sample fusion images as input and the carried violent sorting track labels as supervision, yielding the trained target detection model. Taking the Yolov3 detection network as the target detection model: preferably, the server may resize the images to 608×416 and perform iterative back-propagation training with 32 sample fusion images per batch; the Yolov3 detection network outputs predictions for multiple boxes, and each box requires computation of a box-coordinate loss, a box-confidence loss, and a category-classification loss.
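For illustration, the three per-box losses named above might be composed as follows in PyTorch; the specific loss functions and their equal weighting are assumptions of this sketch, since the patent does not fix them.

```python
# Sketch of the per-box training loss: coordinate, confidence, and class terms.
import torch.nn.functional as F

def yolo_box_loss(pred_xywh, true_xywh, pred_obj, true_obj, pred_cls, true_cls):
    coord_loss = F.mse_loss(pred_xywh, true_xywh)                       # box coordinates
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, true_obj)   # box confidence
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, true_cls)   # category
    return coord_loss + obj_loss + cls_loss
```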
In this embodiment, the training of the target detection model can be realized by acquiring the sample fusion image carrying the violent sorting track annotation and training the initial target detection model according to the sample fusion image to obtain the trained target detection model.
In one embodiment, performing track detection on a video stream according to a trained target detection model, and obtaining a track detection result corresponding to a scene to be recognized includes:
carrying out image fusion on each video frame in the video stream to obtain a fusion image corresponding to a scene to be identified;
inputting the fusion image into a trained target detection model for track detection to obtain a track detection result to be screened;
and carrying out threshold value screening on the track detection result to be screened to obtain a track detection result corresponding to the scene to be identified.
The track detection result to be screened is the detection result output by the target detection model after performing track detection on the fused image; it comprises several detection boxes and their confidences.
Specifically, during track detection the server fuses the video frames of the video stream into a fused image corresponding to the scene to be recognized and inputs the fused image into the trained target detection model for track detection. The model outputs several detection boxes and their confidences according to the fused image; the server takes these as the track detection result to be screened and applies threshold screening to obtain the track detection result corresponding to the scene to be recognized. The screening mainly uses a preset detection-box threshold, an IOU threshold, and the confidences of the detection boxes; detection boxes whose confidence exceeds the preset detection-box threshold are retained, achieving accurate track detection. The preset detection-box threshold can be set as needed.
In the embodiment, each video frame in the video stream is subjected to image fusion, the fusion image is input into the trained target detection model for track detection, a track detection result to be screened is obtained, the track detection result to be screened is subjected to threshold screening, a track detection result corresponding to a scene to be identified is obtained, and an accurate track detection result can be obtained.
In one embodiment, the threshold screening of the track detection result to be screened to obtain the track detection result corresponding to the scene to be identified includes:
screening the track detection boxes to be screened in the track detection result to be screened according to a preset detection-box threshold to obtain candidate track detection boxes corresponding to the scene to be identified;
performing IOU threshold filtering on the candidate track detection boxes according to the confidences of the candidate track detection boxes to obtain a target track detection box corresponding to the scene to be identified;
and obtaining the track detection result corresponding to the scene to be identified according to the target track detection box.
Specifically, as shown in fig. 2, the server may first screen the track detection boxes to be screened according to the preset detection-box threshold (the score threshold in fig. 2) and the confidences of the boxes, obtaining the candidate track detection boxes (the remaining boxes in fig. 2) corresponding to the scene to be identified. It then performs IOU threshold filtering on the candidate boxes according to their confidences, determines the target track detection box (the result with the largest predicted score in fig. 2), and uses the target track detection box as the track detection result corresponding to the scene to be identified. Preferably, the score threshold may be 0.2.
Specifically, the IOU threshold filtering may proceed as follows: sort the candidate track detection boxes by confidence and select the one with the highest score; traverse the remaining candidates, deleting any candidate whose overlap with the highest-scoring box exceeds a preset ratio threshold; then select the highest-scoring box among the remaining undeleted candidates and repeat the traversal, until all candidate boxes have been processed, yielding the target track detection box corresponding to the scene to be recognized. The preset ratio threshold can be set as needed; in this way, redundant detection boxes among the candidates are filtered out.
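The procedure just described is standard greedy non-maximum suppression; a minimal sketch follows, in which the (x1, y1, x2, y2) box format and the 0.5 IOU ratio threshold are assumptions, while the 0.2 score threshold comes from the text.

```python
# Sketch: score-threshold screening followed by greedy IOU suppression (NMS).

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def filter_detections(boxes, scores, score_thr=0.2, iou_thr=0.5):
    keep = [i for i, s in enumerate(scores) if s > score_thr]  # score screening
    keep.sort(key=lambda i: scores[i], reverse=True)           # highest score first
    result = []
    while keep:
        best = keep.pop(0)
        result.append(best)
        # delete candidates that overlap the best box beyond the ratio threshold
        keep = [i for i in keep if iou(boxes[best], boxes[i]) <= iou_thr]
    return result  # indices of the target track detection boxes
```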
In this embodiment, by screening the track detection boxes to be screened against the preset detection-box threshold before applying the IOU threshold, redundant detection boxes can be filtered out while reducing the computational cost of the IOU threshold filtering.
In one embodiment, before randomly extracting a video frame from a video stream and inputting the video frame into a trained scene classification model to obtain a scene classification result corresponding to the video stream, the method further includes:
acquiring a classification sample image carrying a class label, wherein the class labels include glitched screen or black screen, contains an object, and contains no object;
and training the initial scene classification model according to the classification sample image to obtain a trained scene classification model.
Specifically, the server obtains classification sample images carrying class labels from the preset database, performs supervised training of the initial scene classification model with the classification sample images as input and the class labels as supervision, and obtains the trained scene classification model through back propagation. Furthermore, during training all classification sample images can be resized to a uniform size, and the training set can then be expanded through common data augmentation such as rotation, translation, and noise addition, as in the sketch below.
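A minimal sketch of such an augmentation pass follows; the 224×224 target size, rotation range, translation range, and noise scale are illustrative assumptions.

```python
# Sketch: resize to a uniform size, then random rotation, translation, noise.
import cv2
import numpy as np

def augment(img, size=(224, 224)):
    img = cv2.resize(img, size)
    h, w = img.shape[:2]
    angle = np.random.uniform(-15, 15)                 # random rotation (degrees)
    tx, ty = np.random.uniform(-0.1, 0.1, 2) * (w, h)  # random translation
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    m[:, 2] += (tx, ty)                                # fold translation into the affine
    img = cv2.warpAffine(img, m, (w, h))
    noise = np.random.normal(0, 5, img.shape)          # additive Gaussian noise
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```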
The classification sample images labeled contains an object may specifically be images cropped from the sample fusion image according to the carried violent sorting track label. The cropping may be done as follows: from the original images (before fusion) corresponding to the sample fusion image, crop the image region corresponding to the violent sorting track label; this region contains the thrown goods or other objects. The classification sample images labeled contains no object can likewise be obtained by cropping the sample fusion image, except that only image regions not corresponding to the violent sorting track label are cropped.
In this embodiment, training of the scene classification model is achieved by acquiring classification sample images carrying class labels (glitched screen or black screen, contains an object, contains no object) and training the initial scene classification model on these images to obtain the trained scene classification model.
In one embodiment, determining the scene recognition result according to the scene classification result corresponding to the video stream comprises:
and when the scene classification result corresponding to the video stream is that an object is contained, updating the preset statistical value and returning to the step of acquiring the video stream corresponding to the scene to be identified, until the preset statistical value equals the preset number threshold, whereupon the scene recognition result is determined to be scene compliance.
Specifically, when the scene classification result corresponding to the video stream is that an object is contained, the scene to be identified may be a compliant scene. The server updates the preset statistical value, returns to the step of acquiring the video stream corresponding to the scene to be identified, and continues scene recognition until the preset statistical value equals the preset number threshold, whereupon the scene recognition result is determined to be scene compliance. If the preset statistical value remains below the preset number threshold within the preset statistical period, the server determines the scene recognition result to be scene non-compliance. The preset statistical value can be initialized as needed, for example to 0, and the preset statistical period can likewise be set as needed, for example to one day.
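The decision loop just described might be sketched as follows; the helper callables (get_video_stream, detect_track, classify, within_statistical_period) and the number threshold of 10 are illustrative assumptions.

```python
# Sketch of the compliance decision loop described above.
def recognize_scene(get_video_stream, detect_track, classify,
                    within_statistical_period, number_threshold=10):
    count = 0  # the preset statistical value, initialized to 0
    while within_statistical_period():  # e.g. within one day
        stream = get_video_stream()     # video stream of the scene to be identified
        track_result = detect_track(stream)
        if track_result and classify(stream, track_result) == "contains_object":
            count += 1
            if count >= number_threshold:
                return "scene_compliant"
    return "scene_non_compliant"        # threshold not reached within the period
```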
In this embodiment, by updating the preset statistical value and returning to the step of acquiring the video stream corresponding to the scene to be identified until the preset statistical value equals the preset number threshold, the scene recognition result is determined to be scene compliance, and accurate scene recognition is achieved.
In one embodiment, the scene recognition method further includes:
periodically extracting a video frame to be identified corresponding to a scene to be identified;
inputting the video frame to be recognized into the trained scene classification model to obtain a scene classification result corresponding to the video frame to be recognized;
and when the scene classification result corresponding to the video frame to be identified is a glitched screen or a black screen, determining the scene recognition result to be scene non-compliance.
Specifically, during scene recognition the server also periodically extracts a video frame to be recognized corresponding to the scene to be recognized and inputs it into the trained scene classification model to obtain the corresponding scene classification result. When that result is a glitched screen or a black screen, the server directly concludes that the scene recognition result is scene non-compliance. The extraction period can be set as needed.
In this embodiment, video frames to be recognized are periodically extracted for the scene to be recognized and input into the trained scene classification model to obtain their scene classification results; when a result is a glitched screen or a black screen, the scene recognition result is scene non-compliance. Scene recognition is thus also achieved from individual video frames.
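Putting the periodic check together, a sketch follows; the 60-second period and the helper callables (grab_frame, scene_classifier) are illustrative assumptions.

```python
# Sketch of the periodic camera health check described above.
import time

def periodic_screen_check(grab_frame, scene_classifier, period_s=60):
    while True:
        label = scene_classifier(grab_frame())  # classify one extracted frame
        if label in ("glitched_screen", "black_screen"):
            return "scene_non_compliant"        # camera view is unusable
        time.sleep(period_s)
```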
In one embodiment, as shown in fig. 3, the scene recognition method of the present application is illustrated by a flowchart.
The server acquires the video stream corresponding to the scene to be recognized. It obtains sample fusion images carrying violent sorting track labels and trains the initial target detection model on them to obtain the trained target detection model. It performs image fusion on the video frames of the video stream to obtain a fused image corresponding to the scene to be recognized, and inputs the fused image into the trained target detection model for track detection, obtaining the track detection result to be screened (corresponding to the track detection module in fig. 3). It screens the track detection boxes to be screened according to the preset detection-box threshold to obtain the candidate track detection boxes corresponding to the scene to be recognized, performs IOU threshold filtering on the candidate boxes according to their confidences to obtain the target track detection box, and obtains from it the track detection result corresponding to the scene to be recognized (corresponding to the post-processing module in fig. 3). It obtains classification sample images carrying class labels (glitched screen or black screen, contains an object, contains no object) and trains the initial scene classification model on them to obtain the trained scene classification model. When the track detection result indicates that a track exists (a track line is generated in fig. 3), the server randomly extracts video frames from the video stream and inputs them into the trained scene classification model, obtaining the scene classification result corresponding to the video stream (corresponding to the classification module in fig. 3). When that result is that an object is contained (an object such as an express item or parcel in fig. 3), the server updates the preset statistical value and returns to the step of acquiring the video stream corresponding to the scene to be recognized, until the preset statistical value equals the preset number threshold (a certain number in fig. 3), whereupon the scene recognition result is determined to be scene compliance. Meanwhile, the server periodically extracts video frames to be recognized corresponding to the scene to be recognized, inputs them into the trained scene classification model to obtain their scene classification results (corresponding to the classification module in fig. 3), and, when a result is a glitched screen or a black screen, concludes that the scene recognition result is scene non-compliance.
It should be understood that, although the steps in the flowcharts of fig. 1 and fig. 3 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 and fig. 3 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential: they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a scene recognition apparatus including: an acquisition module 402, a trajectory detection module 404, a classification module 406, and a processing module 408, wherein:
an obtaining module 402, configured to obtain a video stream corresponding to a scene to be identified;
a track detection module 404, configured to perform track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be identified;
a classification module 406, configured to randomly extract video frames from the video stream and input the video frames into a trained scene classification model when the track detection result indicates that a track exists, so as to obtain a scene classification result corresponding to the video stream;
the processing module 408 is configured to determine a scene recognition result according to a scene classification result corresponding to the video stream.
According to the above scene recognition apparatus, the trained target detection model performs track detection on the video stream corresponding to the scene to be recognized; on the basis that the track detection result indicates that a track exists, the trained scene classification model further classifies the video frames randomly extracted from the video stream. The scene classification result of the scene to be recognized is thus obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined according to the scene classification result, improving recognition efficiency.
In an embodiment, the scene recognition apparatus further includes a first model training module, where the first model training module is configured to acquire a sample fusion image carrying a violent sorting trajectory annotation, and train the initial target detection model according to the sample fusion image to obtain a trained target detection model.
In an embodiment, the track detection module is further configured to perform image fusion on the video frames in the video stream to obtain a fused image corresponding to the scene to be recognized, input the fused image into the trained target detection model for track detection to obtain a track detection result to be screened, and perform threshold screening on the track detection result to be screened to obtain the track detection result corresponding to the scene to be recognized.
In an embodiment, the track detection module is further configured to screen the track detection boxes to be screened in the track detection result to be screened according to a preset detection-box threshold to obtain candidate track detection boxes corresponding to the scene to be identified, perform IOU threshold filtering on the candidate boxes according to their confidences to obtain a target track detection box corresponding to the scene to be identified, and obtain the track detection result corresponding to the scene to be identified according to the target track detection box.
In one embodiment, the scene recognition apparatus further includes a second model training module configured to acquire classification sample images carrying class labels (glitched screen or black screen, contains an object, contains no object) and to train the initial scene classification model on these images to obtain the trained scene classification model.
In an embodiment, the processing module is further configured to update the preset statistical value when the scene classification result corresponding to the video stream is that an object is contained, and to return to the step of acquiring the video stream corresponding to the scene to be identified until the preset statistical value equals the preset number threshold, whereupon the scene recognition result is determined to be scene compliance.
In one embodiment, the scene recognition apparatus further includes a recognition module configured to periodically extract a video frame to be recognized corresponding to the scene to be recognized, input it into the trained scene classification model to obtain the corresponding scene classification result, and determine the scene recognition result to be scene non-compliance when that result is a glitched screen or a black screen.
For the specific definition of the scene recognition device, reference may be made to the above definition of the scene recognition method, which is not described herein again. The modules in the scene recognition device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing video streams, sample fusion images and classification sample images corresponding to scenes to be identified. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a scene recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into the trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
According to the above computer device, the trained target detection model performs track detection on the video stream corresponding to the scene to be recognized; on the basis that the track detection result indicates that a track exists, the trained scene classification model further classifies the video frames randomly extracted from the video stream. The scene classification result of the scene to be recognized is thus obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined according to the scene classification result, improving recognition efficiency.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a sample fusion image carrying violent sorting track labels; and training the initial target detection model according to the sample fusion image to obtain a trained target detection model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: carrying out image fusion on each video frame in the video stream to obtain a fusion image corresponding to a scene to be identified; inputting the fusion image into a trained target detection model for track detection to obtain a track detection result to be screened; and carrying out threshold value screening on the track detection result to be screened to obtain a track detection result corresponding to the scene to be identified.
In one embodiment, the processor, when executing the computer program, further performs the steps of: screening the track detection boxes to be screened in the track detection result to be screened according to a preset detection-box threshold to obtain candidate track detection boxes corresponding to the scene to be identified; performing IOU threshold filtering on the candidate track detection boxes according to their confidences to obtain a target track detection box corresponding to the scene to be identified; and obtaining the track detection result corresponding to the scene to be identified according to the target track detection box.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a classification sample image carrying a class label, wherein the class labels include glitched screen or black screen, contains an object, and contains no object; and training the initial scene classification model according to the classification sample image to obtain the trained scene classification model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the scene classification result corresponding to the video stream is that an object is contained, updating the preset statistical value and returning to the step of acquiring the video stream corresponding to the scene to be identified, until the preset statistical value equals the preset number threshold, whereupon the scene recognition result is determined to be scene compliance.
In one embodiment, the processor, when executing the computer program, further performs the steps of: periodically extracting a video frame to be identified corresponding to the scene to be identified; inputting the video frame to be recognized into the trained scene classification model to obtain a scene classification result corresponding to the video frame to be recognized; and when that scene classification result is a glitched screen or a black screen, determining the scene recognition result to be scene non-compliance.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to the trained target detection model to obtain a track detection result corresponding to the scene to be recognized;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting them into the trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
According to the above storage medium, the trained target detection model performs track detection on the video stream corresponding to the scene to be recognized; on the basis that the track detection result indicates that a track exists, the trained scene classification model further classifies the video frames randomly extracted from the video stream. The scene classification result of the scene to be recognized is thus obtained and accurate scene recognition achieved, so that the compliant scenes requiring violent sorting recognition can be determined according to the scene classification result, improving recognition efficiency.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a sample fusion image carrying violent sorting track labels; and training the initial target detection model according to the sample fusion image to obtain a trained target detection model.
In one embodiment, the computer program when executed by the processor further performs the steps of: carrying out image fusion on each video frame in the video stream to obtain a fusion image corresponding to a scene to be identified; inputting the fusion image into a trained target detection model for track detection to obtain a track detection result to be screened; and carrying out threshold value screening on the track detection result to be screened to obtain a track detection result corresponding to the scene to be identified.
In one embodiment, the computer program when executed by the processor further performs the steps of: screening the track detection boxes to be screened in the track detection result to be screened according to a preset detection-box threshold to obtain candidate track detection boxes corresponding to the scene to be identified; performing IOU threshold filtering on the candidate track detection boxes according to their confidences to obtain a target track detection box corresponding to the scene to be identified; and obtaining the track detection result corresponding to the scene to be identified according to the target track detection box.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a classification sample image carrying a class label, wherein the class labels include glitched screen or black screen, contains an object, and contains no object; and training the initial scene classification model according to the classification sample image to obtain the trained scene classification model.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the scene classification result corresponding to the video stream is that an object is contained, updating the preset statistical value and returning to the step of acquiring the video stream corresponding to the scene to be identified, until the preset statistical value equals the preset number threshold, whereupon the scene recognition result is determined to be scene compliance.
In one embodiment, the computer program when executed by the processor further performs the steps of: periodically extracting a video frame to be identified corresponding to the scene to be identified; inputting the video frame to be recognized into the trained scene classification model to obtain a scene classification result corresponding to the video frame to be recognized; and when that scene classification result is a glitched screen or a black screen, determining the scene recognition result to be scene non-compliance.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for scene recognition, the method comprising:
acquiring a video stream corresponding to a scene to be identified;
performing track detection on the video stream according to a trained target detection model to obtain a track detection result corresponding to the scene to be identified;
when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting the video frames into a trained scene classification model to obtain a scene classification result corresponding to the video stream;
and determining a scene recognition result according to the scene classification result corresponding to the video stream.
2. The method according to claim 1, wherein before the performing track detection on the video stream according to the trained target detection model to obtain the track detection result corresponding to the scene to be identified, the method further comprises:
acquiring a sample fusion image carrying violent sorting track labels;
and training an initial target detection model on the sample fusion image to obtain the trained target detection model.
3. The method according to claim 1, wherein the performing track detection on the video stream according to the trained target detection model to obtain the track detection result corresponding to the scene to be identified comprises:
performing image fusion on the video frames in the video stream to obtain a fusion image corresponding to the scene to be identified;
inputting the fusion image into the trained target detection model for track detection to obtain a track detection result to be screened;
and performing threshold screening on the track detection result to be screened to obtain the track detection result corresponding to the scene to be identified.
4. The method according to claim 3, wherein the performing threshold screening on the track detection result to be screened to obtain the track detection result corresponding to the scene to be identified comprises:
screening the track detection boxes in the track detection result to be screened according to a preset detection box threshold to obtain candidate track detection boxes corresponding to the scene to be identified;
performing IOU threshold screening on the candidate track detection boxes according to their confidence to obtain target track detection boxes corresponding to the scene to be identified;
and obtaining the track detection result corresponding to the scene to be identified from the target track detection boxes.
5. The method according to claim 1, wherein before the randomly extracting video frames from the video stream and inputting the video frames into the trained scene classification model to obtain the scene classification result corresponding to the video stream, the method further comprises:
acquiring classification sample images carrying class labels, wherein the class labels comprise glitched or black screen, contains an object, and contains no object;
and training an initial scene classification model on the classification sample images to obtain the trained scene classification model.
6. The method according to claim 1, wherein the determining the scene recognition result according to the scene classification result corresponding to the video stream comprises:
when the scene classification result corresponding to the video stream is that an object is contained, updating a preset statistic value and returning to the step of acquiring the video stream corresponding to the scene to be identified, until the preset statistic value equals a preset count threshold, whereupon the scene recognition result is determined to be scene compliance.
7. The method of claim 1, further comprising:
periodically extracting a video frame to be identified corresponding to the scene to be identified;
inputting the video frame to be identified into the trained scene classification model to obtain a scene classification result corresponding to the video frame to be identified;
and when the scene classification result corresponding to the video frame to be identified is glitched or black screen, determining that the scene recognition result is scene non-compliance.
8. A scene recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a video stream corresponding to a scene to be identified;
the track detection module is used for performing track detection on the video stream according to a trained target detection model to obtain a track detection result corresponding to the scene to be identified;
the classification module is used for, when the track detection result indicates that a track exists, randomly extracting video frames from the video stream and inputting the video frames into a trained scene classification model to obtain a scene classification result corresponding to the video stream;
and the processing module is used for determining a scene recognition result according to the scene classification result corresponding to the video stream.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011094088.XA 2020-10-14 2020-10-14 Scene recognition method and device, computer equipment and storage medium Pending CN114429612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011094088.XA CN114429612A (en) 2020-10-14 2020-10-14 Scene recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114429612A (en) 2022-05-03

Family

ID=81308829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011094088.XA Pending CN114429612A (en) 2020-10-14 2020-10-14 Scene recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114429612A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601686A (en) * 2022-12-09 2023-01-13 浙江莲荷科技有限公司 Method, device and system for confirming delivery of articles
CN115601686B (en) * 2022-12-09 2023-04-11 浙江莲荷科技有限公司 Method, device and system for confirming delivery of articles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination