CN113177529A - Method, device and equipment for identifying screen splash and storage medium - Google Patents

Method, device and equipment for identifying screen splash and storage medium

Info

Publication number
CN113177529A
CN113177529A CN202110587106.6A CN202110587106A CN113177529A
Authority
CN
China
Prior art keywords
screen
feature data
feature
feature extraction
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110587106.6A
Other languages
Chinese (zh)
Other versions
CN113177529B (en)
Inventor
黄飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110587106.6A priority Critical patent/CN113177529B/en
Publication of CN113177529A publication Critical patent/CN113177529A/en
Application granted granted Critical
Publication of CN113177529B publication Critical patent/CN113177529B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for identifying screen splash, and belongs to the technical field of the Internet. The method comprises the following steps: receiving a video stream sent by a video sending end; inputting video frames of the video stream into a trained screen-splash recognition model, so that a plurality of feature extraction modules arranged in sequence in the screen-splash recognition model process the video frames serially to obtain first feature data respectively output by the plurality of feature extraction modules; determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules; inputting the target fusion feature data into a classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash; and if the recognition result indicates that the target video frame has screen splash, sending a screen splash notification to the video sending end. The method and the device can detect, through the screen-splash recognition model, whether screen splash is present in the video frames sent by the video sending end.

Description

Method, device and equipment for identifying screen splash and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying screen splash.
Background
With the development of internet technology, real-time video transmission functions are increasingly used in everyday life, for example in video chat, video conferencing and live video streaming.
Implementing real-time video transmission requires a video sending end, a server and a video receiving end. The video sending end captures video with a camera and sends the captured video to the server as a data stream; after receiving the video stream sent by the video sending end, the server forwards it to the corresponding video receiving end, so that the video receiving end can play the received video stream.
In the course of implementing the present application, the inventors found that the related art has at least the following problems:
affected by factors such as the performance of the video sending end and the network environment, the video received by the server from the video sending end may contain screen splash, that is, some video frames may show mosaics, incomplete pictures (for example, all or part of the picture being rendered red or green) and similar defects, which degrades the viewing experience of the user at the video receiving end. A technique that can detect whether screen splash exists in a video frame is therefore needed.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for identifying a screen splash, which can identify video frames with screen splash in a video stream. The technical scheme is as follows:
in one aspect, a method of identifying screen splash is provided, the method comprising:
receiving a video stream sent by a video sending end;
inputting video frames of the video stream into a trained screen-splash recognition model, so that a plurality of feature extraction modules arranged in sequence in the screen-splash recognition model perform serial processing on the video frames to obtain first feature data output by the plurality of feature extraction modules respectively;
determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules;
inputting the target fusion feature data into a classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash;
and if the recognition result indicates that the target video frame has screen splash, sending a screen splash notification to the video sending end.
Optionally, the determining target fusion feature data based on the first feature data output by the feature extraction modules respectively includes:
and inputting the first feature data output by each pre-specified target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data.
Optionally, the inputting the first feature data output by each pre-specified target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data includes:
performing feature mapping on the first feature data output by each pre-specified target feature extraction module to obtain multiple groups of second feature data with the same dimension;
and performing fusion processing on the multiple groups of second feature data with the same dimension to obtain target fusion feature data.
Optionally, the pre-specified target feature extraction modules include: a first feature extraction module arranged at the last position, a second feature extraction module arranged at the second last position and a third feature extraction module arranged at the third last position in the plurality of feature extraction modules which are sequentially arranged;
the fusion processing of the multiple groups of second feature data with the same dimension to obtain target fusion feature data includes:
performing feature fusion on second feature data corresponding to the third feature extraction module and second feature data corresponding to the first feature extraction module to obtain first fusion feature data;
performing feature fusion on second feature data corresponding to a second feature extraction module and second feature data corresponding to the first feature extraction module to obtain second fusion feature data;
and performing feature splicing on the first fusion feature data and the second fusion feature data to obtain target fusion feature data.
Optionally, before inputting the video frames of the video stream into the trained screen-splash recognition model, the method further includes:
acquiring a plurality of sample video frame sets;
training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining the recognition accuracy corresponding to the screen-splash recognition model after each training;
and when the recognition accuracy corresponding to the screen-splash recognition model reaches a preset accuracy threshold, stopping training the screen-splash recognition model to obtain the trained screen-splash recognition model.
Optionally, the sample video frame set includes a training subset and a verification subset; the training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining the recognition accuracy corresponding to the screen-splash recognition model after each training, includes:
for each sample video frame set, respectively inputting each first sample video frame included in the training subset corresponding to the sample video frame set into a to-be-trained screen-splash recognition model to obtain a recognition result corresponding to each first sample video frame, and determining a training indication value corresponding to the training subset based on the recognition result corresponding to each first sample video frame and a preset training function; training the screen-splash recognition model based on the training indication value to obtain a trained screen-splash recognition model;
and respectively inputting each second sample video frame included in the verification subset corresponding to the sample video frame set into the trained screen-splash recognition model to obtain a recognition result corresponding to each second sample video frame, and determining the recognition accuracy corresponding to the trained screen-splash recognition model based on the recognition result corresponding to each second sample video frame.
Optionally, after obtaining the trained screen-splash recognition model, the method further includes:
and performing model compression on the trained screen-splash recognition model to obtain a compressed screen-splash recognition model.
In another aspect, there is provided an apparatus for identifying screen splash, the apparatus including:
the receiving unit is used for receiving the video stream sent by the video sending end;
the processing unit is used for inputting the video frames of the video stream into the trained screen-splash recognition model, so that the video frames are processed serially by the plurality of feature extraction modules sequentially arranged in the screen-splash recognition model to obtain first feature data respectively output by the plurality of feature extraction modules; determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules; and inputting the target fusion feature data into a classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash;
and the sending unit is used for sending a screen splash notification to the video sending end if the recognition result indicates that the target video frame has screen splash.
Optionally, the processing unit is configured to:
and inputting the first feature data output by each pre-specified target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data.
Optionally, the processing unit is configured to:
performing feature mapping on the first feature data output by each pre-specified target feature extraction module to obtain multiple groups of second feature data with the same dimension;
and performing fusion processing on the multiple groups of second feature data with the same dimension to obtain target fusion feature data.
Optionally, the pre-specified target feature extraction modules include: a first feature extraction module arranged at the last position, a second feature extraction module arranged at the second last position and a third feature extraction module arranged at the third last position in the plurality of feature extraction modules which are sequentially arranged;
optionally, the processing unit is configured to:
performing feature fusion on second feature data corresponding to the third feature extraction module and second feature data corresponding to the first feature extraction module to obtain first fusion feature data;
performing feature fusion on second feature data corresponding to a second feature extraction module and second feature data corresponding to the first feature extraction module to obtain second fusion feature data;
and performing feature splicing on the first fusion feature data and the second fusion feature data to obtain target fusion feature data.
Optionally, the apparatus further comprises a training unit, configured to:
acquiring a plurality of sample video frame sets;
training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining the recognition accuracy corresponding to the screen-splash recognition model after each training;
and when the recognition accuracy corresponding to the screen-splash recognition model reaches a preset accuracy threshold, stopping training the screen-splash recognition model to obtain the trained screen-splash recognition model.
Optionally, the sample video frame set includes a training subset and a verification subset; the training unit is configured to:
for each sample video frame set, respectively inputting each first sample video frame included in the training subset corresponding to the sample video frame set into a to-be-trained screen-splash recognition model to obtain a recognition result corresponding to each first sample video frame, and determining a training indication value corresponding to the training subset based on the recognition result corresponding to each first sample video frame and a preset training function; training the screen-splash recognition model based on the training indication value to obtain a trained screen-splash recognition model;
and respectively inputting each second sample video frame included in the verification subset corresponding to the sample video frame set into the trained screen-splash recognition model to obtain a recognition result corresponding to each second sample video frame, and determining the recognition accuracy corresponding to the trained screen-splash recognition model based on the recognition result corresponding to each second sample video frame.
Optionally, the apparatus further comprises a compressing unit, configured to:
and performing model compression on the trained screen-splash recognition model to obtain a compressed screen-splash recognition model.
In yet another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored, and the at least one instruction is loaded and executed by the processor to implement the operations performed by the method for identifying screen splash described above.
In yet another aspect, a computer-readable storage medium is provided, where at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the method for identifying screen splash described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the video frames sent by the video sending end are recognized by the trained screen-splash recognition model, so that whether screen splash exists in the video frames sent by the video sending end can be determined; if a screen splash problem exists, a screen splash notification can be sent to the corresponding video sending end, reminding the user of the video sending end to adjust it and thereby improving the quality of the transmitted video frames. In this way, screen splash in video frames can be identified through the screen-splash recognition model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for screen splash identification provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a screen-splash recognition model provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for screen splash identification provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for screen splash identification provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the method for identifying screen splash provided by the present application may be implemented by a server. The server may be a single server or a server group, and is configured to receive the video stream sent by the video sending end and forward it to the video receiving end when implementing real-time video transmission.
For example, when the real-time video transmission function is applied to live video, the video sending end can be the mobile phone used by an anchor for live broadcast. The anchor's phone captures the anchor through its camera and sends the recorded live video to a server as a video stream; the server receives the live video stream from the anchor's phone and can forward it to the terminals watching the anchor's live broadcast.
According to the method for identifying screen splash provided by the present application, after receiving the video stream sent by the video sending end, the server can obtain the video frames in the video stream and then perform screen splash recognition on the obtained video frames, so as to determine whether the video stream sent by the video sending end contains video frames with screen splash.
Fig. 2 is a flowchart of a method for identifying a screen splash according to an embodiment of the present disclosure. Referring to fig. 2, the embodiment includes:
step 201, receiving a video stream sent by a video sending end.
In implementation, the server may receive the video stream sent by the video sending end and forward the received video stream to the video receiving end. For example, in a live-streaming scenario, when an anchor is live, the anchor's mobile phone (the video sending end) captures the live video and sends it to the server as a video stream, and the server forwards the corresponding video stream to the terminals (the video receiving ends) used by viewers in the anchor's live room. While forwarding the stream, the server can also pull the stream and cache the received video stream, for example caching the video stream corresponding to each anchor periodically. A minimal sketch of periodically sampling frames from such a cached or pulled stream is given below.
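The following is a minimal, hedged sketch of pulling a stream and sampling frames periodically for recognition; the stream URL, the sampling interval and the use of OpenCV are illustrative assumptions, not details from the patent.

```python
import cv2

def sample_frames(stream_url, every_n_frames=30):
    capture = cv2.VideoCapture(stream_url)   # pull the (cached) video stream
    sampled, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:      # keep one frame per period
            sampled.append(frame)
        index += 1
    capture.release()
    return sampled
```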
Step 202, inputting video frames of the video stream into the trained screen-splash recognition model, so that the plurality of feature extraction modules sequentially arranged in the screen-splash recognition model perform serial processing on the video frames to obtain first feature data respectively output by the plurality of feature extraction modules.
The method and the device can be applied to real-time video transmission, which places high requirements on how quickly the screen-splash recognition model produces a recognition result for a video frame. ResNet-18 (a neural network structure) has a relatively small number of network parameters, and its residual connections make the forward pass fast. Therefore, in the present application, the screen-splash recognition model can be trained on the basis of the ResNet-18 neural network structure; the training process is not detailed here. It should also be noted that the ResNet-18 neural network structure is only one optional neural network structure in the present application and serves as an exemplary illustration; a skilled person may also implement the screen-splash recognition model with other types of neural network structures, and the neural network structure implementing the screen-splash recognition model is not limited here. In addition, if the method is applied to a live-streaming scenario, many live video streams may need to be recognized, and the video frames in the live video streams corresponding to different anchors can be recognized in batches according to a preset period.
Before the video frames are input into the trained screen-splash recognition model, they can also be preprocessed, that is, the resolution, size and the like of the video frames are adjusted to image parameters that the screen-splash recognition model can accept; the specific resolution and size of the video frames can be set by technicians according to the model used, and are not limited here.
The screen-splash recognition model can comprise a feature extraction layer, a feature mapping layer, a feature intersection layer, a feature aggregation layer, a fully connected layer and the like. The feature extraction layer can comprise a plurality of sequentially arranged feature extraction modules, the feature intersection layer comprises a fusion module, and the fully connected layer comprises a classification module. For any feature extraction module other than the first, its input feature data is the feature data output by the preceding adjacent feature extraction module. After a video frame is input into the screen-splash recognition model, it is processed serially by the sequentially arranged feature extraction modules to obtain the first feature data respectively output by the plurality of feature extraction modules.
In implementation, after a video frame is obtained, it may be input into the first of the sequentially arranged feature extraction modules in the feature extraction layer for feature extraction; the feature data extracted by that first module is then input into the second feature extraction module for further extraction, and so on, until the last feature extraction module outputs the feature data corresponding to the video frame. Fig. 3 illustrates the process of inputting a video frame into the screen-splash recognition model. The screen-splash recognition model in fig. 3 can be trained on the basis of the ResNet-18 neural network structure, which can include 15 convolutional layers, each of which can be regarded as a feature extraction module. After this serial processing, the first feature data correspondingly output by each feature extraction module is obtained, as sketched below.
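The following is a minimal sketch, not taken from the patent, of the serial feature extraction described above, assuming a PyTorch implementation built on torchvision's ResNet-18. For brevity, the stem and the four residual stages are grouped into five feature extraction modules; this coarser grouping, the module names and the input size are illustrative assumptions rather than the per-convolutional-layer granularity stated in the text.

```python
import torch
import torchvision

class SerialFeatureExtractor(torch.nn.Module):
    """Sequentially arranged feature extraction modules built from ResNet-18."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18()
        # Assumed module granularity: the stem plus the four residual stages.
        self.stages = torch.nn.ModuleList([
            torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                backbone.relu, backbone.maxpool),
            backbone.layer1, backbone.layer2,
            backbone.layer3, backbone.layer4,
        ])

    def forward(self, frame):
        # Serial processing: each module consumes the previous module's output,
        # and every intermediate output is kept as "first feature data".
        first_feature_data = []
        x = frame
        for stage in self.stages:
            x = stage(x)
            first_feature_data.append(x)
        return first_feature_data

# Usage with a preprocessed frame batch of shape (N, 3, 224, 224).
features = SerialFeatureExtractor()(torch.randn(1, 3, 224, 224))
```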
Step 203, determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules.
In implementation, the first feature data output by each of the plurality of feature extraction modules may be obtained and subjected to feature fusion to obtain the target fusion feature data; for example, the first feature data may be mapped to features with the same channel dimension, and the features with the same channel dimension may then be concatenated to obtain the target fusion feature data.
Optionally, the first feature data output by each target feature extraction module pre-specified among the plurality of feature extraction modules may be input into the fusion module of the screen-splash recognition model for fusion processing to obtain the target fusion feature data.
The target feature extraction modules can be specified in advance by technicians and can be any of the plurality of feature extraction modules. For each pre-specified target feature extraction module, after it outputs the first feature data it has extracted, that first feature data is passed to the next feature extraction module and is also fed into the fusion module for fusion processing. The corresponding processing may be as follows:
performing feature mapping on first feature data output by each pre-specified target feature extraction module to obtain multiple groups of second feature data with the same dimension; and carrying out fusion processing on the multiple groups of second feature data with the same dimensionality to obtain target fusion feature data.
In implementation, because different feature extraction modules differ in how they convolve their input feature data, the dimensions (i.e., channel dimensions) of the feature data they output may differ. Therefore, before the first feature data output by each target feature extraction module is fused, each set of first feature data may be adjusted to the same channel dimension. For example, the multiple sets of first feature data may be feature-mapped along the channel dimension using a 1 × 1 convolution kernel, so that each set of first feature data is mapped to a higher, common channel dimension, yielding multiple sets of feature data with the same channel dimension; the feature data obtained by mapping the first feature data may be referred to as second feature data. After the second feature data with the same channel dimension are obtained, the multiple sets of second feature data can be fused, yielding the fused feature data, namely the target fusion feature data. A small sketch of this mapping step follows.
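A small sketch of the feature mapping step, assuming PyTorch; the channel counts (512 in, 1024 out) are illustrative assumptions about the target modules, not values given in the patent.

```python
import torch

class FeatureMapping(torch.nn.Module):
    """Project each set of first feature data to one common channel dimension."""
    def __init__(self, in_channels=(512, 512, 512), out_channels=1024):
        super().__init__()
        # One 1x1 convolution per pre-specified target feature extraction module.
        self.projections = torch.nn.ModuleList(
            [torch.nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, first_feature_list):
        # Each output is "second feature data" with the same channel dimension.
        return [proj(f) for proj, f in zip(self.projections, first_feature_list)]
```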
In the present application, the pre-specified target feature extraction modules may include the first feature extraction module ranked last, the second feature extraction module ranked second to last, and the third feature extraction module ranked third to last among the sequentially arranged feature extraction modules. In that case, fusing the corresponding sets of second feature data may include:
performing feature fusion on the second feature data corresponding to the third feature extraction module and the second feature data corresponding to the first feature extraction module to obtain feature data after feature fusion (which may also be called first fusion feature data); performing feature fusion on the second feature data corresponding to the second feature extraction module and the second feature data corresponding to the first feature extraction module to obtain feature data after feature fusion (which may also be called second fusion feature data); and performing feature splicing on the first fusion feature data and the second fusion feature data to obtain the fused feature data (which may also be called target fusion feature data).
The second feature data corresponding to the third feature extraction module is feature data obtained by performing feature mapping on the first feature data output by the third feature extraction module; similarly, the second feature data corresponding to the first feature extraction module is feature data obtained by performing feature mapping on the first feature data output by the first feature extraction module; the second feature data corresponding to the second feature extraction module is feature data obtained by performing feature mapping on the first feature data output by the second feature extraction module.
In the screen-splash recognition model provided in the present application, there are a plurality of sequentially arranged feature extraction modules, and the input of each feature extraction module except the first may be the output of the previous feature extraction module. Consequently, the feature data extracted by modules earlier in the sequence are shallow features such as color and edge features, while the feature data extracted by modules later in the sequence are deeper, more abstract features. Therefore, in the present application, the feature data output by several feature extraction modules can be fused to obtain feature data that carries both shallow and deep features, and the fused feature data can then be used to classify whether the video frame has screen splash.
In the present application, the last three feature extraction modules in the sequence may be taken as the pre-specified target feature extraction modules, namely the first feature extraction module, the second feature extraction module and the third feature extraction module in fig. 3.
After the first feature data output by the first, second and third feature extraction modules are obtained, feature mapping may be performed on the three sets of first feature data to map them to second feature data with the same dimension; in fig. 3, the three lines output by the feature mapping layer represent these three sets of second feature data. The three mapped sets of second feature data are then input into the fusion module for fusion processing, which proceeds as follows:
feature fusion is performed on the second feature data corresponding to the third feature extraction module and the second feature data corresponding to the first feature extraction module; for example, in the feature intersection layer shown in fig. 3, the two sets of second feature data may be crossed by bilinear pooling to obtain the first fusion feature data. Similarly, the second feature data corresponding to the second feature extraction module and the second feature data corresponding to the first feature extraction module may be fused to obtain the second fusion feature data. Feature splicing is then performed on the first fusion feature data and the second fusion feature data: for example, global mean pooling is applied to each of them, and the resulting feature vectors are concatenated to obtain the fused feature data, namely the target fusion feature data. A sketch of this fusion, under the stated assumptions, follows.
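A hedged sketch of the fusion module under one possible reading of the description above: per-location outer products (bilinear pooling) averaged over spatial positions serve as the feature crossing plus global mean pooling, and the two resulting vectors are spliced. This is an illustrative interpretation, not the patent's exact implementation.

```python
import torch

def bilinear_pool(a, b):
    # a, b: second feature data of shape (N, C, H, W) with identical shapes.
    n, c, h, w = a.shape
    a = a.flatten(2)                                   # (N, C, H*W)
    b = b.flatten(2)
    # Outer product accumulated over spatial positions; dividing by H*W plays
    # the role of the global mean pooling described in the text.
    fused = torch.bmm(a, b.transpose(1, 2)) / (h * w)  # (N, C, C)
    return fused.flatten(1)                            # (N, C*C) feature vector

def fuse(second_third, second_second, second_first):
    first_fusion = bilinear_pool(second_third, second_first)
    second_fusion = bilinear_pool(second_second, second_first)
    # Feature splicing: concatenate into the target fusion feature data.
    return torch.cat([first_fusion, second_fusion], dim=1)
```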
In the application, selecting only some of the feature extraction modules as the pre-specified target feature extraction modules reduces the amount of data processed during feature fusion and thus speeds up the screen-splash recognition model in producing a recognition result. Selecting several consecutive modules at the end of the sequence as target feature extraction modules lets the resulting target fusion feature data carry both shallow and deep features of the video frame, which improves the accuracy of the recognition result to a certain extent. In addition, the feature data output by the last three feature extraction modules in the ResNet-18 neural network structure have the same length and width, so feature fusion only requires mapping them to the same channel dimension; this reduces the processing of the feature data before fusion and further speeds up the screen-splash recognition model in producing a recognition result.
Step 204, inputting the target fusion feature data into the classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash.
In implementation, after the fusion module has fused the input feature data, the fused target fusion feature data may be input into the classification module of the fully connected layer, where the classification module may be implemented based on a softmax function. After the fused target fusion feature data are input into the classification module, it computes over the input feature data and outputs the recognition result indicating whether the video frame has screen splash. For example, 1 may be output if the corresponding video frame has screen splash, and 0 may be output if it does not, as in the sketch below.
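A minimal sketch of the classification module, assuming a single fully connected layer followed by softmax; the interpretation of class index 1 as "screen splash present" is an assumption.

```python
import torch

class SplashClassifier(torch.nn.Module):
    def __init__(self, fused_dim, num_classes=2):
        super().__init__()
        self.fc = torch.nn.Linear(fused_dim, num_classes)

    def forward(self, target_fusion_features):
        logits = self.fc(target_fusion_features)
        probs = torch.softmax(logits, dim=1)
        # 1 = screen splash present, 0 = no screen splash (assumed label order).
        return probs.argmax(dim=1)
```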
Step 205, if the recognition result indicates that the target video frame has screen splash, sending a screen splash notification to the video sending end.
In implementation, if the recognition result of the target video frame indicates that it has screen splash, the video sending end corresponding to the target video frame can be determined, and a screen splash notification is then sent to that video sending end, reminding its user that the video frames currently being sent have a screen splash problem. The user of the video sending end can then check for problems such as network issues at the sending end, improving the quality of the transmitted video frames.
According to the embodiment of the application, the video frames sent by the video sending end are recognized by the trained screen-splash recognition model, so that whether screen splash exists in the video frames sent by the video sending end can be determined. If a screen splash problem exists, a screen splash notification can be sent to the corresponding video sending end, reminding the user of the video sending end to adjust it and thereby improving the quality of the transmitted video frames. In this way, screen splash in video frames can be identified through the screen-splash recognition model.
As shown in fig. 4, the present application further provides a method for training the screen-splash recognition model. Referring to fig. 4, the method includes:
step 401, a plurality of sample video frame sets are obtained.
Each sample video frame set can comprise sample video frames with screen splash and sample video frames without screen splash. A sample video frame containing screen splash may be referred to as a positive sample video frame, a sample video frame not containing screen splash as a negative sample video frame, and both may be referred to collectively as sample video frames.
Taking a live video scenario as an example, the corresponding sample video frames may be obtained from the live videos of each anchor; for example, live videos of some anchors in different time periods and different scenes may be obtained at random, and a certain number of video frames are then randomly extracted from each live video to serve as sample video frames. After the sample video frames are obtained, they can be annotated by a technician to determine the positive and negative sample video frames.
Since video frames with screen splash occur relatively rarely in practice, after the positive and negative sample video frames are obtained, part of the negative sample video frames can be discarded to increase the proportion of positive sample video frames among all sample video frames. Because different sample video frames differ in image size, resolution and the like, before the screen-splash recognition model is trained on the sample video frames, an edge padding operation can be applied to each sample video frame and bilinear interpolation scaling performed to a specified size, so that all sample video frames are adjusted to the same size and resolution, as sketched below.
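A minimal sketch of the sample preprocessing, assuming PyTorch tensors: pad each frame to a square (edge shape complementing) and rescale with bilinear interpolation; the 224 x 224 target size is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def preprocess(frame, size=224):
    # frame: (C, H, W) float tensor.
    _, h, w = frame.shape
    side = max(h, w)
    # Edge padding on the right/bottom so the frame becomes square.
    frame = F.pad(frame, (0, side - w, 0, side - h))
    # Bilinear interpolation scaling to the specified size.
    frame = F.interpolate(frame.unsqueeze(0), size=(size, size),
                          mode="bilinear", align_corners=False)
    return frame.squeeze(0)
```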
After the plurality of sample video frames are obtained through the above processing, the plurality of sample video frames may be divided to obtain a plurality of sample video frame sets, where each sample video frame set may include a positive sample video frame and a negative sample video frame.
Step 402, training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining the recognition accuracy corresponding to the screen-splash recognition model after each training.
After the plurality of sample video frame sets are obtained, the screen-splash recognition model can be trained on them. Taking one sample video frame set as an example, each sample video frame in the set may be input into the to-be-trained screen-splash recognition model (the screen-splash recognition model being trained may also be referred to as the to-be-trained screen-splash recognition model) to obtain the recognition result it outputs for each sample video frame. A loss value is then determined from a preset loss function and the recognition results, and the screen-splash recognition model is trained according to the loss value. Each time the screen-splash recognition model has been trained on a sample video frame set, its recognition accuracy can be verified to obtain the recognition accuracy of the model.
The sample video frame set can be divided into a training subset and a verification subset according to a preset ratio. Training the screen-splash recognition model based on the sample video frames in each sample video frame set and determining the recognition accuracy corresponding to the screen-splash recognition model after each training can proceed as follows:
the training subset corresponding to the sample video frame set can be determined, and each first sample video frame in the training subset is then input into the to-be-trained screen-splash recognition model to obtain the recognition result corresponding to each first sample video frame. Then, based on the recognition result corresponding to each first sample video frame and a preset training function (such as a cross-entropy loss function), a training indication value (loss value) corresponding to the training subset is determined, and the parameters of the screen-splash recognition model are adjusted according to that training indication value to obtain a trained screen-splash recognition model; a minimal sketch of one such training pass is shown below.
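A minimal sketch of one training pass over a training subset, assuming the model outputs class logits and using cross-entropy as the preset training function; the optimizer choice and batching are illustrative.

```python
import torch

def train_on_subset(model, optimizer, frames, labels):
    model.train()
    logits = model(frames)                                     # recognition results
    loss = torch.nn.functional.cross_entropy(logits, labels)   # training indication value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```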
Each time the screen-splash recognition model has been trained on a training subset, the recognition accuracy of the trained model can be verified on the verification subset. Specifically, each second sample video frame in the verification subset corresponding to the sample video frame set may be input into the trained screen-splash recognition model to obtain the recognition result corresponding to each second sample video frame, and the recognition accuracy of the trained screen-splash recognition model is then computed from those recognition results and whether each second sample video frame actually contains screen splash, as sketched below.
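A corresponding sketch of computing the recognition accuracy on a verification subset, under the same assumptions as the training sketch above.

```python
import torch

@torch.no_grad()
def validate(model, frames, labels):
    model.eval()
    predictions = model(frames).argmax(dim=1)
    # Fraction of second sample video frames recognized correctly.
    return (predictions == labels).float().mean().item()
```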
Step 403, when the recognition accuracy corresponding to the screen-splash recognition model reaches a preset recognition accuracy threshold, stopping training the screen-splash recognition model to obtain the trained screen-splash recognition model.
In the training process, each time the recognition accuracy corresponding to the screen-splash recognition model is obtained, it can be checked whether the recognition accuracy of the model currently being trained exceeds a preset recognition accuracy threshold; when it does, training of the screen-splash recognition model can be stopped, giving the trained screen-splash recognition model. In addition, a sample video frame test set can be provided: during training, when the recognition accuracy corresponding to the screen-splash recognition model is determined to exceed the recognition accuracy threshold, the recognition accuracy can be tested again on the sample video frames in the test set, and if the newly obtained recognition accuracy still exceeds the threshold, the screen-splash recognition model can be considered trained. A minimal sketch of this stop condition follows.
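A minimal sketch of the stop condition, reusing the train_on_subset and validate sketches above; the threshold value and the layout of frame_sets (one tuple of training and verification tensors per sample video frame set) are illustrative assumptions.

```python
def fit(model, optimizer, frame_sets, threshold=0.95):
    # frame_sets: iterable of (train_frames, train_labels, val_frames, val_labels).
    for train_x, train_y, val_x, val_y in frame_sets:
        train_on_subset(model, optimizer, train_x, train_y)
        accuracy = validate(model, val_x, val_y)
        if accuracy >= threshold:
            break  # recognition accuracy threshold reached: stop training
    return model
```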
In addition, after the screen-splash recognition model has been trained, model compression can be applied to the trained model to obtain a compressed screen-splash recognition model, which reduces the computation required in practical applications and improves the efficiency of recognizing video frames. For example, the screen-splash recognition model may be compressed by methods such as knowledge distillation and model pruning; one hedged example of a pruning step is sketched below.
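As one hedged example of a compression step, the sketch below applies L1 magnitude pruning to the convolution weights using torch.nn.utils.prune; the pruning ratio is an illustrative assumption, and knowledge distillation would be an alternative the text also mentions.

```python
import torch
import torch.nn.utils.prune as prune

def prune_convolutions(model, amount=0.3):
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            # Zero out the smallest-magnitude weights in each conv layer.
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    return model
```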
According to the embodiment of the application, the video frames sent by the video sending end are recognized by the trained screen-splash recognition model, so that whether screen splash exists in the video frames sent by the video sending end can be determined. If a screen splash problem exists, a screen splash notification can be sent to the corresponding video sending end, reminding the user of the video sending end to adjust it and thereby improving the quality of the transmitted video frames. In this way, screen splash in video frames can be identified through the screen-splash recognition model.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 5 shows a device for identifying screen splash provided in an embodiment of the present application, which may be the server in the above embodiment. Referring to fig. 5, the device includes:
a receiving unit 510, configured to receive a video stream sent by a video sending end;
a processing unit 520, configured to input video frames of the video stream into the trained screen-splash recognition model, so that the plurality of feature extraction modules sequentially arranged in the screen-splash recognition model perform serial processing on the video frames to obtain first feature data output by each of the plurality of feature extraction modules; determine target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules; and input the target fusion feature data into the classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash;
a sending unit 530, configured to send a screen splash notification to the video sending end if the recognition result indicates that the target video frame has screen splash.
Optionally, the processing unit 520 is configured to:
and inputting the first feature data output by each pre-specified target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data.
Optionally, the processing unit 520 is configured to:
performing feature mapping on the first feature data output by each pre-specified target feature extraction module to obtain multiple groups of second feature data with the same dimension;
and performing fusion processing on the multiple groups of second feature data with the same dimension to obtain target fusion feature data.
Optionally, the pre-specified target feature extraction modules include: a first feature extraction module arranged at the last position, a second feature extraction module arranged at the second last position and a third feature extraction module arranged at the third last position in the plurality of feature extraction modules which are sequentially arranged;
optionally, the processing unit 520 is configured to:
performing feature fusion on second feature data corresponding to the third feature extraction module and second feature data corresponding to the first feature extraction module to obtain first fusion feature data;
performing feature fusion on second feature data corresponding to a second feature extraction module and second feature data corresponding to the first feature extraction module to obtain second fusion feature data;
and performing feature splicing on the first fusion feature data and the second fusion feature data to obtain target fusion feature data.
Optionally, the apparatus further comprises a training unit, configured to:
acquiring a plurality of sample video frame sets;
training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining the recognition accuracy corresponding to the screen-splash recognition model after each training;
and when the recognition accuracy corresponding to the screen-splash recognition model reaches a preset accuracy threshold, stopping training the screen-splash recognition model to obtain the trained screen-splash recognition model.
Optionally, the sample video frame set includes a training subset and a verification subset; the training unit is configured to:
for each sample video frame set, respectively inputting each first sample video frame included in the training subset corresponding to the sample video frame set into a to-be-trained screen-splash recognition model to obtain a recognition result corresponding to each first sample video frame, and determining a training indication value corresponding to the training subset based on the recognition result corresponding to each first sample video frame and a preset training function; training the screen-splash recognition model based on the training indication value to obtain a trained screen-splash recognition model;
and respectively inputting each second sample video frame included in the verification subset corresponding to the sample video frame set into the trained screen-splash recognition model to obtain a recognition result corresponding to each second sample video frame, and determining the recognition accuracy corresponding to the trained screen-splash recognition model based on the recognition result corresponding to each second sample video frame.
Optionally, the apparatus further comprises a compressing unit, configured to:
and performing model compression on the trained screen-splash recognition model to obtain a compressed screen-splash recognition model.
It should be noted that: in the apparatus for identifying screen splash according to the above embodiment, the division into functional modules is used only for illustration when identifying screen splash; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for identifying screen splash and the method for identifying screen splash provided by the above embodiments belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not repeated here.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably with configuration and performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where at least one instruction is stored in the memory 602 and is loaded and executed by the processor 601 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, and may include other components for implementing the functions of the device, which are not described here again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including instructions executable by a processor in a terminal, is also provided to perform the method of identifying screen splash in the above embodiments. The computer-readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of identifying screen splash, the method comprising:
receiving a video stream sent by a video sending end;
inputting video frames of the video stream into a trained screen-splash recognition model, so that a plurality of feature extraction modules arranged in sequence in the screen-splash recognition model perform serial processing on the video frames to obtain first feature data output by the plurality of feature extraction modules respectively;
determining target fusion feature data based on first feature data respectively output by the plurality of feature extraction modules;
inputting the target fusion feature data into a classification module in the screen-splash recognition model to obtain a recognition result indicating whether the video frame has screen splash;
and if the recognition result indicates that the target video frame has screen splash, sending a screen splash notification to the video sending end.
2. The method according to claim 1, wherein the determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules comprises:
and inputting the first feature data output by each pre-specified target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data.
3. The method according to claim 2, wherein the inputting the first feature data output by each pre-designated target feature extraction module in the plurality of feature extraction modules into the fusion module of the screen-splash recognition model for fusion processing to obtain target fusion feature data comprises:
performing feature mapping on the first feature data output by each pre-specified target feature extraction module to obtain multiple groups of second feature data with the same dimension;
and performing fusion processing on the multiple groups of second feature data with the same dimension to obtain target fusion feature data.
4. The method of claim 3, wherein the pre-specified target feature extraction modules comprise: a first feature extraction module arranged at the last position, a second feature extraction module arranged at the second-to-last position, and a third feature extraction module arranged at the third-to-last position among the plurality of sequentially arranged feature extraction modules;
the fusion processing of the multiple groups of second feature data with the same dimension to obtain target fusion feature data includes:
performing feature fusion on second feature data corresponding to the third feature extraction module and second feature data corresponding to the first feature extraction module to obtain first fusion feature data;
performing feature fusion on second feature data corresponding to the second feature extraction module and second feature data corresponding to the first feature extraction module to obtain second fusion feature data;
and performing feature splicing on the first fusion feature data and the second fusion feature data to obtain target fusion feature data.
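The fusion of claims 2-4 can likewise be sketched. In the code below, the 1x1-convolution feature mapping, the global pooling, and the element-wise addition are assumed stand-ins for the mapping and fusion operations; only the selection of the last three modules' outputs, the two pairwise fusions with the last module's features, and the final concatenation follow the wording of claim 4.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Illustrative sketch of claims 2-4: map the last three modules' outputs to the same
    dimension, fuse (third-last, last) and (second-last, last), then concatenate."""

    def __init__(self, in_channels=(64, 128, 256), common_dim=128):
        super().__init__()
        # Feature mapping: bring each selected first-feature-data tensor to a common dimension.
        self.mappers = nn.ModuleList(nn.Conv2d(c, common_dim, kernel_size=1) for c in in_channels)

    def forward(self, first_feature_data):
        # Pre-specified target modules: the last three feature extraction modules in the chain.
        third_last, second_last, last = first_feature_data[-3:]
        second_feature_data = [
            mapper(feat).mean(dim=(2, 3))   # same-dimension vectors after mapping and pooling
            for mapper, feat in zip(self.mappers, (third_last, second_last, last))
        ]
        f3, f2, f1 = second_feature_data     # third, second, and first (last-position) modules
        first_fusion = f3 + f1               # fuse third module's features with the first module's
        second_fusion = f2 + f1              # fuse second module's features with the first module's
        return torch.cat([first_fusion, second_fusion], dim=1)   # target fusion feature data


# Usage with feature maps whose channel counts match in_channels:
# feats = [torch.randn(1, 64, 28, 28), torch.randn(1, 128, 14, 14), torch.randn(1, 256, 7, 7)]
# fused = FusionModule()(feats)   # shape (1, 256)
```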
5. The method of claim 1, wherein, before inputting the video frame of the video stream into the trained screen-splash recognition model, the method further comprises:
acquiring a plurality of sample video frame sets;
training the screen-splash recognition model based on the sample video frames in each sample video frame set, and determining a recognition accuracy corresponding to the screen-splash recognition model after each round of training;
and when the recognition accuracy corresponding to the screen-splash recognition model reaches a preset accuracy threshold, stopping training the screen-splash recognition model to obtain the trained screen-splash recognition model.
6. The method of claim 5, wherein each sample video frame set comprises a training subset and a validation subset; and the training the screen-splash recognition model based on the sample video frames in each sample video frame set and determining the recognition accuracy corresponding to the screen-splash recognition model after each round of training comprises:
for each sample video frame set, respectively inputting each first sample video frame included in the training subset corresponding to the sample video frame set into the screen-splash recognition model to be trained to obtain a recognition result corresponding to each first sample video frame, and determining a training indication value corresponding to the training subset based on the recognition result corresponding to each first sample video frame and a preset training function; training the screen-splash recognition model based on the training indication value to obtain a trained screen-splash recognition model;
and respectively inputting each second sample video frame included in the validation subset corresponding to the sample video frame set into the trained screen-splash recognition model to obtain a recognition result corresponding to each second sample video frame, and determining the recognition accuracy corresponding to the trained screen-splash recognition model based on the recognition result corresponding to each second sample video frame.
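Claims 5 and 6 describe training over several sample video frame sets, each split into a training subset and a validation subset, with training stopping once a preset accuracy threshold is reached. The sketch below assumes cross-entropy loss as the "preset training function" and the Adam optimizer; both are illustrative choices, not part of the claims.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_until_accurate(model, sample_sets, acc_threshold=0.95, lr=1e-3, device="cpu"):
    """Illustrative sketch of claims 5-6: one training pass per sample video frame set,
    followed by validation; stop once the preset accuracy threshold is reached."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                # assumed "preset training function"

    for train_subset, val_subset in sample_sets:     # each set has a training and a validation subset
        model.train()
        for frames, labels in DataLoader(train_subset, batch_size=32, shuffle=True):
            frames, labels = frames.to(device), labels.to(device)
            loss = criterion(model(frames), labels)  # training indication value for this subset
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                         # train the model based on the indication value

        model.eval()
        correct = total = 0
        with torch.no_grad():
            for frames, labels in DataLoader(val_subset, batch_size=32):
                preds = model(frames.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        accuracy = correct / max(total, 1)           # recognition accuracy after this round

        if accuracy >= acc_threshold:                # preset accuracy threshold reached: stop
            break
    return model
```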
7. The method of claim 5, wherein after the trained screen-splash recognition model is obtained, the method further comprises:
performing model compression processing on the trained screen-splash recognition model to obtain a compressed screen-splash recognition model.
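Claim 7 leaves the compression method open. As one hedged example only, post-training dynamic quantization of the linear layers (a technique the claim does not mandate) could serve as the compression processing:

```python
import torch
import torch.nn as nn

def compress(trained_model: nn.Module) -> nn.Module:
    # One possible compression step (an assumption; the claim does not fix a method):
    # post-training dynamic quantization of linear layers to int8.
    return torch.ao.quantization.quantize_dynamic(trained_model, {nn.Linear}, dtype=torch.qint8)
```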
8. An apparatus for identifying a screen splash, the apparatus comprising:
the receiving unit is used for receiving the video stream sent by the video sending end;
the processing unit is used for inputting a video frame of the video stream into the trained screen-splash recognition model, so that a plurality of sequentially arranged feature extraction modules in the screen-splash recognition model process the video frame in series to obtain first feature data respectively output by the plurality of feature extraction modules; determining target fusion feature data based on the first feature data respectively output by the plurality of feature extraction modules; and inputting the target fusion feature data into a classification module of the screen-splash recognition model to obtain a recognition result indicating whether the video frame has a screen splash;
and the sending unit is used for sending a screen-splash notification to the video sending end if the recognition result indicates that the video frame has a screen splash.
9. A computer device, comprising a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to perform the operations performed by the method for identifying a screen splash according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform the operations performed by the method for identifying a screen splash according to any one of claims 1 to 7.
CN202110587106.6A 2021-05-27 2021-05-27 Method, device, equipment and storage medium for identifying screen Active CN113177529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110587106.6A CN113177529B (en) 2021-05-27 2021-05-27 Method, device, equipment and storage medium for identifying screen

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110587106.6A CN113177529B (en) 2021-05-27 2021-05-27 Method, device, equipment and storage medium for identifying screen

Publications (2)

Publication Number Publication Date
CN113177529A true CN113177529A (en) 2021-07-27
CN113177529B CN113177529B (en) 2024-04-23

Family

ID=76927295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110587106.6A Active CN113177529B (en) 2021-05-27 2021-05-27 Method, device, equipment and storage medium for identifying screen

Country Status (1)

Country Link
CN (1) CN113177529B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104202611A (en) * 2014-09-01 2014-12-10 赛特斯信息科技股份有限公司 Method of realizing detection on broken screen defect in video file based on video decoder
CN107968481A (en) * 2016-10-18 2018-04-27 北京机电工程研究所 A kind of Autonomous test intelligent apparatus and method for electric force pole tower
CN111027347A (en) * 2018-10-09 2020-04-17 杭州海康威视数字技术股份有限公司 Video identification method and device and computer equipment
CN109522822A (en) * 2018-10-30 2019-03-26 北京奇虎科技有限公司 A kind of video detecting method and device
WO2020087974A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Model generation method and device
CN109543735A (en) * 2018-11-14 2019-03-29 北京工商大学 Video copying detection method and its system
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN110570410A (en) * 2019-09-05 2019-12-13 河北工业大学 Detection method for automatically identifying and detecting weld defects
CN112749608A (en) * 2020-06-08 2021-05-04 腾讯科技(深圳)有限公司 Video auditing method and device, computer equipment and storage medium
CN112183289A (en) * 2020-09-22 2021-01-05 北京金山云网络技术有限公司 Method, device, equipment and medium for detecting patterned screen
CN112232164A (en) * 2020-10-10 2021-01-15 腾讯科技(深圳)有限公司 Video classification method and device
CN112257569A (en) * 2020-10-21 2021-01-22 青海城市云大数据技术有限公司 Target detection and identification method based on real-time video stream
CN112272311A (en) * 2020-10-21 2021-01-26 腾讯科技(北京)有限公司 Method, device, terminal, server and medium for repairing splash screen
CN112766361A (en) * 2021-01-18 2021-05-07 山东师范大学 Target fruit detection method and detection system under homochromatic background

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI, L et al.: "Replayed video attack detection based on motion blur analysis", IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, vol. 14, no. 9, 30 June 2019 (2019-06-30), pages 2246 - 2261, XP011726742, DOI: 10.1109/TIFS.2019.2895212 *
ZHANG Congcong; HE Ning: "Human action recognition method based on key-frame two-stream convolutional network", Journal of Nanjing University of Information Science and Technology (Natural Science Edition), no. 06, 28 November 2019 (2019-11-28) *
DU Dawei: "Research on video abnormal event detection in crowded scenes", China Master's Theses Full-text Database, vol. 2014, 15 January 2014 (2014-01-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612830A (en) * 2022-03-14 2022-06-10 腾讯科技(深圳)有限公司 Method, device and equipment for identifying screen pattern image and storage medium

Also Published As

Publication number Publication date
CN113177529B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
Ying et al. Patch-VQ: 'Patching up' the video quality problem
CN109214337B (en) Crowd counting method, device, equipment and computer readable storage medium
CN111985281B (en) Image generation model generation method and device and image generation method and device
CN112312231B (en) Video image coding method and device, electronic equipment and medium
WO2018133825A1 (en) Method for processing video images in video call, terminal device, server, and storage medium
CN108933935A (en) Detection method, device, storage medium and the computer equipment of video communication system
CN114095744B (en) Video live broadcast method and device, electronic equipment and readable storage medium
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
CN110677718B (en) Video identification method and device
CN104506946B (en) A kind of TV programme recognition methods and system based on image recognition
WO2023217138A1 (en) Parameter configuration method and apparatus, device, storage medium and product
CN113542909A (en) Video processing method and device, electronic equipment and computer storage medium
CN116580330A (en) Machine test abnormal behavior detection method based on double-flow network
CN112257729A (en) Image recognition method, device, equipment and storage medium
CN113177529B (en) Method, device, equipment and storage medium for identifying screen
US11430132B1 (en) Replacing moving objects with background information in a video scene
WO2020168515A1 (en) Image processing method and apparatus, image capture processing system, and carrier
CN106373107B (en) Smart phone automated graphics deblurring system and method
CN111353330A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114429484A (en) Image processing method and device, intelligent equipment and storage medium
CN112668504A (en) Action recognition method and device and electronic equipment
CN114079777A (en) Video processing method and device
CN117478824B (en) Conference video generation method and device, electronic equipment and storage medium
CN112634460B (en) Outdoor panorama generation method and device based on Haar-like features
CN110647933B (en) Video classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant