CN111432206A - Video definition processing method and device based on artificial intelligence and electronic equipment - Google Patents


Info

Publication number
CN111432206A
CN111432206A (Application CN202010334489.1A)
Authority
CN
China
Prior art keywords
video
definition
foreground
image frames
sharpness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010334489.1A
Other languages
Chinese (zh)
Inventor
杨天舒
黄嘉文
沈招益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010334489.1A
Publication of CN111432206A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a video definition processing method and device based on artificial intelligence and an electronic device. The method comprises the following steps: extracting a plurality of image frames to be identified from a video; carrying out definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame; determining a first video definition of the video based on the foreground definition of each image frame, and using the first video definition as the definition recognition result of the video; when the first video definition of the video does not meet the definition condition, performing definition recognition on the backgrounds of the image frames to obtain the background definition of each image frame; and determining a second video definition of the video based on the background definition of each image frame, and using it as the updated definition recognition result of the video. By the method and the device, the definition of the video can be efficiently and accurately identified.

Description

Video definition processing method and device based on artificial intelligence and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to a video definition processing method and device based on artificial intelligence and electronic equipment.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Computer vision is an important part of artificial intelligence software technology and has developed rapidly in recent years. Image recognition is an important branch of computer vision, and a video definition evaluation result can be given by recognizing image frames with image recognition technology. However, such evaluation mainly targets static scenes; for dynamic scenes, the recognition results of video definition are not ideal.
Disclosure of Invention
The embodiment of the invention provides a video definition processing method and device based on artificial intelligence and electronic equipment, which can efficiently and accurately identify the definition of a video.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a video definition processing method based on artificial intelligence, which comprises the following steps:
extracting a plurality of image frames to be identified from a video;
carrying out definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame;
determining a first video definition of the video based on the foreground definition of each image frame, and using the first video definition as a definition identification result of the video;
when the first video definition of the video does not meet the definition condition, performing definition recognition on the backgrounds of the image frames to obtain the background definition of each image frame;
determining a second video sharpness of the video based on the background sharpness of each of the image frames and as an updated sharpness identification result of the video.
The embodiment of the invention provides a video definition processing device based on artificial intelligence, which comprises:
the extraction module is used for extracting a plurality of image frames to be identified from the video;
the first identification module is used for identifying the definition of the foreground in the plurality of image frames to obtain the foreground definition of each image frame;
the first determining module is used for determining a first video definition of the video based on the foreground definition of each image frame and taking the first video definition as a definition identification result of the video;
the second identification module is used for identifying the definition of the background of the plurality of image frames when the definition of the first video of the video does not meet the definition condition, so as to obtain the background definition of each image frame;
a second determining module for determining a second video sharpness of the video based on the background sharpness of each of the image frames and as an updated sharpness identification result of the video.
In the foregoing solution, the extracting module is configured to:
performing frame extraction on the video at equal intervals to obtain a first image frame set;
clustering image frames in the first image frame set to obtain a plurality of similar image frame subsets, randomly extracting one image frame from each similar image frame subset, and combining these with the images in the first image frame set that are not clustered into any similar image frame subset to form a second image frame set;
and filtering out image frames meeting the blurring condition from the second image frame set, and taking the remaining multi-frame image frames in the second image frame set as image frames to be identified.
In the foregoing solution, the first identifying module is configured to:
and mapping the image characteristics of the image frame into confidence degrees corresponding to different foreground definition categories, and taking the foreground definition category corresponding to the maximum confidence degree as the foreground definition of the image frame.
In the foregoing solution, the first determining module is configured to:
the foreground sharpness categories include: the foreground is clear, common and fuzzy;
determining the number of the image frames to be identified included in each foreground definition category based on the foreground definition category to which each image frame belongs;
and determining the first video definition of the video according to the proportion of the number of the image frames included in each foreground definition category in the total number, wherein the total number is the count of the plurality of image frames to be identified.
In the foregoing solution, the first determining module is configured to:
the foreground definition category with clear foreground corresponds to a first proportional threshold, the foreground definition category with common foreground corresponds to a second proportional threshold, the foreground definition category with fuzzy foreground corresponds to a third proportional threshold, and the second proportional threshold, the first proportional threshold and the third proportional threshold are arranged in descending order;
when the proportion of the number of image frames in the total number of the image frames included in the foreground definition category with clear foreground is larger than the first proportion threshold value, determining that the first video definition of the video is clear;
when the proportion of the number of the image frames included in the general foreground definition category in the total number is larger than the second proportion threshold value, and the proportion of the number of the image frames included in the foreground definition category with blurred foreground in the total number is smaller than a third proportion threshold value, determining that the first video definition of the video is general;
when the proportion of the number of the image frames in the total number of the foreground definition categories with clear foreground is smaller than the first proportion threshold value and the proportion of the number of the image frames in the total number of the foreground definition categories with blurred foreground is zero, determining that the first video definition of the video is general;
and when the proportion of the number of the image frames in the total number of the foreground definition categories with blurred foreground is larger than a third proportion threshold value, determining the first video definition of the video as blurred.
In the foregoing solution, the second identifying module is configured to:
performing the following processing for each of the image frames:
mapping image features of the image frame to confidence levels of different background definition categories;
wherein the background definition categories include: the background is clear and the background is blurred.
In the foregoing solution, the second determining module is configured to:
accumulating the confidence coefficients of the image frames belonging to the background definition category in which the background is blurred, and averaging them to obtain a mean confidence coefficient;
and when the mean confidence coefficient is greater than a confidence coefficient threshold value, determining that the second video definition of the video is fuzzy, and when the mean confidence coefficient is less than or equal to the confidence coefficient threshold value, determining that the second video definition of the video is general.
In the foregoing solution, the second determining module is further configured to:
acquiring the category information of the video;
and searching the confidence coefficient threshold value corresponding to the video category information in the corresponding relation between the video categories and the confidence coefficient threshold value.
In the above solution, the apparatus for processing video sharpness based on artificial intelligence further includes: a recommendation module to: and sending the definition recognition result of the video to a recommendation system so that the recommendation system executes corresponding recommendation operation according to the definition of the video.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video definition processing method based on artificial intelligence provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a computer-readable storage medium, wherein executable instructions are stored in the computer-readable storage medium and used for realizing the video definition processing method based on artificial intelligence provided by the embodiment of the invention when being executed by a processor.
The embodiment of the invention has the following beneficial effects:
the method comprises the steps of extracting a plurality of image frames from a video, carrying out definition recognition on the foregrounds of the image frames to obtain a recognition result of video definition, carrying out definition recognition on the backgrounds of the corresponding image frames according to judgment of the recognition result to obtain an updated video definition recognition result, and being suitable for dynamic videos to realize efficient and accurate recognition, thereby improving the efficiency and precision of video definition recognition.
Drawings
FIG. 1 is a block diagram of an architecture of an artificial intelligence based video sharpness processing system according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a terminal device according to an embodiment of the present invention;
FIG. 3 is a block diagram of an artificial intelligence based video sharpness processing apparatus according to an embodiment of the present invention;
FIG. 4A is a flowchart illustrating an artificial intelligence based video sharpness processing method according to an embodiment of the present invention;
FIG. 4B is a flowchart illustrating an artificial intelligence based video sharpness processing method according to an embodiment of the present invention;
FIG. 4C is a flowchart illustrating an artificial intelligence based video sharpness processing method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a foreground sharpness model provided in an embodiment of the present invention;
FIG. 6 is an architectural diagram of a recommendation system provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of two image frames extracted from a short video according to an embodiment of the present invention;
fig. 8 is a flowchart illustrating a video sharpness processing method based on artificial intelligence according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; it is understood that "first", "second", and "third" may be interchanged in a particular order or sequence where appropriate, so that the embodiments of the invention described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before further detailing the embodiments of the present invention, the terms and expressions used in the embodiments are explained; the following explanations apply to them.
1) Video definition: an important index for measuring video quality. Definition refers to the clarity of each detail and its boundary in an image, so image quality can be compared by viewing the definition of the replayed image; in this application, the definition of a video identified by artificial intelligence is called the definition recognition result.
2) Convolutional Neural Network (CNN): a class of feedforward neural networks (FNN, Feedforward Neural Network) that include convolution calculations and have a deep structure, and one of the representative algorithms of deep learning. A convolutional neural network has a representation learning capability and can perform shift-invariant classification of an input image according to its hierarchical structure.
3) Foreground: the person or object in the video that is located in front of the subject or near the front edge of the picture; the definition of the foreground is called the foreground definition.
4) Background: the scenery behind the subject and far from the camera, which enriches the spatial content of the picture and reflects characteristics such as place, environment, season, and time. The definition of the background is called the background definition.
In the related art, the definition of a video is determined by: (1) judging the definition of the video according to the definition of target frames; (2) judging the definition of the video based on a 3D convolutional neural network (a deep learning method); and (3) judging the definition of the video based on a time series model such as a 2D convolutional neural network (a deep learning method) combined with a Long Short-Term Memory network (LSTM). These are respectively described below.
(1) Judging the definition of the video according to the definition of target frames: frames are extracted from the video at fixed time intervals, or some transition frames are filtered out by traditional operators, and target frames of the video are selected; after the target frames are obtained, gradient features of the target frames are extracted using traditional operators (the Canny operator, the Sobel operator, the Laplacian operator, and the like), a weighted value of the features is calculated, and the definition of the video is obtained by comparing the weighted value with a preset threshold.
(2) Judging the definition of the video based on a 3D convolutional neural network (a deep learning method): a common 3D convolutional neural network model such as 3D-resnet is built, the labeled video data is fed into the model for training, and the trained model is finally used to predict video definition.
(3) Judging the definition of the video based on a time series model such as a 2D convolutional neural network (a deep learning method) combined with LSTM: a common convolutional neural network model such as 2D-resnet is built to obtain the features of each frame, the features between frames are fused, and the definition of the video is predicted according to the fused features.
In the embodiment of the present invention, the following technical problems may occur in the practical application process of the above method in the related art:
(1) Videos have characteristics such as rich scenes and fast content changes. In particular, in some frequently occurring life scenes, such as square dancing and street skateboarding videos, motion frames are easily extracted in the frame extraction process; if the obtained target frames are all motion frames, the recognition result of the target frames cannot accurately represent the definition of the whole video. The recognition result of this method therefore depends entirely on the obtained target frames, and the definition of various types of videos cannot be accurately identified.
(2) Methods (2) and (3) both consider the continuity between frames. Because backend processing capacity is limited in actual service scenarios, recognition based on a time series model is slow in both methods, so the real-time processing efficiency of the backend is low.
In view of the above problems, embodiments of the present invention provide a video sharpness processing method and apparatus based on artificial intelligence, and an electronic device, which can efficiently and accurately identify the sharpness of a video.
The following describes an exemplary application of the electronic device provided by the embodiment of the present invention. The electronic device may be a server, for example a server deployed in the cloud: according to a video uploaded remotely by a terminal device, it extracts a plurality of image frames from the video, performs definition recognition on the foregrounds of the image frames to obtain a video definition recognition result, and, depending on the judgment of that result, performs definition recognition on the backgrounds of the corresponding image frames to obtain an updated video definition recognition result. The electronic device may also be a terminal device, for example a handheld terminal device, which performs a series of definition recognition processing on a video input on the terminal device to obtain the video definition recognition result. By operating the artificial-intelligence-based video definition processing scheme provided by the embodiment of the present invention, the accuracy of video definition recognition can be improved, the applicability of video definition processing to actual service scenarios is enhanced, and the processing efficiency of video definition recognition on the electronic device is improved; the scheme suits multiple application scenarios, for example, a recommendation system can preferentially recommend videos with high definition.
An exemplary application of the artificial intelligence based video sharpness processing method provided by the embodiment of the present invention is described below, and referring to fig. 1, fig. 1 is an architectural schematic diagram of an artificial intelligence based video sharpness processing system 100 provided by the embodiment of the present invention. The video sharpness processing system 100 based on artificial intelligence includes: a server 200, a network 300 and a terminal device 400 (the terminal device 400-1 and the terminal device 400-2 are exemplarily shown), the terminal device 400 is connected to the server 200 through the network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal device 400 is used to obtain a video sample, for example, when a user (e.g., user a and user B in fig. 1) inputs a video (e.g., selects a local video file or takes a video) through a video input interface.
In some embodiments, the terminal device 400 locally executes the artificial-intelligence-based video definition processing method provided by the embodiment of the present invention to obtain a video definition recognition result for a video input by the user. For example, the user opens a video input interface on the terminal device 400 and inputs a video in it; the terminal device 400 performs a series of definition recognition processing on the video to obtain the video definition recognition result, and the result is displayed on the video input interface 410 of the terminal device 400 (the video input interface 410-1 and the video input interface 410-2 are exemplarily shown).
In some embodiments, the terminal device 400 may also send a video input by the user on the terminal device 400 to the server 200 through the network 300 and invoke the artificial-intelligence-based video definition processing function provided by the server 200. The server 200 performs a series of definition recognition processing on the input video through the artificial-intelligence-based video definition processing method provided by the embodiments of the present invention to obtain a video definition recognition result. For example, the user opens a video input interface on the terminal device 400 and inputs a video in the video input interface; the terminal device sends the video to the server 200 through the network 300; after receiving the video, the server 200 recognizes the definition of the video and returns the obtained video definition recognition result to the terminal device, which displays it on the display interface 410 of the terminal device 400; alternatively, the server 200 directly gives the video definition result of the video.
The embodiment of the invention can be widely applied to video definition processing scenarios. For example, when the backend of a video APP (application) checks the basic information of a video (whether the video content is clear), a strategy is formulated by combining the characteristics of the video: a plurality of image frames are extracted from the video, definition recognition is performed on the foregrounds of the image frames to obtain a video definition recognition result, and, depending on the judgment of that result, definition recognition is performed on the backgrounds of the corresponding image frames to obtain an updated video definition recognition result. The definition of the video is thus identified efficiently and accurately, finally achieving the purpose of simulating the definition given by human senses while accelerating real-time processing efficiency. The artificial-intelligence-based video definition processing system 100 can also be applied to a recommendation system: the obtained video definition result is input into the recommendation system so that it recommends videos with higher definition to users to increase the video click rate and watching time, and the result can also be stored in a server and subsequently used by the recommendation system offline. In addition, other scenarios related to video definition processing are potential application scenarios of the invention.
Next, an electronic device will be described as an example of a terminal device. Referring to fig. 2, fig. 2 is a schematic diagram of an architecture of a terminal device 400 (for example, the terminal device 400-1 and the terminal device 400-2 shown in fig. 1) provided in an embodiment of the present invention, where the terminal device 400 shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the artificial intelligence based video sharpness processing apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 2 shows an artificial intelligence based video sharpness processing apparatus 455 stored in a memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: an extraction module 4551, a first recognition module 4552, a first determination module 4553, a second recognition module 4554, a second determination module 4555 and a recommendation module 4556; the extracting module 4551, the first identifying module 4552, the first determining module 4553, the second identifying module 4554, and the second determining module 4555 are configured to implement the artificial intelligence-based video definition processing method according to the embodiment of the present invention, and the recommending module 4556 is configured to implement recommendation of a video definition recognition result according to the embodiment of the present invention, and these modules are logical, so that any combination or further splitting may be performed based on the implemented functions. The functions of the respective modules will be explained below.
The video sharpness processing method based on artificial intelligence provided by the embodiment of the present invention may be executed by the server, or may be executed by a terminal device (for example, the terminal device 400-1 and the terminal device 400-2 shown in fig. 1), or may be executed by both the server and the terminal device.
The following describes a video sharpness processing method based on artificial intelligence according to an embodiment of the present invention, with reference to an exemplary application and implementation of a terminal device according to an embodiment of the present invention.
Referring to fig. 3 and fig. 4A, fig. 3 is a schematic block diagram of an architecture of an artificial intelligence based video sharpness processing apparatus 455 according to an embodiment of the present invention, which shows a flow of video sharpness processing implemented by a series of modules, and fig. 4A is a schematic block diagram of a method for processing artificial intelligence based video sharpness according to an embodiment of the present invention, and the steps shown in fig. 4A will be described with reference to fig. 3.
In step S101, the server extracts a plurality of image frames to be recognized from the video.
The user can input a video on an input interface of the terminal, the terminal can forward the video to the server after the input is finished, and the server can extract a plurality of image frames to be identified from the video after receiving the video, so that the definition identification result of the video can be obtained according to the image frames.
In some embodiments, referring to fig. 3, the server extracting a plurality of image frames to be identified from the video includes: performing frame extraction on the video at equal intervals to obtain a first image frame set; clustering image frames in the first image frame set to obtain a plurality of similar image frame subsets, randomly extracting one image frame from each similar image frame subset, and combining these with the images in the first image frame set that are not clustered into any similar image frame subset to form a second image frame set; and filtering out the image frames meeting the blurring condition from the second image frame set, and taking the remaining image frames in the second image frame set as the image frames to be identified.
As an example, the server performs frame extraction on the video at equal intervals to obtain the first image frame set, which may be implemented with a multimedia video processing tool (FFMpeg, Fast Forward Mpeg). That is, after the server receives the video, it reads the stream information in the video file, calls the corresponding decoder in the FFMpeg decoding library to open the stream, extracts a fixed number of frames per second, and decodes a plurality of video frames from the video to obtain the first image frame set.
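As a sketch only: assuming the backend shells out to the FFMpeg command line (the frame rate and the output file pattern below are illustrative assumptions, not values from the patent), equal-interval frame extraction could look like this:

```python
# A minimal sketch of equal-interval frame extraction by shelling out to
# the FFMpeg command line; fps and the output pattern are assumptions.
import subprocess

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
    """Decode the video and write fps frames per second into out_dir."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         f"{out_dir}/frame_%05d.png"],
        check=True,
    )
```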
As an example, the filtering of image frames by the server is implemented by clustering. Specifically, the first image frame set is projected into a feature space to obtain an image feature vector corresponding to each frame; the distance (Euclidean distance or cosine distance) between each feature vector and the other feature vectors is calculated; feature vectors whose distances fall within a numerical threshold are classified into the same similar image frame category, obtaining a plurality of similar image frame subsets, and each feature vector not clustered into any subset is treated as a new similar image frame category of its own; one image frame is randomly extracted from each similar image frame subset as the representative of that category; and the image frames of all similar image frame categories are combined to form the second image frame set.
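A minimal sketch of this de-duplication step, under the assumption that each frame has already been projected to a feature vector (e.g., a color histogram); the greedy Euclidean grouping below stands in for whatever clustering the implementation actually uses, and the distance threshold is an assumed parameter:

```python
# Greedy clustering-based de-duplication of extracted frames (a sketch).
import random
import numpy as np

def dedup_frames(features: list, frames: list, dist_thresh: float) -> list:
    clusters = []                    # each cluster: list of frame indices
    for i, f in enumerate(features):
        for cluster in clusters:
            if np.linalg.norm(np.asarray(f)
                              - np.asarray(features[cluster[0]])) < dist_thresh:
                cluster.append(i)    # falls within this similarity class
                break
        else:
            clusters.append([i])     # starts a new similar-frame category
    # keep one randomly chosen frame per similar image frame subset
    return [frames[random.choice(c)] for c in clusters]
```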
The image frame extraction mode is suitable for various types of videos, and fuzzy frames caused by the extraction mode can be filtered out, so that the subsequent video definition processing process can be accurately carried out.
In step S102, the server performs definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame.
In some embodiments, referring to fig. 3, the performing, by the server, definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame includes: performing the following processing for each image frame based on the foreground sharpness model: and mapping the image characteristics of the image frame into confidence degrees corresponding to different foreground definition categories through a forward propagation process among all layers of the foreground definition model, and taking the foreground definition category corresponding to the maximum confidence degree as the foreground definition of the image frame.
As an example, referring to fig. 3, the foreground sharpness model includes an input layer, a hidden layer, and an output layer. The server outputs the confidence coefficient of the foreground definition category to which each image frame belongs through the forward propagation process among the input layer, the hidden layer and the output layer of the foreground definition model, the foreground definition category corresponding to the maximum confidence coefficient is used as the foreground definition of the image frame, and the foreground definition category comprises: the foreground is clear, the foreground is general and the foreground is fuzzy.
For example, the foreground definition categories of an image frame are divided into three categories, namely foreground blurred, foreground general, and foreground clear, and the result output for one image frame is: foreground blurred 2%, foreground general 7%, foreground clear 91%, so the foreground definition of the image frame is clear. It is worth noting that the closer the confidence of the correct category is to 1, the better the prediction.
Here, the forward propagation process of the foreground definition model is explained: the process in which data propagates from lower layers to higher layers is the forward propagation process. In forward propagation, an image is fed into the input layer, image features are extracted by the hidden layers, and the features then enter the output layer for classification to obtain the foreground definition category result; when the output result is consistent with the expected value, the result is output.
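For illustration, a hypothetical classifier head's outputs can be mapped to a foreground definition category as described above; the softmax mapping and the category order are assumptions for this sketch:

```python
# Mapping a hypothetical classifier's logits to the foreground definition
# category with the maximum confidence (a sketch, not the patented model).
import numpy as np

CATEGORIES = ("foreground blurred", "foreground general", "foreground clear")

def foreground_definition(logits: np.ndarray) -> tuple:
    conf = np.exp(logits - logits.max())
    conf /= conf.sum()            # softmax: logits -> confidences
    idx = int(conf.argmax())      # category with the maximum confidence
    return CATEGORIES[idx], float(conf[idx])

# e.g. foreground_definition(np.array([0.1, 1.3, 3.7])) -> mostly "clear"
```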
In step S103, the server determines a first video definition of the video based on the foreground definition of each image frame, and as a definition recognition result of the video.
Referring to fig. 4B, fig. 4B is a flowchart illustrating a video sharpness processing method based on artificial intelligence according to an embodiment of the present invention, and in some embodiments, fig. 4B shows that step S103 in fig. 4A can be implemented by steps S1031 to S1032 shown in fig. 4B.
In step S1031, the server determines the number of image frames to be identified included in each foreground definition category based on the foreground definition category to which each image frame belongs; in step S1032, the server determines the first video definition of the video according to the proportion of the number of image frames included in each foreground definition category in the total number, where the total number is the count of the plurality of image frames to be identified.
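An illustrative sketch of the counting in step S1031 (the per-frame labels below are made-up inputs):

```python
# Counting frames per foreground definition category and their proportions.
from collections import Counter

per_frame = ["foreground clear", "foreground general", "foreground clear",
             "foreground blurred", "foreground clear"]
counts = Counter(per_frame)
m = len(per_frame)                                   # total number of frames
ratios = {cat: n / m for cat, n in counts.items()}   # proportion per category
```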
In some embodiments, the foreground sharpness category with sharp foreground corresponds to a first proportional threshold, the foreground sharpness category with normal foreground corresponds to a second proportional threshold, the foreground sharpness category with fuzzy foreground corresponds to a third proportional threshold, and the second proportional threshold, the first proportional threshold, and the third proportional threshold are arranged in descending order. The first video sharpness of the video is determined according to the proportion of the number of image frames included in each foreground sharpness category in the total number, and can be identified through corresponding conditions, which is exemplified below.
Condition 1) when the proportion of the number of image frames included in the foreground definition category with clear foreground definition in the total number is larger than a first proportion threshold value, determining that the first video definition of the video is clear; and when the proportion of the number of the image frames in the total number of the foreground general definition categories is larger than a second proportion threshold value and the proportion of the number of the image frames in the total number of the foreground fuzzy foreground definition categories is smaller than a third proportion threshold value, determining that the first video definition of the video is general.
And 2) determining that the first video definition of the video is general when the proportion of the number of the image frames included in the foreground definition category with clear foreground in the total number is less than a first proportion threshold and the proportion of the number of the image frames included in the foreground definition category with blurred foreground in the total number is zero.
And 3) when the proportion of the number of the image frames in the total number of the image frames included in the foreground definition category of the foreground blur is larger than a third proportion threshold value, determining the first video definition of the video as the blur.
For example, assume that the total count of the image frames to be identified is m frames, where m is a natural number greater than zero. The server calls the foreground definition model to identify the foreground in the image frames and obtains: the number of image frames in the foreground-clear category is a, the number in the foreground-general category is b, and the number in the foreground-blurred category is c, where a, b, and c are natural numbers greater than zero. Denoting the first, second, and third proportional thresholds by T1, T2, and T3, several cases can be distinguished:
Case 1) if a/m > T1, the first video definition of the video is judged to be clear;
Case 2) if b/m > T2 and c/m < T3, the first video definition of the video is judged to be general; if a/m < T1 and c/m = 0, the first video definition of the video is likewise general;
Case 3) if c/m > T3, the first video definition of the video is blurred.
As an example of step S1032, the server determining the first video definition of the video according to the proportion of the number of image frames included in each foreground definition category in the total number may further include: when the proportion of frames in the foreground-clear category is greater than the first proportional threshold, determining that the first video definition of the video is clear; when the proportion of frames in the foreground-clear category is less than the first proportional threshold and the proportion of frames in the foreground-general category is greater than the second proportional threshold, determining that the first video definition of the video is general; and when the proportion of frames in the foreground-blurred category is greater than the third proportional threshold, determining that the first video definition of the video is blurred.
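The decision rules of steps S1031-S1032 can be sketched as follows; the concrete threshold values are assumptions chosen only to respect the stated ordering (second > first > third), since the patent does not disclose numbers:

```python
# First-video-definition decision: a = clear frames, b = general frames,
# c = blurred frames, out of m frames total. t1, t2, t3 are assumed values.
def first_video_definition(a: int, b: int, c: int, m: int,
                           t1: float = 0.6, t2: float = 0.8,
                           t3: float = 0.1) -> str:
    if a / m > t1:
        return "clear"
    if (b / m > t2 and c / m < t3) or (a / m < t1 and c == 0):
        return "general"
    if c / m > t3:
        return "blurred"  # triggers background definition recognition
    return "general"      # fallback for uncovered boundary cases
```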
In step S104, when the first video definition of the video does not satisfy the definition condition, the server performs definition recognition on the backgrounds of the plurality of image frames to obtain the background definition of each image frame.
In some embodiments, the server performs definition recognition on the background of a plurality of image frames to obtain the background definition of each image frame, including: the following processing is performed for each image frame: mapping image features of the image frames to confidence degrees of different background definition categories; wherein the background definition categories include: the background is clear and the background is blurred.
It should be noted that the background definition model is a binary classification model that outputs, for each image frame, the confidence that the background is clear and the confidence that the background is blurred; only the confidence of background blur is used in this application. As an example, the background definition model may employ a resnet-50 network.
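A hedged sketch of such a two-class background model built on torchvision's resnet-50, consistent with the network named in the text; the head configuration, eval-time usage, and output index convention are assumptions, not the patented training setup:

```python
# Binary background definition model (sketch): background clear vs. blurred.
import torch
from torchvision import models

model = models.resnet50(num_classes=2)  # 2-way head on a resnet-50 backbone
model.eval()

def background_confidences(batch: torch.Tensor) -> torch.Tensor:
    """batch: (N, 3, H, W) image tensor; returns per-frame confidences."""
    with torch.no_grad():
        return torch.softmax(model(batch), dim=1)
```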
It is noted that the first video definition recognition result of the video may be a qualitative definition category, such as clear, general, or blurred, but may also be a quantified definition score, such as any score from 0 to 10.
As an example of step S104, when the first video definition of the video is blurred, definition recognition is performed on the backgrounds of the plurality of image frames; or, when the first video definition of the video is a score lower than a definition score threshold, definition recognition is performed on the backgrounds of the plurality of image frames. For example, when the score of the first video definition of the video is 0-2 points, it is determined that the first video definition of the video does not satisfy the definition condition.
In step S105, a second video sharpness of the video is determined based on the background sharpness of each image frame, and is used as an updated sharpness recognition result of the video.
Referring to fig. 4C, fig. 4C is a flowchart illustrating a video sharpness processing method based on artificial intelligence according to an embodiment of the present invention, and in some embodiments, fig. 4C shows that step S105 in fig. 4A can be implemented by steps S1051-S1052 shown in fig. 4C. In step S1051, the confidence levels of each image frame belonging to the background sharpness category of the background blur are accumulated and averaged to obtain an average confidence level; in step S1052, when the mean confidence is greater than the confidence threshold, the second video sharpness of the video is determined to be blurred, and when the mean confidence is less than or equal to the confidence threshold, the second video sharpness of the video is determined to be general.
In some embodiments, referring to fig. 4C, based on fig. 3, before step S105 the following steps may also be performed:
in step S201, category information of a video is acquired;
in step S202, a confidence threshold corresponding to the video category information is searched for in the correspondence between the plurality of video categories and the confidence thresholds.
It is worth noting that the confidence thresholds used may differ for different video category information. For example, for dancing videos, slight ghosting of the person does not affect the viewing experience, since the person may be ghosted while the background remains clear; for a close-range video, even slight blur can noticeably degrade the perception of the video. The confidence threshold for a dance-category video may therefore be set higher than that for a close-range-category video; that is, the confidence threshold setting may differ for different categories of video. As an example, the correspondence between video category information and confidence thresholds may be stored in a database of the server or terminal device, to be invoked by the server or terminal device.
The artificial-intelligence-based video definition processing method can automatically select different recognition modes for different types of videos, realizing efficient and accurate recognition of various types of videos. For clear and general videos, only the foreground definition model needs to be called for definition recognition, so a definition recognition result can be obtained efficiently and accurately, improving the recall and accuracy of clear videos as well as processing efficiency. For blurred videos, the background definition model is further called for definition recognition to update the video definition recognition result. In this way, videos judged blurred merely because of a blurred foreground can be recalled, improving the accuracy for blurred videos. Finally, the purpose of simulating human perception of video definition can be achieved.
In some embodiments, referring to fig. 4B, based on fig. 4A, after step S105, the following steps may also be performed:
in step S106, the result of the definition recognition of the video is sent to the recommendation system, so that the recommendation system performs a corresponding recommendation operation according to the definition of the video.
In some embodiments, referring to fig. 6, fig. 6 is an architectural diagram of a recommendation system provided by an embodiment of the present invention. In fig. 6, the recommendation system includes a definition module, a personalization module, a recall module, a ranking module, a diversity module, and a recommendation module based on diversity + definition. The personalized module is used for calculating the user portrait according to the user behavior so as to obtain interest preferences under different dimensions according to user attributes, historical behaviors, interest contents and the like; the definition module is used for realizing the definition processing process of the video to obtain a candidate video with higher definition, and also can store the definition recognition result obtained by the definition module locally and directly use the result; the recall module comprises a recall model of a plurality of channels such as collaborative filtering, a theme model, content recall, Social Network Service (SNS) and the like, and ensures the diversity of candidate videos during recall; and the sorting module is used for uniformly scoring and sorting the recalled results, and selecting the video which is most interesting for the user and has high definition from the candidate videos, namely selecting the optimal video from the candidate videos so as to obtain the videos which meet the diversity and definition conditions. The recommendation system gives consideration to multiple dimensions of diversity, definition, individuation and the like of recommendation results, and can meet the requirement of diversity of users.
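As a hedged illustration of how the definition recognition result might feed the recall and ranking stages described above (the candidate record layout, with "definition" and "score" keys, is an illustrative assumption, not the recommendation system's actual schema):

```python
# Dropping blurred candidates and preferring clear videos before ranking.
def filter_and_rank(candidates: list) -> list:
    watchable = [v for v in candidates if v.get("definition") != "blurred"]
    # rank the remaining candidates, preferring clear videos, then by score
    return sorted(
        watchable,
        key=lambda v: (v.get("definition") == "clear", v.get("score", 0.0)),
        reverse=True,
    )
```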
The video definition model based on artificial intelligence can realize the definition processing of various videos by the background, and a large amount of labor cost is saved. Meanwhile, the obtained video definition result can be applied to a recommendation system, and videos with higher definition are recommended to users in the recommendation system so as to increase the video click rate and the watching time of the users; the video definition result can also be stored in the server and then used offline by the recommendation system.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
In the implementation process of the embodiment of the invention, the following problems were found in the related art. Taking the application scenario of judging the definition of short videos as an example: with the continuous development of the mobile internet, mobile platforms such as smartphones have risen rapidly, and short videos carried by smartphones/tablets have become a new form of content distribution in recent years. With the explosive growth of short video data, quickly and accurately judging the definition of short videos has become important for backend systems. However, short videos have numerous categories, and most of their frames are motion frames, which increases the difficulty of judging their definition. If the sampled frames are motion frames, the recognition result cannot accurately represent the definition of the whole short video; on the other hand, recognition that fuses features between frames through a time series model is slow. How to efficiently and accurately identify video definition is therefore the problem solved by the invention.
As an example, referring to fig. 7, fig. 7 is a schematic diagram of two image frames extracted from a short video according to an embodiment of the present invention; fig. 7 shows a foreground 301, a foreground 302, a background 303, and a background 304. Because the foregrounds of the two image frames contain ghosting, the recognition result of a related-art model is blurred; yet the backgrounds of the two frames are relatively clear, and human senses would judge the definition of the short video containing them to be general. Similar examples include dancing, sports, and similar category videos, where the person in the frame may be ghosted but the background is clear, and distant-view videos, where a single frame cannot be clearly resolved because the subject is too small. Videos of these types have a good overall impression, and according to human senses their definition should be judged general. That is, the definition recognition methods in the related art are not suitable for a variety of service scenarios.
In view of the above problems, the embodiments of the present invention provide a video sharpness processing method based on artificial intelligence, which can not only combine with a service scene to provide video sharpness processing more suitable for the service scene, but also has higher processing efficiency, and effectively improves the efficiency and precision of video sharpness processing.
The video definition processing and detecting method based on artificial intelligence provided by the embodiment of the invention extracts a plurality of image frames from a video, carries out definition recognition on the foregrounds of the image frames to obtain the recognition result of video definition, and carries out definition recognition on the background of the corresponding image frame according to the judgment of the recognition result to obtain the updated video definition recognition result.
Referring to fig. 8, fig. 8 is a schematic flow chart of a video sharpness processing method based on artificial intelligence according to an embodiment of the present invention, and an implementation scheme of the embodiment of the present invention is specifically as follows:
First, k frames are extracted at equal intervals from the video using the multimedia video processing tool FFmpeg. The video frames are then clustered according to features such as the color histogram and the Canny operator to filter out repeated frames, and a preliminary screening mainly filters out over-blurry frames. Finally, m frames are selected from the k frames, where k and m are fixed constants.
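As an illustrative sketch of this screening stage only (not the patented implementation): OpenCV stands in for FFmpeg, a simple histogram near-duplicate test stands in for the clustering step, and the constants k, m and the blur threshold are assumed values.

import cv2
import numpy as np

K_FRAMES = 30          # k: frames sampled at equal intervals (assumed value)
M_FRAMES = 8           # m: frames kept after screening (assumed value)
BLUR_THRESHOLD = 50.0  # Laplacian-variance cutoff for over-blurry frames (assumed)

def extract_candidate_frames(video_path):
    # Sample k frames at equal intervals.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, K_FRAMES, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()

    kept, kept_hists = [], []
    for frame in frames:
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        # Filter repeated frames: skip frames whose color histogram is
        # almost identical to an already-kept frame.
        if any(cv2.compareHist(h, hist, cv2.HISTCMP_CORREL) > 0.99
               for h in kept_hists):
            continue
        # Preliminary screening: drop over-blurry frames by Laplacian variance.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if cv2.Laplacian(gray, cv2.CV_64F).var() < BLUR_THRESHOLD:
            continue
        kept.append(frame)
        kept_hists.append(hist)
    return kept[:M_FRAMES]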
Frames are extracted from the short video uploaded by the user, each of the m frame images is normalized and fed into the foreground definition model, and the definition of the foreground content of each frame image is judged. The foreground definition model supports pictures of any size as input; its output is one of three categories (clear foreground, general foreground, blurred foreground) together with the corresponding confidence.
In some examples, the sharpness of the short video is identified from the per-frame foreground results by the following conditions, where the first, second, and third proportional thresholds are as defined for the determining module below:

Condition 1) if the proportion of clear frames exceeds the first proportional threshold, the short video is clear;

Condition 2) if the proportion of clear frames is below the first proportional threshold and all remaining frames are general, the short video is general;

Condition 3) if the proportion of general frames exceeds the second proportional threshold and the proportion of blurred frames is below the third proportional threshold, the short video is general;

Condition 4) if the proportion of blurred frames exceeds the third proportional threshold, the frames are input into the background sharpness model.
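A minimal sketch of conditions 1) to 4), using hypothetical threshold values T1, T2, and T3 in place of the concrete proportions:

T1, T2, T3 = 0.8, 0.6, 0.2  # assumed proportional thresholds

def first_video_sharpness(frame_labels):
    # frame_labels: foreground label of each of the m frames,
    # each one of 'clear', 'general' or 'blur'.
    m = len(frame_labels)
    p_clear = frame_labels.count('clear') / m
    p_general = frame_labels.count('general') / m
    p_blur = frame_labels.count('blur') / m
    if p_clear > T1:                      # condition 1)
        return 'clear'
    if p_blur > T3:                       # condition 4): defer to background model
        return 'NEEDS_BACKGROUND_MODEL'
    if p_general > T2 and p_blur < T3:    # condition 3)
        return 'general'
    if p_clear < T1 and p_blur == 0:      # condition 2)
        return 'general'
    return 'general'                      # fallback for remaining mixtures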
If the foreground definition model judges that the video is blurred, the category information of the video is acquired, and the definition of the short video is given based on the confidences of the m-frame results output by the background definition model together with the video category information.
As an example, assume the output of the background definition model is (cof_1, cof_2), where cof_1 represents the confidence that the background is blurred, cof_2 represents the confidence that the background is not blurred, and cof_1 + cof_2 = 1. The cof_1 values of the m frames are accumulated. Because the acceptable range of blur differs between video categories, a threshold for cof_1 is set per category: if cof_1_avg of the m frames is greater than thre, the short video is given a blurred label; otherwise it is given a general label, where thre is the confidence threshold corresponding to the video category.
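A minimal sketch of this aggregation step; the category names and per-category thresholds are assumed values for illustration:

CATEGORY_THRESHOLDS = {'dance': 0.6, 'sports': 0.55, 'landscape': 0.5}  # assumed

def second_video_sharpness(blur_confidences, video_category):
    # blur_confidences: cof_1 of each of the m frames, where the background
    # model outputs (cof_1, cof_2) with cof_1 + cof_2 = 1.
    cof_1_avg = sum(blur_confidences) / len(blur_confidences)
    thre = CATEGORY_THRESHOLDS.get(video_category, 0.5)  # default is assumed
    return 'blur' if cof_1_avg > thre else 'general'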
In some examples, the background sharpness model is mainly used to judge the background sharpness of transitional frames (e.g., motion frames in which the subject is blurred but the background is sharp). The background sharpness model is a binary classification model serving as an auxiliary judgment to the foreground sharpness model, determining whether the background of an image is sharp. The network mainly adopted by the background sharpness model is ResNet-50.
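One plausible instantiation of such a binary classifier, using the ResNet-50 backbone from torchvision; the replaced classification head and the softmax output are assumptions, since the patent does not specify them:

import torch
import torch.nn as nn
from torchvision import models

class BackgroundSharpnessModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet50(weights=None)
        # Replace the 1000-way ImageNet head with a 2-way head: blur / not blur.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, x):
        # Returns (cof_1, cof_2): confidences that the background is
        # blurred / not blurred, summing to 1 per image.
        return torch.softmax(self.backbone(x), dim=1)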
In some examples, referring to fig. 5, fig. 5 is a schematic structural diagram of the foreground sharpness model provided in an embodiment of the present invention. As an example, the main network of the foreground sharpness model mainly comprises convolutional layers, pooling layers, residual modules, down-sampling layers, an adaptive down-sampling layer, a random deactivation layer (dropout), and fully-connected layers. The residual module mainly selects convolutional layers with 5×5, 3×3, and 1×1 convolution kernels, followed by a skip connection; the down-sampling layers mainly down-sample the image using convolutional or pooling layers with a stride of 2; and the adaptive down-sampling layer can convert feature maps of any scale with the same channel number into feature vectors of the same dimension, so the convolutional neural network model can take images of any scale as input.
In some examples, the framework of the foreground sharpness model is described with reference to fig. 5. As an example, the input layer of the foreground sharpness model normalizes the input image, and the hidden layers of the foreground sharpness model may include: convolutional layers, pooling layers, residual modules, down-sampling layers, an adaptive pooling layer, a random deactivation layer, and fully-connected layers.
Convolutional layer: performs convolutional linear mapping on the input image to extract image features. It should be noted that certain features of the image can be extracted from the input image through a mathematical operation with a convolution kernel, and different convolution kernels extract different image features; therefore, for training the foreground sharpness model, extracting the best-performing image features can reduce model complexity and save a large amount of computing resources and computing time.
Pooling layer: mean pooling or max pooling can be selected to obtain the main image features. Mean pooling averages the values within each pooling region, while max pooling divides the feature map into several rectangular regions and takes the maximum of each region. After pooling, unimportant image features in the convolutional layer's feature map are removed, and the number of parameters is reduced to mitigate overfitting.
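As a toy illustration of the two pooling choices over a single 2×2 region:

import numpy as np

region = np.array([[1., 3.],
                   [2., 8.]])  # one 2x2 pooling region of a feature map
print(region.mean())           # mean pooling -> 3.5
print(region.max())            # max pooling  -> 8.0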
Down-sampling layer: a nonlinear down-sampling method. Through 4 serial down-sampling steps, the features extracted at each step are also output in parallel, yielding 4 groups of feature maps of different sizes; a residual module is added before each down-sampling step. It should be noted that the skip connection in the residual module provides direct transmission through a simple identity mapping, preserves the spatial structure of the gradient, and alleviates the gradient-shattering problem of the model.
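A minimal sketch of such a residual module, assuming equal input and output channel counts (which the patent does not specify):

import torch.nn as nn

class ResidualModule(nn.Module):
    # Residual module as described above: 5x5, 3x3 and 1x1 convolutions,
    # followed by a skip (identity) connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        # The identity mapping passes gradients through unchanged.
        return x + self.body(x)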
Adaptive pooling layer: converts the 4 groups of differently-sized feature maps output by the down-sampling layers into 4 groups of feature maps of the same size, and integrates these 4 groups into 1 group of feature maps through connection processing. It should be noted that the connection processing adds the 4 groups of feature maps pairwise through an addition operation and outputs all the summed feature maps. The adaptive pooling layer automatically computes the convolution kernel size and stride from the configured input and output image sizes so as to produce output of the configured size; that is, it can convert feature maps of the same channel number and arbitrary sizes into feature vectors of the same dimension, so the foreground sharpness model supports processing images of any size.
Fully-connected layer: integrates all previously obtained convolutional features into an N-dimensional feature vector. Between two fully-connected layers, a random deactivation layer (dropout) discards neuron nodes with a certain probability, weakening the joint adaptability between neuron nodes. For example, the dropout rate may be 50%, discarding half of the neuron nodes.
Output layer: classifies the N-dimensional feature vector with a softmax function so as to output the sharpness category of each frame image and the confidence corresponding to each sharpness category, where N is a natural number greater than zero.
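Putting these pieces together, a heavily simplified sketch follows; the layer widths are assumed, and the residual modules and four-scale parallel outputs described above are folded into a plain stack for brevity. The point illustrated is the adaptive pooling layer, which lets images of any resolution produce a fixed-dimension feature vector:

import torch
import torch.nn as nn

class ForegroundSharpnessModel(nn.Module):
    def __init__(self, num_classes=3):  # clear / general / blurred
        super().__init__()
        self.features = nn.Sequential(   # stride-2 convolutions down-sample
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))  # any HxW -> 1x1
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),           # random deactivation layer
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        feats = self.adaptive_pool(self.features(x)).flatten(1)
        return torch.softmax(self.classifier(feats), dim=1)  # per-class confidences

# Images of different resolutions map to the same-dimension output:
model = ForegroundSharpnessModel()
for size in [(224, 224), (480, 640)]:
    probs = model(torch.randn(1, 3, *size))
    print(probs.shape)  # torch.Size([1, 3]) in both cases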
In some examples, before the model is used, a quantitative standard is formulated for each of the three short-video sharpness categories (clear, general, and blurred) based on the nature of the videos themselves and the requirements of the business side, and the training samples are labeled accordingly.
The artificial-intelligence video definition model enables the backend to process short-video definition directly, which can save a large amount of labor cost. Meanwhile, the obtained results can be applied in a recommendation system, which recommends short videos of high definition to users so as to increase the click-through rate and users' watching time.
Continuing with the exemplary architecture of the artificial intelligence based video sharpness processing apparatus 455 provided by the embodiments of the present invention as implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based video sharpness processing apparatus 455 of the memory 440 may include:
an extracting module 4551, configured to extract a plurality of image frames to be identified from a video; the first identification module 4552 is configured to perform definition identification on a foreground in a plurality of image frames to obtain a foreground definition of each image frame; a first determining module 4553, configured to determine a first video definition of the video based on the foreground definition of each image frame, and use the first video definition as a definition recognition result of the video; a second identifying module 4554, configured to identify the sharpness of the background of the multiple image frames when the first video sharpness of the video does not meet the sharpness condition, so as to obtain the background sharpness of each image frame; a second determining module 4555, configured to determine a second video definition of the video based on the background definition of each image frame, and as an updated definition recognition result of the video.
In the foregoing scheme, the extracting module 4551 is configured to: perform frame extraction on the video at equal intervals to obtain a first image frame set; cluster the image frames in the first image frame set to obtain a plurality of similar image frame subsets, randomly extract one image frame from each similar image frame subset, and combine them with the images in the first image frame set that are not clustered into any similar image frame subset to form a second image frame set; and filter out the image frames meeting the blur condition from the second image frame set, taking the remaining image frames in the second image frame set as the image frames to be identified.
A first identification module 4552 configured to:
and mapping the image characteristics of the image frame into confidence degrees corresponding to different foreground definition categories, and taking the foreground definition category corresponding to the maximum confidence degree as the foreground definition of the image frame.
A first determining module 4553, configured to:
the foreground sharpness categories include: the foreground is clear, common and fuzzy;
determining the number of image frames to be identified included in each foreground definition category based on the foreground definition category to which each image frame belongs;
and determining the first video definition of the video according to the proportion of the number of the image frames included in each foreground definition category in the total number, wherein the total number is the count of the plurality of image frames to be identified.
A first determining module 4553, configured to:
the foreground definition category with clear foreground corresponds to a first proportional threshold, the foreground definition category with common foreground corresponds to a second proportional threshold, the foreground definition category with fuzzy foreground corresponds to a third proportional threshold, and the second proportional threshold, the first proportional threshold and the third proportional threshold are arranged in descending order;
when the proportion of the number of the image frames in the total amount of the foreground definition category with clear foreground is larger than a first proportion threshold value, determining that the first video definition of the video is clear;
when the proportion of the number of the image frames in the total number of the foreground general definition categories is larger than a second proportion threshold value and the proportion of the number of the image frames in the total number of the foreground fuzzy foreground definition categories is smaller than a third proportion threshold value, determining that the first video definition of the video is general;
when the proportion of the number of the image frames in the total number of the foreground definition categories with clear foreground is smaller than a first proportional threshold and the proportion of the number of the image frames in the total number of the foreground definition categories with fuzzy foreground is zero, determining that the first video definition of the video is general;
and when the proportion of the number of the image frames in the total number of the foreground definition categories with blurred foreground is larger than a third proportion threshold value, determining the first video definition of the video as blurred.
A second identification module 4554 configured to:
the following processing is performed for each image frame:
mapping image features of the image frames to confidence degrees of different background definition categories;
wherein the background definition categories include: the background is clear and the background is blurred.
A second determining module 4555, configured to:
accumulating the confidence coefficients of the image frames belonging to the background definition category of background blur and taking the average value to obtain a mean confidence coefficient;
and when the mean confidence coefficient is larger than the confidence coefficient threshold value, determining the second video definition of the video to be fuzzy, and when the mean confidence coefficient is smaller than or equal to the confidence coefficient threshold value, determining the second video definition of the video to be general.
A second determining module 4555, further configured to:
acquiring the category information of a video;
and searching the confidence coefficient threshold value corresponding to the video category information in the corresponding relation between the video categories and the confidence coefficient threshold value.
A recommendation module 4556 configured to: and sending the definition recognition result of the video to a recommendation system so that the recommendation system executes corresponding recommendation operation according to the definition of the video.
Embodiments of the present invention provide a storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method provided by embodiments of the present invention, for example, an artificial intelligence based video sharpness processing method as shown in fig. 4A, 4B or 4C.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be any of various devices including one of the above memories or any combination thereof.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts within a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the present invention, a plurality of image frames are extracted from a video, the definition of the foreground of each image frame is recognized to obtain a video definition recognition result, and the definition of the backgrounds of the corresponding image frames is recognized depending on that result to obtain an updated video definition recognition result. In this way, different recognition modes can be selected automatically for different types of videos, the definition of short videos can be identified efficiently and accurately, and the definition judgment of human perception is simulated. For clear and general videos, only the foreground definition model needs to be called, so the definition recognition result is obtained efficiently and accurately, improving the recall and precision of clear videos as well as the processing efficiency. For blurred videos, the background definition model is further called to update the video definition recognition result; in this way, videos judged blurred merely because of a blurred foreground can be recalled, improving the precision for blurred videos. The obtained video definition result is input into a recommendation system, so that the recommendation system recommends videos of higher definition to users to increase the video click-through rate and users' watching time.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A video sharpness processing method based on artificial intelligence, the method comprising:
extracting a plurality of image frames to be identified from a video;
carrying out definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame;
determining a first video definition of the video based on the foreground definition of each image frame, and using the first video definition as a definition identification result of the video;
when the first video definition of the video does not meet the definition condition, performing definition recognition on the backgrounds of the image frames to obtain the background definition of each image frame;
determining a second video sharpness of the video based on the background sharpness of each of the image frames and as an updated sharpness identification result of the video.
2. The method of claim 1, wherein the extracting the plurality of image frames to be identified from the video comprises:
performing frame extraction on the video at equal intervals to obtain a first image frame set;
clustering image frames in the first image frame set to obtain a plurality of similar image frame subsets, extracting one image frame from each similar image frame subset, and combining images which are not clustered to any similar image frame subset to form a second image frame set;
and filtering out image frames meeting the blurring condition from the second image frame set, and taking the remaining multi-frame image frames in the second image frame set as image frames to be identified.
3. The method of claim 1, wherein the sharpness identifying a foreground in the plurality of image frames to obtain a foreground sharpness for each of the image frames comprises:
performing the following processing for each of the image frames:
and mapping the image characteristics of the image frame into confidence degrees corresponding to different foreground definition categories, and taking the foreground definition category corresponding to the maximum confidence degree as the foreground definition of the image frame.
4. The method of claim 3,
the foreground sharpness categories include: the foreground is clear, common and fuzzy;
the determining a first video sharpness of the video based on the foreground sharpness of each of the image frames comprises:
determining the number of the image frames to be identified included in each foreground definition category based on the foreground definition category to which each image frame belongs;
and determining the first video definition of the video according to the proportion of the number of the image frames included in each foreground definition category in the total number, wherein the total number is the count of the plurality of image frames to be identified.
5. The method of claim 4,
the foreground definition category with clear foreground corresponds to a first proportional threshold, the foreground definition category with common foreground corresponds to a second proportional threshold, the foreground definition category with fuzzy foreground corresponds to a third proportional threshold, and the second proportional threshold, the first proportional threshold and the third proportional threshold are arranged in descending order;
the determining the first video definition of the video according to the proportion of the number of the image frames included in each foreground definition category in the total number comprises:
when the proportion of the number of image frames in the total number of the image frames included in the foreground definition category with clear foreground is larger than the first proportion threshold value, determining that the first video definition of the video is clear;
when the proportion of the number of the image frames included in the general foreground definition category in the total number is larger than the second proportion threshold value, and the proportion of the number of the image frames included in the foreground definition category with blurred foreground in the total number is smaller than a third proportion threshold value, determining that the first video definition of the video is general;
when the proportion of the number of the image frames in the total number of the foreground definition categories with clear foreground is smaller than the first proportion threshold value and the proportion of the number of the image frames in the total number of the foreground definition categories with blurred foreground is zero, determining that the first video definition of the video is general;
and when the proportion of the number of the image frames in the total number of the foreground definition categories with blurred foreground is larger than a third proportion threshold value, determining the first video definition of the video as blurred.
6. The method of claim 1, wherein said performing sharpness recognition on the background of the plurality of image frames to obtain the background sharpness of each of the image frames comprises:
performing the following processing for each of the image frames:
mapping image features of the image frame to confidence levels of different background definition categories;
wherein the background definition categories include: the background is clear and the background is blurred.
7. The method of claim 6, wherein said determining a second video sharpness for the video based on the background sharpness for each of the image frames comprises:
accumulating the confidence coefficients of the image frames belonging to the background definition category of the background blur and taking the average value to obtain an average confidence coefficient;
and when the mean confidence coefficient is greater than a confidence coefficient threshold value, determining that the second video definition of the video is fuzzy, and when the mean confidence coefficient is less than or equal to the confidence coefficient threshold value, determining that the second video definition of the video is general.
8. The method of claim 7, further comprising:
acquiring the category information of the video;
and searching the confidence coefficient threshold value corresponding to the video category information in the corresponding relation between the video categories and the confidence coefficient threshold value.
9. The method according to any one of claims 1 to 8, further comprising:
and sending the definition recognition result of the video to a recommendation system so that the recommendation system executes corresponding recommendation operation according to the definition of the video.
10. A video sharpness processing apparatus based on artificial intelligence, comprising:
the extraction module is used for extracting a plurality of image frames to be identified from the video;
the first identification module is used for identifying the definition of the foreground in the plurality of image frames to obtain the foreground definition of each image frame;
the first determining module is used for determining a first video definition of the video based on the foreground definition of each image frame and taking the first video definition as a definition identification result of the video;
the second identification module is used for identifying the definition of the background of the plurality of image frames when the definition of the first video of the video does not meet the definition condition, so as to obtain the background definition of each image frame;
a second determining module for determining a second video sharpness of the video based on the background sharpness of each of the image frames and as an updated sharpness identification result of the video.
CN202010334489.1A 2020-04-24 2020-04-24 Video definition processing method and device based on artificial intelligence and electronic equipment Pending CN111432206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010334489.1A CN111432206A (en) 2020-04-24 2020-04-24 Video definition processing method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010334489.1A CN111432206A (en) 2020-04-24 2020-04-24 Video definition processing method and device based on artificial intelligence and electronic equipment

Publications (1)

Publication Number Publication Date
CN111432206A true CN111432206A (en) 2020-07-17

Family

ID=71558364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010334489.1A Pending CN111432206A (en) 2020-04-24 2020-04-24 Video definition processing method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN111432206A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112351196A (en) * 2020-09-22 2021-02-09 北京迈格威科技有限公司 Image definition determining method, image focusing method and device
CN113627496A (en) * 2021-07-27 2021-11-09 交控科技股份有限公司 Method, device, electronic equipment and readable storage medium for predicting fault of turnout switch machine
CN114491146A (en) * 2022-04-01 2022-05-13 广州智慧城市发展研究院 Video image processing method suitable for video monitoring equipment
CN115171048A (en) * 2022-07-21 2022-10-11 北京天防安全科技有限公司 Asset classification method, system, terminal and storage medium based on image recognition



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination