CN113572976A - Video processing method and device, electronic equipment and readable storage medium - Google Patents

Video processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN113572976A
Authority
CN
China
Prior art keywords
target
sub
segment
video
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110164639.3A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202110164639.3A
Publication of CN113572976A
Legal status: Pending

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 - Details of television systems
    • H04N 5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the technical field of artificial intelligence video processing, and discloses a video processing method and apparatus, an electronic device, and a readable storage medium. The video processing method includes: acquiring an initial video to be processed, and performing shot segmentation on the initial video to obtain a plurality of sub-segments; for each sub-segment, determining an object identifier and an emotion label corresponding to the sub-segment; determining at least one target sub-segment from the at least one sub-segment, where the emotion label of the target sub-segment corresponds to a preset emotion label sequence; and acquiring target background music corresponding to the emotion label sequence, and synthesizing the target background music and the target sub-segments to obtain a target video. The video processing method provided by the present application can generate, on demand, target videos that match different target object identifiers and target emotion labels, and is not limited to a single object.

Description

Video processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, an electronic device, and a readable storage medium.
Background
With the development of computer and network technologies, the functions of electronic devices are becoming more and more diversified. For example, a user may make a video clip through the electronic device.
Video can be processed with deep learning methods based on neural networks, for example with a video GAN (Generative Adversarial Network).
However, current deep learning approaches to video processing can only generate a certain action video of a single person, and the output result is not controllable.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
in a first aspect, a video processing method is provided, including:
acquiring an initial video to be processed, and performing shot segmentation on the initial video to obtain a plurality of sub-segments;
for each sub-segment, determining an object identifier and an emotion label corresponding to the sub-segment;
determining at least one target sub-segment from the at least one sub-segment; wherein, the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags includes at least one target emotion tag;
and acquiring target background music corresponding to the emotion label sequence, and synthesizing the target background music and the target sub-segments to obtain a target video.
In an optional embodiment of the first aspect, for each sub-segment, determining an object identifier corresponding to the sub-segment comprises:
for each sub-segment, identifying at least one object present in the sub-segment;
based on the at least one object that appears, an object identification corresponding to the sub-segment is determined.
In an alternative embodiment of the first aspect, identifying at least one object present in the sub-segment comprises:
carrying out image detection on the video frame image of the sub-segment, and determining a target vector of the sub-segment;
and respectively matching the target vectors with the standard target vector of at least one object corresponding to the initial video to determine the objects appearing in the sub-segments.
In an optional embodiment of the first aspect, determining an object identification corresponding to the sub-segment based on the at least one object that appears comprises:
determining a valid object of the at least one object that appears;
and setting the identity corresponding to the effective object as the object identity corresponding to the sub-segment.
In an optional embodiment of the first aspect, determining a valid object of the at least one object that appears comprises:
determining a total number of video frame images in the sub-segment;
determining a first number of video frame images in which an object appears in a sub-segment;
and if the ratio of the first quantity to the total quantity is larger than a first preset ratio, setting the object as a valid object.
In an optional embodiment of the first aspect, determining the emotion label corresponding to the sub-segment comprises:
determining an emotion label of each video frame image in the sub-segment;
and setting the emotion label with the highest occurrence frequency in the video frame image of the sub-segment as the emotion label of the sub-segment.
In an optional embodiment of the first aspect, obtaining the target background music corresponding to the sequence of emotion labels comprises:
determining a target emotion label with the highest frequency of occurrence in the emotion label sequence;
acquiring target background music corresponding to a target emotion label with the highest occurrence frequency from a preset music database; the music database is provided with a plurality of background music, and each background music is provided with a corresponding emotion tag.
In an optional embodiment of the first aspect, before synthesizing the target video based on the target background music and the target sub-segment, the method further includes:
determining the emotion scale of each video frame image in each target sub-segment aiming at each target sub-segment;
synthesizing the target background music and the target sub-segments to obtain a target video, wherein the synthesizing comprises the following steps:
for each target sub-segment, the target sub-segment is clipped based on a target emotion scale, so that the emotion scale of the starting video frame image of the clipped target sub-segment accords with the target emotion scale;
and synthesizing the target background music and the clipped target sub-segments to obtain the target video.
In an optional embodiment of the first aspect, before synthesizing the target video based on the target background music and the target sub-segment, the method further includes:
generating at least one close-up video based on the target sub-segment; wherein the close-up video comprises a close-up picture of the object in the target sub-segment;
synthesizing the target background music and the target sub-segments to obtain a target video, wherein the synthesizing comprises the following steps:
and synthesizing the target video based on the target background music, the target sub-segment and the close-up video.
In an alternative embodiment of the first aspect, the synthesizing the target video based on the target background music and the target sub-segment includes:
determining the sequence of each target emotion tag in the emotion tag sequence;
synthesizing at least one target sub-segment based on the sequence of each target emotion label to obtain a target segment;
and synthesizing the target segment and the target background music to obtain the target video.
In a second aspect, there is provided a video processing apparatus comprising:
the shot segmentation module is used for acquiring an initial video to be processed and performing shot segmentation on the initial video to obtain a plurality of sub-segments;
the first determining module is used for determining an object identifier and an emotion label corresponding to each sub-segment;
a second determining module, configured to determine at least one target sub-segment from the at least one sub-segment; wherein, the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags includes at least one target emotion tag;
and the synthesis module is used for acquiring target background music corresponding to the emotion label sequence and synthesizing the target background music and the target sub-segments to obtain a target video.
In an optional embodiment of the second aspect, when determining, for each sub-segment, an object identifier corresponding to the sub-segment, the first determining module is specifically configured to:
for each sub-segment, identifying at least one object present in the sub-segment;
based on the at least one object that appears, an object identification corresponding to the sub-segment is determined.
In an alternative embodiment of the second aspect, the first determining module, when identifying at least one object present in a sub-segment, is specifically configured to:
carrying out image detection on the video frame image of the sub-segment, and determining a target vector of the sub-segment;
and respectively matching the target vectors with the standard target vector of at least one object corresponding to the initial video to determine the objects appearing in the sub-segments.
In an optional embodiment of the second aspect, when determining the object identifier corresponding to the sub-segment based on the at least one object that appears, the first determining module is specifically configured to:
determining a valid object of the at least one object that appears;
and setting the identity corresponding to the effective object as the object identity corresponding to the sub-segment.
In an optional embodiment of the second aspect, the first determining module, when determining a valid object of the at least one object that appears, is specifically configured to:
determining a total number of video frame images in the sub-segment;
determining a first number of video frame images in which an object appears in a sub-segment;
and if the ratio of the first quantity to the total quantity is larger than a first preset ratio, setting the object as a valid object.
In an optional embodiment of the second aspect, the first determining module, when determining the emotion label corresponding to the sub-segment, is specifically configured to:
determining an emotion label of each video frame image in the sub-segment;
and setting the emotion label with the highest occurrence frequency in the video frame image of the sub-segment as the emotion label of the sub-segment.
In an optional embodiment of the second aspect, the synthesizing module, when acquiring the target background music corresponding to the sequence of emotion labels, is specifically configured to:
determining a target emotion label with the highest frequency of occurrence in the emotion label sequence;
acquiring target background music corresponding to a target emotion label with the highest occurrence frequency from a preset music database; the music database is provided with a plurality of background music, and each background music is provided with a corresponding emotion tag.
In an optional embodiment of the second aspect, further comprising a third determining module, configured to:
determining the emotion scale of each video frame image in each target sub-segment aiming at each target sub-segment;
when the target video is obtained by synthesizing the target background music and the target sub-segment, the synthesizing module is specifically configured to:
for each target sub-segment, the target sub-segment is clipped based on a target emotion scale, so that the emotion scale of the starting video frame image of the clipped target sub-segment accords with the target emotion scale;
and synthesizing the target background music and the clipped target sub-segments to obtain the target video.
In an optional embodiment of the second aspect, further comprising a generating module, configured to:
generating at least one close-up video based on the target sub-segment; wherein the close-up video comprises a close-up picture of the object in the target sub-segment;
when the target video is obtained by synthesizing the target background music and the target sub-segment, the synthesizing module is specifically configured to:
and synthesizing the target video based on the target background music, the target sub-segment and the close-up video.
In an optional embodiment of the second aspect, when synthesizing the target video based on the target background music and the target sub-segment, the synthesizing module is specifically configured to:
determining the sequence of each target emotion tag in the emotion tag sequence;
synthesizing at least one target sub-segment based on the sequence of each target emotion label to obtain a target segment;
and synthesizing the target segment and the target background music to obtain the target video.
In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the video processing method shown in the first aspect of the present application is implemented.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the video processing method shown in the first aspect of the present application.
The technical solutions provided by the present application have the following beneficial effects:
The initial video is split into a plurality of sub-segments by shot segmentation, an object identifier and an emotion label are determined for each sub-segment, target sub-segments that match the target object identifier and the target emotion labels are determined from the plurality of sub-segments, and the target sub-segments are synthesized with target background music to obtain the target video. Because the emotion label sequence of the target background music corresponds to the emotion labels of the target sub-segments, the emotional match between the video pictures and the background music in the synthesized target video is improved. In addition, target videos that match different target object identifiers and target emotion labels can be generated on demand, and are not limited to a single object.
Furthermore, the target vector of a sub-segment is obtained by performing image detection on the video frame images of the sub-segment, and the objects appearing in the sub-segment are determined by matching the target vector against the standard target vectors of the objects, which improves the accuracy of object identification. Further, close-up recognition is performed on the target sub-segment to generate a close-up video, and the target video is synthesized from the target background music, the target sub-segment, and the close-up video, so that the target video contains close-ups of the object; the emotion changes of the object in the target video can thus be displayed more clearly, improving the display effect of the target video.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is an application scene diagram of a video processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a shot segmentation scheme for an initial video provided in one example of the present application;
fig. 4 is a schematic flowchart of a video processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scheme for clipping a target segment provided in an example of the present application;
FIG. 6 is a schematic diagram of a scheme for compositing target videos provided in one example provided herein;
fig. 7 is a flowchart illustrating a video processing method according to an example provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device for video processing according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "first" and "second" are used herein to distinguish different features and are not intended to limit the order or number of features; the number of corresponding features may be the same or different. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the application relates to an artificial intelligence video processing technology, and is specifically explained by the following embodiment.
The task of intelligent video production is to take a long video as input and generate a short video of related content (a video highlight reel) through an algorithm. A highlight reel (e.g., a short video related to a TV series on platforms such as Bilibili or Weishi) is typically 10-30 seconds long and shows, for example, interactions between the male and female leads, or various cute moments and skill highlights of a particular star. Producing such a highlight reel traditionally involves manually selecting materials (image sequences or video clips), determining their order, choosing the background music, and compositing the video; the time from collecting the materials to finishing the video ranges from roughly 3 to 10 hours. Generating video highlight reels is therefore labor- and effort-intensive work.
Video can be processed either manually or with a deep learning video GAN.
The manual approach requires manually screening the materials, manually arranging them, and manually selecting the background music and compositing the video.
VideoGAN can generate a short motion video of a single person against a fixed background. The specific method is as follows: camera motion is not considered (the background does not move), so the whole video is assumed to be composed of a static background and a dynamic foreground. A two-stream architecture is designed to generate the background b(z) and the foreground f(z) separately, and the foreground and background are fused linearly using a mask m(z), i.e., the output video is m(z)*f(z) + (1 - m(z))*b(z) (elementwise). The model takes a random noise vector as input, and the output video is obtained through the two-stream generation and fusion.
The main problems of the manual approach are: 1) the basic materials must be analyzed manually: material analysis has to start from the original video, e.g., shot segmentation, identifying which characters appear, and judging camera angles; 2) the storyline sequence of the target video must be designed manually: the style of the final highlight reel (e.g., an expression compilation, a storyline compilation, a multi-character mixed cut, etc.) has to be designed in advance; 3) the final materials must be selected and arranged manually; 4) the music must be matched manually: background music is chosen by hand according to the materials and the scenes to be shown.
The problems with VideoGAN are: 1) a single kind of result: it can only generate a single action video and cannot produce rich and varied video styles; 2) migrating to a new drama is difficult: for each character of a TV series, the data must be re-labeled and the two-stream branches of the GAN must be retrained so that the generated characters and environment match the series; 3) the results are unpredictable: the input is only a random noise perturbation, and the output result is not controllable; 4) manual scoring is still needed: since no soundtrack information is produced, the background music must still be chosen manually.
The present method builds on deep-learning face recognition and face attribute recognition: character recognition, emotion recognition, and music emotion recognition are performed on the video, and short clips of the target character's expressions together with background music are extracted to synthesize the video, thereby automatically generating a character highlight reel. It has the following effects:
1) it reduces manual input and the steps of manually selecting materials and manually editing and compositing;
2) materials can be screened with pre-trained deep learning models such as face recognition, music emotion recognition, and facial emotion recognition;
3) different editing effects (e.g., different emotions, alternating near and far face shots, etc.) can be obtained by adding extra face-attribute screening of the materials on top of the overall framework.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
As shown in fig. 1, the video processing method of the present application may be applied to the scene shown in fig. 1. Specifically, a terminal 10 acquires an initial video to be processed and sends the initial video to a server 20; the server 20 performs shot segmentation on the initial video to obtain a plurality of sub-segments, determines, for each sub-segment, an object identifier and an emotion label corresponding to the sub-segment, and determines at least one target sub-segment; the server 20 then acquires target background music corresponding to the emotion label sequence, synthesizes the target background music and the target sub-segments to obtain the target video, and returns the target video to the terminal 10.
In the scenario shown in fig. 1, the video processing method may be performed in the terminal and the server, or may be performed only in the server or the terminal.
Those skilled in the art will appreciate that the "terminal" used herein may be a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), a laptop, a desktop computer, a smart TV, a smart watch, a smart in-vehicle device, etc.; the "server" may be implemented as an independent server, a server cluster composed of a plurality of servers, or a cloud server.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 2, a video processing method is provided, where the method may be executed by a terminal or a server, or may be executed by both the terminal and the server, and may include the following steps:
step S201, obtaining an initial video to be processed, and performing video split-mirror on the initial video to obtain a plurality of sub-segments.
Shot segmentation (also called split-mirror or scene segmentation) refers to splitting the initial video at shot changes, so as to obtain sub-segments corresponding to different scenes.
Specifically, the initial video may be segmented into shots using PySceneDetect, an open-source library that detects and analyzes scene changes in a video and is used here to perform scene segmentation on the video.
As shown in fig. 3, an initial video of 6 frames in total can be divided into 2 shots 301 and 302; persons A and B appear in shot 301, and another person C appears in shot 302.
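The shot-segmentation step described above can be prototyped in a few lines. The sketch below assumes the PySceneDetect 0.6+ API; the video path and detector threshold are illustrative placeholders rather than values taken from the patent.

```python
# Minimal sketch of the shot-segmentation step using PySceneDetect (0.6+ API).
# The video path and detector threshold are placeholders, not values from the patent.
from scenedetect import detect, ContentDetector

def split_into_shots(video_path: str, threshold: float = 27.0):
    """Return a list of (start_seconds, end_seconds) pairs, one per detected shot."""
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]

if __name__ == "__main__":
    shots = split_into_shots("initial_video.mp4")
    for i, (start, end) in enumerate(shots):
        print(f"sub-segment {i}: {start:.2f}s - {end:.2f}s")
```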
Step S202, aiming at each sub-segment, determining an object identifier and an emotion label corresponding to the sub-segment.
The object may be a person, and the object identifier corresponding to a sub-segment may be the identifier of the person who mainly appears in the sub-segment. The person identifier may be used to indicate the person's identity; for example, if the initial video is a TV series, the person identifier may be a character name.
Wherein the emotion tag may be a category for representing the emotion of the subject, for example, the emotion of the person may be happy, angry, sad, surprised, and the like.
Specifically, the object identifier can be determined by performing face recognition on the sub-segment, and the emotion label can be determined by a neural network that recognizes facial emotion in images. The processes of determining the object identifier and the emotion label are explained in detail below.
Step S203, determining at least one target sub-segment from at least one sub-segment; the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags includes at least one target emotion tag.
The preset emotion tag sequence may be a randomly generated sequence or a user-defined emotion tag sequence that needs to be generated, and the emotion tag sequence may include at least one target emotion tag.
For example, if the target object identifier is "Character A" and the emotion label sequence is "angry - happy - sad", then the three determined target sub-segments are all segments of "Character A", and their emotion labels are "angry", "happy", and "sad", respectively.
Specifically, the object identifier of each sub-segment may be matched against the target object identifier, and the emotion label of each sub-segment may be matched against each target emotion label in the emotion label sequence, so as to obtain the target sub-segments, as sketched below.
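As a concrete illustration of step S203, the following sketch filters annotated sub-segments by a target object identifier and a target emotion label sequence. The dictionary fields and helper name are assumptions made for the example, not part of the patent text.

```python
# Illustrative sketch (not from the patent) of step S203: for each target emotion label
# in the sequence, pick a sub-segment of the target character carrying that emotion.
from typing import List, Optional

def select_target_subsegments(sub_segments: List[dict],
                              target_object_id: str,
                              emotion_sequence: List[str]) -> List[Optional[dict]]:
    """sub_segments: [{'clip': ..., 'object_id': 'Character A', 'emotion': 'happy'}, ...]"""
    selected = []
    for target_emotion in emotion_sequence:
        match = next((s for s in sub_segments
                      if target_object_id in s["object_id"]   # handles "A + B" identifiers
                      and s["emotion"] == target_emotion), None)
        selected.append(match)  # None if no sub-segment fits this emotion
    return selected

# e.g. select_target_subsegments(segs, "Character A", ["angry", "happy", "sad"])
```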
And step S204, acquiring target background music corresponding to the emotion label sequence, and synthesizing the target background music and the target sub-segments to obtain a target video.
Specifically, a music database may be preset, a plurality of background music is set in the music database, each background music is set with a corresponding emotion tag, and the target background music is determined from the background music database.
Specifically, the target sub-segment may be synthesized into the target segment, and then the target segment and the target background music may be synthesized into the target video.
In the above embodiment, a plurality of sub-segments are obtained through shot segmentation, an object identifier and an emotion label are determined for each sub-segment, target sub-segments that match the target object identifier and the target emotion labels are determined from the plurality of sub-segments, and the target sub-segments are synthesized with the target background music to obtain the target video. Because the emotion label sequence of the target background music corresponds to the emotion labels of the target sub-segments, the emotional match between the video pictures and the background music in the synthesized target video is improved. In addition, target videos that match different target object identifiers and target emotion labels can be generated on demand, and are not limited to a single object.
The process of identifying the object identification for each sub-segment is described below in conjunction with specific embodiments.
As shown in fig. 4, the determining, for each sub-segment in step S202, an object identifier corresponding to the sub-segment may include:
step S210, for each sub-segment, identifying at least one object appearing in the sub-segment.
Specifically, the identifying of at least one object appearing in the subfragment of step S210 includes:
(1) carrying out image detection on the video frame image of the sub-segment, and determining a target vector of the sub-segment;
(2) and respectively matching the target vectors with the standard target vector of at least one object corresponding to the initial video to determine the objects appearing in the sub-segments.
The object may be a person, the image detection may be face recognition, and the target vector may be a face vector.
Specifically, a face recognition network may be used to perform face recognition on the video frame images of a sub-segment to obtain the face vectors appearing in the sub-segment. Each face vector may then be matched against the standard face vectors of the initial video (i.e., the standard target vectors) to determine the objects appearing in the sub-segment.
Specifically, a standard image of the object may be extracted from the initial video in advance to generate a corresponding standard target vector.
For example, the initial video is a television show, a front face image of a person may be extracted from the television show, and a standard target vector of the person may be generated based on the front face image.
Specifically, a cosine similarity algorithm may be adopted: if the similarity between the target vector and a standard target vector is greater than a preset similarity threshold, the object corresponding to that standard target vector is taken as the object of the target vector.
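A minimal sketch of this matching rule, assuming plain NumPy vectors for the face embeddings; the 0.5 similarity threshold is only a placeholder for the preset threshold mentioned above.

```python
# Sketch of matching a detected face vector against the standard target vectors
# using cosine similarity with a preset threshold. Names and threshold are assumptions.
import numpy as np

def match_object(face_vec: np.ndarray, standard_vecs: dict, sim_threshold: float = 0.5):
    """standard_vecs maps object identifier -> standard target vector."""
    best_id, best_sim = None, -1.0
    for obj_id, std_vec in standard_vecs.items():
        sim = float(np.dot(face_vec, std_vec) /
                    (np.linalg.norm(face_vec) * np.linalg.norm(std_vec) + 1e-12))
        if sim > best_sim:
            best_id, best_sim = obj_id, sim
    return best_id if best_sim > sim_threshold else None  # None: no object matched
```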
Step S220, determining an object identifier corresponding to the sub-segment based on the at least one object appearing.
Specifically, an object that appears in the sub-segment for only a short time or in only a few frames should not be counted as a valid object of the sub-segment.
For example, if the initial video is a TV series, a character may appear in only one frame of a certain sub-segment; such a character may simply be a passer-by and therefore should not be counted as a valid object of the sub-segment.
Specifically, the determining, based on the at least one object appearing in step S220, an object identifier corresponding to the sub-segment may include:
(1) determining a valid object of the at least one object that appears;
(2) and setting the identity corresponding to the effective object as the object identity corresponding to the sub-segment.
Wherein, the valid object is an object which can be used for representing the content of the sub-segment.
Specifically, determining a valid object of the at least one object that appears may include:
a. determining a total number of video frame images in the sub-segment;
b. determining a first number of video frame images in which an object appears in a sub-segment;
c. and if the ratio of the first quantity to the total quantity is larger than a first preset ratio, setting the object as a valid object.
It will be appreciated that a sub-segment may have more than one object identifier; for example, if Character A and Character B are both valid objects in a sub-segment, the object identifier of the sub-segment is "Character A + Character B". A minimal sketch of this rule is given below.
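The sketch assumes per-frame lists of detected object IDs as input; the 0.5 ratio stands in for the first preset ratio and is not a value from the patent.

```python
# Sketch of the valid-object rule: an object counts as valid only if it appears in
# more than a preset fraction of the sub-segment's video frame images.
from collections import Counter
from typing import List, Optional

def valid_objects(frame_object_ids: List[List[str]], min_ratio: float = 0.5) -> List[str]:
    """frame_object_ids: for each video frame image, the list of object IDs detected in it."""
    total_frames = len(frame_object_ids)
    counts = Counter(obj for frame in frame_object_ids for obj in set(frame))
    return [obj for obj, n_frames in counts.items()
            if n_frames / total_frames > min_ratio]

def subsegment_object_id(frame_object_ids: List[List[str]]) -> Optional[str]:
    valid = valid_objects(frame_object_ids)
    return " + ".join(sorted(valid)) if valid else None  # e.g. "Character A + Character B"
```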
In the above embodiment, the target vector of the sub-segment is obtained by performing image detection on the video frame image of the sub-segment, and the object appearing in the sub-segment is determined by matching the target vector with the standard target vector of the object, so that the accuracy of object identification can be improved.
The following describes a process of determining an object identifier corresponding to a sub-segment with reference to an example.
In one example, where the initial video is a television show, the process of identifying an object identification for each sub-segment may include the steps of:
1) face detection is performed on the sub-segments (i.e., the image detection described above) using the RetinaFace open-source face detection model;
2) a face embedding (i.e., the target vector) is extracted from each detected face box using the InsightFace open-source face recognition model;
3) for the TV series, frontal images of the target characters to be edited (e.g., the male lead and the female lead) are collected, the face recognition model extracts an embedding for each (i.e., the standard target vector of the object), and a face seed database is built for the target characters, recording person ID - face embedding pairs (i.e., establishing the relationship between each object identifier and its standard target vector);
4) the face embeddings detected in the images of all shots are compared with the seed database embeddings using cosine similarity, and a query face whose similarity is greater than a specified threshold thr is assigned the person ID (i.e., object identifier) of the matching seed entry;
5) for each shot, a face whose appearance ratio in the shot (number of video frame images in which the face appears / total number of images in the sub-segment) is greater than a specified threshold thrFace (i.e., the first preset ratio) is taken as a face ID (i.e., object identifier) of that shot;
6) following step 5), the face IDs of all shots are obtained (i.e., the object identifiers of all sub-segments are determined).
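As an illustration of steps 1) to 3), the sketch below builds the person ID to face embedding seed database with InsightFace's FaceAnalysis pipeline, which bundles face detection and embedding extraction; the directory layout (one frontal image per character, file stem used as the person ID) is an assumption made for the example.

```python
# Hedged sketch of building the person ID -> face embedding seed database.
# The "seeds/<person_id>.jpg" layout is an assumption for illustration only.
import cv2
from pathlib import Path
from insightface.app import FaceAnalysis

app = FaceAnalysis()     # loads detection + recognition models
app.prepare(ctx_id=0)    # ctx_id=0 -> first GPU, -1 -> CPU

def build_seed_database(seed_dir: str) -> dict:
    """Each image holds one frontal face of a target character; file stem = person ID."""
    seeds = {}
    for img_path in Path(seed_dir).glob("*.jpg"):
        faces = app.get(cv2.imread(str(img_path)))
        if faces:
            seeds[img_path.stem] = faces[0].normed_embedding  # standard target vector
    return seeds
```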
The above examples illustrate the process of determining the object identification of a sub-segment and the process of emotion labeling is described below in connection with an embodiment.
A possible implementation manner is provided in the embodiment of the present application, and the determining the emotion label corresponding to the sub-segment in step S202 may include:
(1) determining an emotion label of each video frame image in the sub-segment;
(2) and setting the emotion label with the highest occurrence frequency in the video frame image of the sub-segment as the emotion label of the sub-segment.
Specifically, the prominence of an emotion label may be measured by its frequency of occurrence, and the emotion label with the highest frequency of occurrence may be used as the emotion label of the sub-segment.
For example, if the video frame images of a sub-segment have 10 frames in total, of which 6 frames have the emotion label "happy", 2 frames have the emotion label "angry", and 2 frames have the emotion label "sad", then the emotion label of the sub-segment may be set to "happy".
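A simple majority vote implements this rule; the sketch below is illustrative only.

```python
# Sketch of the per-sub-segment emotion label: majority vote over per-frame labels.
from collections import Counter

def subsegment_emotion(frame_emotions: list) -> str:
    """frame_emotions: the emotion label predicted for each video frame image."""
    return Counter(frame_emotions).most_common(1)[0][0]

# e.g. subsegment_emotion(["happy"]*6 + ["angry"]*2 + ["sad"]*2) -> "happy"
```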
A possible implementation manner is provided in the embodiment of the present application, and the acquiring of the target background music corresponding to the emotion label sequence in step S204 may include:
(1) determining a target emotion label with the highest frequency of occurrence in the emotion label sequence;
(2) acquiring target background music corresponding to a target emotion label with the highest occurrence frequency from a preset music database; the music database is provided with a plurality of background music, and each background music is provided with a corresponding emotion tag.
Specifically, the target emotion tag with the highest frequency of occurrence may be used as the overall tag of the target video to be synthesized, and the target background music corresponding to the target emotion tag with the highest frequency of occurrence may be selected.
Specifically, a music database may be pre-constructed, a plurality of background music may be set, and each background music may be set with a corresponding emotion tag.
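A minimal sketch of this lookup, assuming a simple in-memory dictionary as the music database; the emotion keys and file names are illustrative, not from the patent.

```python
# Sketch of choosing the target background music: take the most frequent emotion label in
# the target emotion label sequence and look it up in a preset emotion -> music database.
from collections import Counter
import random

music_database = {                      # illustrative contents only
    "happy": ["upbeat_01.mp3", "upbeat_02.mp3"],
    "sad":   ["piano_01.mp3"],
    "angry": ["drums_01.mp3"],
}

def pick_background_music(emotion_sequence: list) -> str:
    dominant = Counter(emotion_sequence).most_common(1)[0][0]
    return random.choice(music_database[dominant])

# e.g. pick_background_music(["angry", "happy", "happy"]) -> a track tagged "happy"
```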
In this embodiment, a possible implementation manner is provided, before the step S204 of synthesizing the target video based on the target background music and the target sub-segment, the method may further include: and determining the emotion scale of each video frame image in each target sub-segment.
Wherein the emotional scale may be an emotional intensity of the target.
For example, the mood scale may include degrees of calm, general, violent, etc.
Specifically, the emotion scale of each video frame image can be determined by an emotion recognition network: the video frame images are input to the emotion recognition network, and the higher the probability of the output emotion label, the larger the corresponding emotion scale.
The synthesizing of the target video based on the target background music and the target sub-segment in step S204 may include:
(1) for each target sub-segment, the target sub-segment is clipped based on a target emotion scale, so that the emotion scale of the starting video frame image of the clipped target sub-segment accords with the target emotion scale;
(2) and synthesizing the target background music and the clipped target sub-segments to obtain the target video.
Wherein the target emotion measure may be an emotion measure that the target sub-segment is required to have.
In a specific implementation, for each target sub-segment, if the emotion scale of the starting video frame image does not match the target emotion scale, the target sub-segment may be clipped so that the emotion scale of its starting video frame image matches the target emotion scale.
As shown in fig. 5, fig. 5 shows how the emotion scale changes across the video frame images of a target sub-segment. The emotion scale of the video frame image corresponding to 501 in the figure matches the target emotion scale, so the video frame images before 501 can be removed and the video frame image 501 taken as the starting video frame image; the hatched part in the figure is the clipped target sub-segment.
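The clipping rule of fig. 5 can be expressed as dropping leading frames until the first frame whose emotion scale meets the target. The sketch below assumes per-frame emotion scales are already available; keeping the whole segment when no frame qualifies is an assumption of the example, not a rule stated in the patent.

```python
# Sketch of the clipping rule: the clipped sub-segment starts at the first video frame
# image whose emotion scale reaches the target emotion scale.
from typing import List

def clip_by_emotion_scale(frame_scales: List[float], target_scale: float) -> slice:
    """frame_scales: emotion scale (e.g. classifier probability) of each video frame image.
    Returns the slice of frames to keep; keeps everything if no frame qualifies."""
    for i, scale in enumerate(frame_scales):
        if scale >= target_scale:
            return slice(i, len(frame_scales))
    return slice(0, len(frame_scales))
```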
In this embodiment, a possible implementation manner is provided, before the step S204 of synthesizing the target video based on the target background music and the target sub-segment, the method may further include: at least one close-up video is generated based on the target sub-segment.
The close-up video includes a close-up picture of the object in the target sub-segment, and the close-up video may be a video specifically showing details of the object.
If the object is a person, the close-up video may be a face close-up video, showing details of the person's face.
In a particular implementation, it may be determined whether the target sub-segment already includes a close-up video; if it does not, a close-up video may be generated based on the target sub-segment.
Specifically, video frame images in the target sub-segment may be selected for close-up recognition so as to generate the close-up video; a hedged sketch is given below.
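The function below crops an enlarged box around a detected face to produce a close-up frame; the face box is assumed to come from the earlier face-detection step, and the 1.8x expansion factor is an illustrative choice, not a value from the patent.

```python
# Sketch of producing a face close-up frame by cropping an enlarged face box.
import numpy as np

def crop_close_up(frame: np.ndarray, face_box, expand: float = 1.8) -> np.ndarray:
    """face_box: (x1, y1, x2, y2) pixel coordinates of the detected face."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = face_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * expand / 2, (y2 - y1) * expand / 2
    x1, x2 = int(max(cx - half_w, 0)), int(min(cx + half_w, w))
    y1, y2 = int(max(cy - half_h, 0)), int(min(cy + half_h, h))
    return frame[y1:y2, x1:x2]
```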
The synthesizing of the target video based on the target background music and the target sub-segment in step S204 may include:
and synthesizing the target video based on the target background music, the target sub-segment and the close-up video.
Specifically, whether the close-up video is generated from the target sub-segment or is already contained in the target sub-segment, the close-up video also conforms to the target object identifier and the target emotion.
In this embodiment, close-up recognition is performed on the target sub-segment to generate a close-up video, and the target video is synthesized from the target background music, the target sub-segment, and the close-up video, so that the target video contains close-ups of the object; the emotion changes of the object in the target video can thus be displayed more clearly, improving the display effect of the target video.
A possible implementation manner is provided in the embodiment of the present application, and the synthesizing of the target background music and the target sub-segment in step S204 to obtain the target video may include:
(1) determining the sequence of each target emotion tag in the emotion tag sequence;
(2) synthesizing at least one target sub-segment based on the sequence of each target emotion label to obtain a target segment;
(3) and synthesizing the target segment and the target background music to obtain the target video.
Taking fig. 6 as an example, if the emotion label sequence contains the emotion changes "angry - happy - sad", the target sub-segments may be sorted according to the emotion changes in the emotion label sequence: the emotion label of target sub-segment 601 is "angry", that of target sub-segment 602 is "happy", and that of target sub-segment 603 is "sad". The sorted target sub-segments are synthesized to generate the target segment, and the target video is then obtained by synthesizing the target segment with the target background music 604.
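A minimal sketch of this final assembly, written against the MoviePy 1.x style API; the clip and music file names are placeholders, and the clips are assumed to already be ordered according to the emotion label sequence.

```python
# Sketch of the final assembly (fig. 6): concatenate the ordered target sub-segments
# into the target segment, then lay the target background music underneath.
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

def synthesize_target_video(ordered_clip_paths, music_path, out_path="target_video.mp4"):
    clips = [VideoFileClip(p) for p in ordered_clip_paths]   # already in emotion order
    target_segment = concatenate_videoclips(clips)           # the "target segment"
    music = AudioFileClip(music_path).subclip(0, target_segment.duration)
    target_segment.set_audio(music).write_videofile(out_path)

# e.g. synthesize_target_video(["angry.mp4", "happy.mp4", "sad.mp4"], "theme_01.mp3")
```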
In order to better understand the above video processing method, as shown in fig. 7, an example of the video processing method of the present invention is set forth in detail below:
in an example, taking an object in a video as a person as an example, the video processing method provided by the present application may include the following steps:
1) acquiring an initial video to be processed;
2) performing video split-mirror on the initial video to obtain a plurality of sub-segments;
3) carrying out face recognition on each sub-segment, and obtaining face close-up;
4) recognizing the face emotion of each sub-segment, namely an emotion label;
5) selecting a target person, namely determining a target object identifier;
6) selecting emotion, namely determining a target emotion label;
7) determining target sub-segments from the plurality of sub-segments based on the target object identifier and the target emotion label, where the target sub-segments at this point contain face close-ups, i.e., the "character + emotion" material collection shown in the figure;
8) constructing a music database, i.e., the emotion-music material library shown in the figure;
9) acquiring target background music corresponding to the emotion label sequence, namely selecting a piece of music under the emotion shown in the figure;
10) and synthesizing the target sub-segment and the target background music to obtain the target video.
In order to better understand the above video processing method, an application scenario of the present application will be explained below with reference to an example.
According to the video processing method provided by the present application, a plurality of sub-segments are obtained through shot segmentation, an object identifier and an emotion label are determined for each sub-segment, target sub-segments that match the target object identifier and the target emotion labels are determined from the plurality of sub-segments, and the target sub-segments are synthesized with the target background music to obtain the target video. Because the emotion label sequence of the target background music corresponds to the emotion labels of the target sub-segments, the emotional match between the video pictures and the background music in the synthesized target video is improved. In addition, target videos that match different target object identifiers and target emotion labels can be generated on demand, and are not limited to a single object.
Furthermore, the target vector of a sub-segment is obtained by performing image detection on the video frame images of the sub-segment, and the objects appearing in the sub-segment are determined by matching the target vector against the standard target vectors of the objects, which improves the accuracy of object identification. Further, close-up recognition is performed on the target sub-segment to generate a close-up video, and the target video is synthesized from the target background music, the target sub-segment, and the close-up video, so that the target video contains close-ups of the object; the emotion changes of the object in the target video can thus be displayed more clearly, improving the display effect of the target video.
A possible implementation manner is provided in the embodiment of the present application. As shown in fig. 8, a video processing apparatus 80 is provided, and the video processing apparatus 80 may include: a shot segmentation module 801, a first determining module 802, a second determining module 803, and a synthesizing module 804, wherein,
the shot segmentation module 801 is configured to acquire an initial video to be processed and perform shot segmentation on the initial video to obtain a plurality of sub-segments;
a first determining module 802, configured to determine, for each sub-segment, an object identifier and an emotion label corresponding to the sub-segment;
a second determining module 803, configured to determine at least one target sub-segment from the at least one sub-segment; wherein, the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags includes at least one target emotion tag;
and the synthesizing module 804 is configured to acquire target background music corresponding to the emotion label sequence, and to synthesize the target background music and the target sub-segments to obtain a target video.
In this embodiment, a possible implementation manner is provided, and when determining, for each sub-segment, an object identifier corresponding to the sub-segment, the first determining module 802 is specifically configured to:
for each sub-segment, identifying at least one object present in the sub-segment;
based on the at least one object that appears, an object identification corresponding to the sub-segment is determined.
In an embodiment of the present application, a possible implementation manner is provided, and when identifying at least one object appearing in a sub-segment, the first determining module 802 is specifically configured to:
carrying out image detection on the video frame image of the sub-segment, and determining a target vector of the sub-segment;
and respectively matching the target vectors with the standard target vector of at least one object corresponding to the initial video to determine the objects appearing in the sub-segments.
In an embodiment of the present application, a possible implementation manner is provided, where the first determining module 802 is specifically configured to, when determining, based on at least one object that appears, an object identifier corresponding to a sub-segment:
determining a valid object of the at least one object that appears;
and setting the identity corresponding to the effective object as the object identity corresponding to the sub-segment.
In an embodiment of the present application, a possible implementation manner is provided, and when determining a valid object in at least one object that appears, the first determining module 802 is specifically configured to:
determining a total number of video frame images in the sub-segment;
determining a first number of video frame images in which an object appears in a sub-segment;
and if the ratio of the first quantity to the total quantity is larger than a first preset ratio, setting the object as a valid object.
In an embodiment of the present application, a possible implementation manner is provided, and when determining the emotion label corresponding to the sub-segment, the first determining module 802 is specifically configured to:
determining an emotion label of each video frame image in the sub-segment;
and setting the emotion label with the highest occurrence frequency in the video frame image of the sub-segment as the emotion label of the sub-segment.
In the embodiment of the present application, a possible implementation manner is provided, and when obtaining the target background music corresponding to the emotion tag sequence, the synthesizing module 804 is specifically configured to:
determining a target emotion label with the highest frequency of occurrence in the emotion label sequence;
acquiring target background music corresponding to a target emotion label with the highest occurrence frequency from a preset music database; the music database is provided with a plurality of background music, and each background music is provided with a corresponding emotion tag.
The embodiment of the present application provides a possible implementation manner, further including a third determining module, configured to:
determining the emotion scale of each video frame image in each target sub-segment aiming at each target sub-segment;
when the target video is obtained by synthesizing the target background music and the target sub-segment, the synthesizing module is specifically configured to:
for each target sub-segment, the target sub-segment is clipped based on a target emotion scale, so that the emotion scale of the starting video frame image of the clipped target sub-segment accords with the target emotion scale;
and synthesizing the target background music and the clipped target sub-segments to obtain the target video.
The embodiment of the present application provides a possible implementation manner, further including a generating module, configured to:
generating at least one close-up video based on the target sub-segment; wherein the close-up video comprises a close-up picture of the object in the target sub-segment;
when the target video is obtained by synthesizing the target background music and the target sub-segment, the synthesizing module is specifically configured to:
and synthesizing the target video based on the target background music, the target sub-segment and the close-up video.
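A possible sketch of producing one close-up picture from a frame, assuming an object bounding box is supplied by an upstream detector (not shown); the 20% margin is an arbitrary illustrative choice.

```python
import numpy as np

def close_up_crop(frame: np.ndarray, box: tuple[int, int, int, int],
                  margin: float = 0.2) -> np.ndarray:
    """Crop a close-up picture of the object from one frame; box is
    (x1, y1, x2, y2) in pixel coordinates from an upstream detector."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    return frame[max(0, y1 - dy):min(h, y2 + dy),
                 max(0, x1 - dx):min(w, x2 + dx)]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)          # dummy frame
print(close_up_crop(frame, (500, 200, 700, 500)).shape)   # -> (420, 280, 3)
```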
In an embodiment of the present application, a possible implementation manner is provided, and when a target video is obtained by synthesizing target background music and a target sub-segment, the synthesizing module 804 is specifically configured to:
determining the sequence of each target emotion tag in the emotion tag sequence;
synthesizing at least one target sub-segment based on the sequence of each target emotion label to obtain a target segment;
and synthesizing the target segment and the target background music to obtain the target video.
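One way to arrange the target sub-segments according to the order of the tags in the emotion label sequence, assuming each target sub-segment has already been grouped under its emotion tag; the resulting ordered list can then be concatenated with the target background music.

```python
def order_segments_by_tags(segments_by_tag: dict[str, list[str]],
                           emotion_sequence: list[str]) -> list[str]:
    """Arrange the target sub-segments in the order in which their emotion
    tags occur in the emotion tag sequence."""
    ordered: list[str] = []
    for tag in emotion_sequence:
        ordered.extend(segments_by_tag.get(tag, []))
    return ordered

print(order_segments_by_tags({"sad": ["shot3"], "happy": ["shot1", "shot2"]},
                             ["happy", "sad"]))  # -> ['shot1', 'shot2', 'shot3']
```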
According to the video processing apparatus, video split-mirror (shot segmentation) is performed to obtain a plurality of sub-segments, the object identifier and the emotion label of each sub-segment are determined, the target sub-segments that match the target object identifier and the target emotion labels are determined from the plurality of sub-segments, and the target sub-segments are synthesized with the target background music to obtain the target video; since the emotion label sequence of the target background music corresponds to the emotion labels of the target sub-segments, the emotion matching degree between the video pictures and the background music in the synthesized target video is improved. In addition, target videos that match different target object identifiers and target emotion labels can be generated on demand, rather than being limited to a single object.
Furthermore, the target vector of a sub-segment is obtained by performing image detection on the video frame images of the sub-segment, and the objects appearing in the sub-segment are determined by matching the target vector against the standard target vectors of the objects, which can improve the accuracy of object identification. Further, close-up recognition is performed on the target sub-segments to generate a close-up video, and the target video is synthesized based on the target background music, the target sub-segments and the close-up video, so that the target video contains close-ups of the object, the emotion changes of the object can be displayed more clearly, and the display effect of the target video is improved.
The video processing apparatus according to the embodiments of the present disclosure may execute the video processing method provided by the embodiments of the present disclosure, and the implementation principles are similar. The actions performed by each module of the video processing apparatus correspond to the steps of the video processing method according to the embodiments of the present disclosure; for a detailed functional description of each module, reference may be made to the description of the corresponding video processing method shown above, which is not repeated here.
Based on the same principle as the method shown in the embodiments of the present disclosure, an embodiment of the present disclosure also provides an electronic device, which may be a terminal or a server. The electronic device may include, but is not limited to, a processor and a memory: the memory is used for storing computer operation instructions, and the processor is used for executing the video processing method shown in the embodiments by calling the computer operation instructions. Compared with the prior art, the video processing method can generate, on demand, target videos that match different target object identifiers and target emotion labels, rather than being limited to a single object.
In an alternative embodiment, an electronic device is provided. As shown in fig. 9, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004. Note that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disc storage medium or another magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the video processing method can generate, on demand, target videos that match different target object identifiers and target emotion labels, rather than being limited to a single object.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the following:
acquiring an initial video to be processed, and performing video split-mirror (shot segmentation) on the initial video to obtain a plurality of sub-segments;
for each sub-segment, determining an object identifier and an emotion label corresponding to the sub-segment;
determining at least one target sub-segment from the at least one sub-segment; the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags includes at least one target emotion tag;
and acquiring target background music corresponding to the emotion label sequence, and synthesizing the target background music and the target sub-segments to obtain a target video.
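For orientation only, the following self-contained Python sketch strings the four steps above together; every helper in it (split_mirror, label_segment, pick_music, synthesize) is a placeholder written for this sketch and does not correspond to any API in the disclosure.

```python
from collections import Counter

def split_mirror(video: str) -> list[str]:
    # Placeholder shot segmentation: pretend the video splits into 4 shots.
    return [f"{video}#shot{i}" for i in range(4)]

def label_segment(segment: str) -> tuple[str, str]:
    # Placeholder recognition: returns (object identifier, emotion label).
    return "person_1", "happy"

def pick_music(emotion_sequence: list[str]) -> str:
    # Placeholder music lookup keyed by the most frequent emotion tag.
    return f"music_for_{Counter(emotion_sequence).most_common(1)[0][0]}.mp3"

def synthesize(segments: list[str], music: str) -> str:
    # Placeholder muxing: just describes what would be combined.
    return f"target_video({' + '.join(segments)} | {music})"

def process_video(video: str, emotion_sequence: list[str], target_id: str) -> str:
    segments = split_mirror(video)                                    # step 1
    labelled = [(s, *label_segment(s)) for s in segments]             # step 2
    targets = [s for s, obj, emo in labelled
               if obj == target_id and emo in emotion_sequence]       # step 3
    return synthesize(targets, pick_music(emotion_sequence))          # step 4

print(process_video("input.mp4", ["happy", "sad"], "person_1"))
```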
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation on the module itself; for example, the synthesizing module may also be described as a "module for synthesizing a target video".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of the features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims (13)

1. A video processing method, comprising:
acquiring an initial video to be processed, and performing video split-mirror on the initial video to obtain a plurality of sub-segments;
for each sub-segment, determining an object identifier and an emotion label corresponding to the sub-segment;
determining at least one target sub-segment from the at least one sub-segment; wherein the emotion label of the target sub-segment corresponds to a preset emotion label sequence; the sequence of emotion tags comprises at least one target emotion tag;
and acquiring target background music corresponding to the emotion label sequence, and synthesizing the target background music and the target sub-segments to obtain a target video.
2. The method of claim 1, wherein determining, for each sub-segment, an object identifier corresponding to the sub-segment comprises:
for each sub-segment, identifying at least one object appearing in the sub-segment;
based on the at least one object that appears, an object identification corresponding to the sub-segment is determined.
3. The video processing method of claim 2, wherein said identifying at least one object appearing in said sub-segment comprises:
carrying out image detection on the video frame image of the sub-segment, and determining a target vector of the sub-segment;
and respectively matching the target vector with a standard target vector of at least one object corresponding to the initial video to determine the objects appearing in the sub-segments.
4. The video processing method according to claim 2, wherein said determining an object identifier corresponding to the sub-segment based on the at least one object that appears comprises:
determining a valid object of the at least one object that appears;
and setting the identity corresponding to the effective object as the object identifier corresponding to the sub-segment.
5. The video processing method of claim 4, wherein the determining a valid object of the at least one object that appears comprises:
determining a total number of video frame images in the sub-segment;
determining a first number of video frame images in which the object appears in the sub-segment;
and if the ratio of the first quantity to the total quantity is greater than a first preset ratio, setting the object as the effective object.
6. The video processing method of claim 1, wherein determining the emotion label corresponding to the sub-segment comprises:
determining an emotion label of each video frame image in the sub-segment;
and setting the emotion label with the highest occurrence frequency in the video frame image of the sub-segment as the emotion label of the sub-segment.
7. The video processing method according to claim 1, wherein said obtaining target background music corresponding to the sequence of emotion labels comprises:
determining a target emotion label with the highest frequency of occurrence in the emotion label sequence;
acquiring target background music corresponding to the target emotion label with the highest occurrence frequency from a preset music database; the music database is provided with a plurality of background music, and each background music is provided with a corresponding emotion tag.
8. The video processing method according to claim 1, wherein before synthesizing the target video based on the target background music and the target sub-segment, the method further comprises:
determining the emotion scale of each video frame image in each target sub-segment aiming at each target sub-segment;
the synthesizing of the target background music and the target sub-segment to obtain the target video comprises:
for each target sub-segment, the target sub-segment is clipped based on a target emotion scale, so that the emotion scale of the starting video frame image of the clipped target sub-segment conforms to the target emotion scale;
and synthesizing the target background music and the clipped target sub-segments to obtain the target video.
9. The video processing method according to claim 1, wherein before synthesizing the target video based on the target background music and the target sub-segment, the method further comprises:
generating at least one close-up video based on the target sub-segment; wherein the close-up video comprises close-up pictures of the objects in the target sub-segment;
the synthesizing of the target background music and the target sub-segment to obtain the target video comprises:
and synthesizing to obtain a target video based on the target background music, the target sub-segment and the close-up video.
10. The video processing method according to claim 1, wherein said synthesizing a target video based on the target background music and the target sub-segment comprises:
determining an order of each target emotion tag in the sequence of emotion tags;
synthesizing the at least one target sub-segment based on the sequence of each target emotion label to obtain a target segment;
and synthesizing the target segment and the target background music to obtain the target video.
11. A video processing apparatus, comprising:
the system comprises a lens splitting module, a processing module and a processing module, wherein the lens splitting module is used for acquiring an initial video to be processed and performing video lens splitting on the initial video to obtain a plurality of sub-segments;
the first determining module is used for determining an object identifier and an emotion label corresponding to each sub-segment;
a second determining module, configured to determine at least one target sub-segment from the at least one sub-segment; wherein the emotion label of the target sub-segment corresponds to a preset emotion label sequence, and the emotion label sequence comprises at least one target emotion label;
and the synthesis module is used for acquiring target background music corresponding to the emotion label sequence and synthesizing the target background music and the target sub-segments to obtain a target video.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the video processing method of any of claims 1-10 when executing the program.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, implements the video processing method according to any one of claims 1 to 10.
CN202110164639.3A 2021-02-05 2021-02-05 Video processing method and device, electronic equipment and readable storage medium Pending CN113572976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110164639.3A CN113572976A (en) 2021-02-05 2021-02-05 Video processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110164639.3A CN113572976A (en) 2021-02-05 2021-02-05 Video processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113572976A true CN113572976A (en) 2021-10-29

Family

ID=78161141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110164639.3A Pending CN113572976A (en) 2021-02-05 2021-02-05 Video processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113572976A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113923521A (en) * 2021-12-14 2022-01-11 深圳市大头兄弟科技有限公司 Video scripting method
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method
CN115022710A (en) * 2022-05-30 2022-09-06 咪咕文化科技有限公司 Video processing method and device and readable storage medium
CN115022710B (en) * 2022-05-30 2023-09-19 咪咕文化科技有限公司 Video processing method, device and readable storage medium
CN116320611A (en) * 2023-04-06 2023-06-23 湖北雅派文化传播有限公司 Audio and video synthesis method and system
CN116320611B (en) * 2023-04-06 2024-05-03 湖南梵映教育科技有限公司 Audio and video synthesis method and system

Similar Documents

Publication Publication Date Title
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
CN109117777B (en) Method and device for generating information
CN111935537A (en) Music video generation method and device, electronic equipment and storage medium
Khan et al. Automatic shadow detection and removal from a single image
CN113572976A (en) Video processing method and device, electronic equipment and readable storage medium
CN109325148A (en) The method and apparatus for generating information
CN111553267B (en) Image processing method, image processing model training method and device
CN111739027B (en) Image processing method, device, equipment and readable storage medium
Patil et al. An unified recurrent video object segmentation framework for various surveillance environments
CN111626126A (en) Face emotion recognition method, device, medium and electronic equipment
CN112989116B (en) Video recommendation method, system and device
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN114041165A (en) Video similarity detection method, device and equipment
CN113825012B (en) Video data processing method and computer device
CN113572981B (en) Video dubbing method and device, electronic equipment and storage medium
CN112804558B (en) Video splitting method, device and equipment
Brown et al. End-to-end visual editing with a generatively pre-trained artist
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CN112312205B (en) Video processing method and device, electronic equipment and computer storage medium
Zhao et al. Micro-expression recognition based on pixel residual sum and cropped gaussian pyramid
CN115482324A (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN115147516A (en) Virtual image video generation method and device, computer equipment and storage medium
CN114064968A (en) News subtitle abstract generating method and system
YM et al. Analysis on Exposition of Speech Type Video Using SSD and CNN Techniques for Face Detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054510

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination