CN113312516B - Video processing method and related device - Google Patents


Info

Publication number
CN113312516B
Authority
CN
China
Prior art keywords
video
text
key
display
target
Prior art date
Legal status
Active
Application number
CN202110558410.8A
Other languages
Chinese (zh)
Other versions
CN113312516A (en)
Inventor
漆跃昕
高帆
叶小瑜
梅晓茸
刘旭东
张梦馨
陈铁军
徐智伟
赵媛媛
李杰
曲贺
袁肇豪
唐小辉
郭勇
王妍
李德智
王昊
张玕
赵士强
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110558410.8A priority Critical patent/CN113312516B/en
Publication of CN113312516A publication Critical patent/CN113312516A/en
Application granted granted Critical
Publication of CN113312516B publication Critical patent/CN113312516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F16/7834: Retrieval of video data characterised by metadata automatically derived from the content, using audio features
    • G06F16/74: Information retrieval of video data; browsing; visualisation therefor
    • G06F16/7837: Retrieval characterised by metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/7844: Retrieval characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867: Retrieval characterised by metadata using information manually generated, e.g. tags, keywords, comments, title and artist information
    • G06Q30/0241: Marketing; Advertisements
    • G06V20/40: Scenes; scene-specific elements in video content
    • G06V20/635: Overlay text, e.g. embedded captions in a TV program
    • G06V40/174: Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Human Computer Interaction (AREA)
  • Accounting & Taxation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present application provide a video processing method and a related device. A key frame containing a display object is parsed from a live broadcast playback video segment in which that display object appears, and advertisement material for the display video is acquired from the audio/video resources of the display video, which shortens the time needed to acquire advertisement material. The extracted advertisement material is then given special effect processing and composited into the display video. This addresses, as far as possible, the lack in the related art of a scheme for intelligently producing high-quality creative display videos in batches.

Description

Video processing method and related device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method and a related device.
Background
With the rapid development of internet technology, browsing videos has become a common part of people's daily lives. Against this background, the display video has emerged; compared with traditional advertisement forms, it is more varied in expression and better at presenting usage scenarios. Producing a display video depends on a designer selecting advertisement elements and laying out their style, which places heavy demands on creative production efficiency and cost. Designing a display video requires the relevant personnel to look through a large amount of material to find the advertisement material used to make it; in other words, the acquisition period for advertisement material is long. Analysis shows that demand for advertisement material is currently large, yet the related art lacks a scheme capable of intelligently producing high-quality creative advertisement material in batches.
Disclosure of Invention
The present application aims to provide a video processing method and a related device, so as to address, as far as possible, the lack in the related art of a scheme capable of intelligently producing high-quality creative display-video material in batches.
In a first aspect, an embodiment of the present application provides a video processing method, where the method includes:
acquiring a target video whose video content contains a plurality of display objects;
performing slicing processing on the target video to obtain video segments of different display objects;
for a target display object among the different display objects, parsing key frames of the target display object from the video segment of the target display object, and generating a display video of the target display object based on the key frames.
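For readers who want a concrete picture of the three steps above, the following is a minimal Python sketch. The segment boundaries, helper names, and file paths are hypothetical placeholders standing in for real recognition results, not part of the disclosed method.

```python
# Hypothetical sketch of the three-step method; segment boundaries are
# hard-coded placeholders standing in for real recognition results.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    display_object: str  # e.g. a commodity shown in the target video
    start_s: float
    end_s: float

def slice_target_video(video_path: str) -> List[Segment]:
    # Step 2: slicing. Real boundaries would come from content detection
    # or display-object tags; these values are illustrative only.
    return [Segment("lipstick", 0.0, 58.0), Segment("face cream", 58.0, 130.0)]

def display_video_segment(video_path: str, target: str) -> Segment:
    # Step 3: pick the target display object's segment; key-frame parsing
    # and rendering of the display video would start from this segment.
    for seg in slice_target_video(video_path):
        if seg.display_object == target:
            return seg
    raise ValueError(f"{target!r} not found in {video_path}")

print(display_video_segment("target_video.mp4", "lipstick"))
```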
In some possible embodiments, the slicing processing is performed on the video to obtain video segments of different display objects, including:
and detecting the content of the target video, and identifying video segments of different display objects.
In some possible embodiments, if the target video includes a display object tag, the slicing processing is performed on the target video to obtain video segments of different display objects, including:
identifying the display object tag in the target video, and performing slicing processing on the target video based on the display object tag to obtain video segments of different display objects, where the display object tag is used to indicate the position of the display object's video content within the target video.
In some possible embodiments, after the generating the presentation video of the target presentation object based on the key frame, the method further comprises:
acquiring advertisement materials from the audio and video resources associated with the key frames;
and carrying out special effect processing on the advertisement materials, and synthesizing the processed advertisement materials into the display video.
In some possible embodiments, the acquiring advertisement materials from the audio-video resources associated with the key frames includes:
identifying key text in the audio and video resources, and taking the key text, the text type of the key text, and the common text as the advertisement material;
the performing special effect processing on the advertisement material includes:
performing special effect processing on the key text according to the text type of the key text, and displaying the common text as captions synchronized with the image content of the display video.
In some possible embodiments, the identifying the key text in the audio-video resource, the text type of the key text, includes:
extracting text information from the voice information of the display video;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as the key text;
and taking the text type corresponding to the matched preset key word as the text type of the key text.
In some possible embodiments, the identifying the key text in the audio-video resource, the text type of the key text, includes:
performing recognition operation on the voice information of the display video to obtain an emphasized word in the voice signal as the key text;
and determining the text type of the key text based on the intonation type of the emphasized word.
In some possible embodiments, the identifying the key text in the audio-video resource, the text type of the key text, includes:
extracting text information from the voice information;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as a key text;
And carrying out recognition operation on the voice information of the display video by adopting a voice signal recognition technology to obtain the intonation type of the key text as the text type of the key text.
In some possible embodiments, after the synthesizing the processed advertising material into the presentation video, the method further comprises:
identifying the expression of the target portrait in the display video;
and adding an expression special effect to the target portrait based on the expression of the target portrait.
In some possible embodiments, after the synthesizing the processed advertising material into the presentation video, the method further comprises:
based on the audio and video resources of the display video, identifying the starting position in the display video at which the target display object is explained;
and adding key information of the target display object for the video frame image corresponding to the starting position, wherein the key information comprises an appearance picture of the target display object and/or text description information of the target display object.
In some possible embodiments, after synthesizing the processed advertising material to the presentation video, the method further comprises:
Displaying an editing interface for the display video;
and responding to the editing operation in the editing interface, and editing the display video.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
a display video parsing module configured to acquire a target video whose video content contains a plurality of display objects;
a material acquisition module configured to perform slicing processing on the target video to obtain video segments of different display objects;
and a display video synthesis module configured to, for a target display object among the different display objects, parse key frames of the target display object from the video segment of the target display object and generate a display video of the target display object based on the key frames.
In some possible embodiments, the slicing processing is performed on the video to obtain video segments of different display objects, and the material acquisition module is configured to:
and detecting the content of the target video, and identifying video segments of different display objects.
In some possible embodiments, if the target video includes a display object tag, the performing the slicing processing on the target video to obtain video segments of different display objects, and the material acquisition module is configured to:
Identifying the display object tag in the target video, and performing slicing processing on the target video based on the display object tag to obtain video segments of different display objects; the display object tag is used for representing the position of the video content of the display object in the target video.
In some possible embodiments, after the display video of the target display object is generated based on the key frames, the display video synthesis module further includes:
the special effect processing unit is configured to acquire advertisement materials from the audio and video resources associated with the key frames;
and carrying out special effect processing on the advertisement materials, and synthesizing the processed advertisement materials into the display video.
In some possible embodiments, the obtaining advertisement material from the audio-video resource associated with the key frame is performed, and the special effect processing unit is configured to:
identifying key texts in the audio and video resources, and using text types of the key texts and common texts as the advertisement materials;
executing the special effect processing on the advertisement material, wherein the special effect processing unit is configured to:
And carrying out special effect processing on the key text according to the text type of the key text, and synchronously displaying the common caption and the image content of the display video in a caption mode.
In some possible embodiments, executing the identifying key text in the audio-video resource, text type of key text, the special effects processing unit is configured to:
extracting text information from the voice information of the display video;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as the key text;
and taking the text type corresponding to the matched preset key word as the text type of the key text.
In some possible embodiments, executing the identifying key text in the audio-video resource, text type of key text, the special effects processing unit is configured to:
performing recognition operation on the voice information of the display video to obtain an emphasized word in the voice signal as the key text;
and determining the text type of the key text based on the intonation type of the emphasized word.
In some possible embodiments, executing the identifying key text in the audio-video resource, text type of key text, the special effects processing unit is configured to:
extracting text information from the voice information;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as a key text;
and carrying out recognition operation on the voice information of the display video by adopting a voice signal recognition technology to obtain the intonation type of the key text as the text type of the key text.
In some possible embodiments, the display video synthesis module further includes:
an expression special-effect unit configured to identify the expression of the target portrait in the display video;
and to add an expression special effect to the target portrait based on the expression of the target portrait.
In some possible embodiments, the display video synthesis module further includes:
an information adding unit configured to identify, based on the audio and video resources of the display video, the starting position in the display video at which the target display object is explained;
And adding key information of the target display object for the video frame image corresponding to the starting position, wherein the key information comprises an appearance picture of the target display object and/or text description information of the target display object.
In some possible embodiments, the apparatus further comprises:
a video editing module configured to display an editing interface for the display video;
and responding to the editing operation in the editing interface, and editing the display video.
In a third aspect, another embodiment of the present application also provides an electronic device, including at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect provided by the embodiments of the present application.
In a fourth aspect, another embodiment of the present application further provides a computer storage medium, where a computer program is stored, where the computer program is configured to make a computer execute the method of the first aspect provided by the embodiment of the present application.
In a fifth aspect, another embodiment of the present application further provides a computer program, where the computer program includes computer instructions for causing a computer to perform the method of the first aspect provided by the embodiments of the present application.
According to the embodiments of the present application, key frames containing a display object are parsed from a live broadcast playback video segment in which the display object appears, so as to obtain a display video of the display object. Advertisement material for the display video is then acquired from the audio/video resources of the display video, and after special effect processing, the extracted material is composited into the display video. In this way, the lack in the related art of a scheme capable of intelligently producing high-quality creative display-video material in batches is addressed as far as possible.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an application environment according to one embodiment of the application;
FIG. 2a is a flowchart of a video processing method according to one embodiment of the present application;
FIG. 2b is a schematic diagram of a video node tag according to one embodiment of the application;
FIG. 2c is a schematic diagram of a key text lift special effect according to one embodiment of the application;
FIG. 2d is a special scene effect diagram of key text according to one embodiment of the application;
FIG. 2e is another special scene effect diagram of key text according to one embodiment of the application;
FIG. 2f is an explosion sticker effect diagram for key text according to one embodiment of the present application;
FIG. 2g is a bubble decal effect diagram of key text according to one embodiment of the present application;
FIG. 2h is a special effect diagram of a word-by-word fly-out of key text from the mouth, according to one embodiment of the application;
FIG. 2i is a special effects diagram of floating key text over a character according to one embodiment of the application;
FIG. 2j is a special effects diagram of key text floating from a video frame according to one embodiment of the present application;
FIG. 2k is a schematic diagram of a key text caption wall special effect according to one embodiment of the present application;
FIG. 2l is a schematic illustration of an expressive special effect according to one embodiment of the present application;
FIG. 2m is another expressive special effect schematic diagram according to one embodiment of the application;
FIG. 2n is a schematic diagram of a product introduction effect according to one embodiment of the present application;
FIG. 2o is a schematic diagram of a text presentation effect of a common caption according to one embodiment of the present application;
FIG. 3a is an editing interface diagram of common caption text according to one embodiment of the present application;
FIG. 3b is an editing interface diagram of key text according to one embodiment of the application;
FIG. 4 is a schematic diagram of a video processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly and thoroughly described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means or, for example, a/B may represent a or B; the text "and/or" is merely an association relation describing the associated object, and indicates that three relations may exist, for example, a and/or B may indicate: the three cases where a exists alone, a and B exist together, and B exists alone, and furthermore, in the description of the embodiments of the present application, "plural" means two or more than two.
In the description of the embodiments of the present application, unless otherwise indicated, the term "plurality" refers to two or more. The preferred embodiments described herein are only for illustrating and explaining the present application and are not intended to limit it; moreover, the embodiments of the present application and the features in the embodiments may be combined with each other without conflict.
To further explain the technical solution provided by the embodiments of the present application, details are described below with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application present the method steps shown in the following embodiments or figures, the method may include more or fewer steps based on routine or non-inventive labor. For steps with no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application. In actual processing, or when executed by a control device, the methods may be performed sequentially or in parallel according to the embodiments or the drawings.
The data (such as the material used to make the display video, etc.) to which the present application relates may be data that is authorized by the user or sufficiently authorized by the parties.
In the related art, producing a display video requires a designer to search a large amount of material based on the user's requirements. Acquiring advertisement material this way takes a long time, and the material used may already have been used elsewhere. To solve these problems, the inventive concept of the present application is as follows: parse the key frames of a display object from a live playback video segment in which the display object appears, to obtain a display video containing that display object; the display video obtained in this way contains the display object's video content. Then extract the audio/video resources of the display video, acquire advertisement material from those resources, perform special effect processing on the material, and composite the processed material into the display video. This method can greatly reduce the time needed to acquire advertisement material, and can process a large number of live playback video segments in parallel to produce high-quality display videos in batches. It thus addresses, as far as possible, the lack in the related art of a scheme capable of intelligently producing high-quality creative display-video material in batches.
A video processing method according to an embodiment of the present application is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application environment according to one embodiment of the application is shown.
As shown in fig. 1, the application environment may include, for example, a storage system 10, a server 20, and a terminal device 30. The terminal device 30 may be any suitable electronic device for network access including, but not limited to, a computer, a notebook computer, a smart phone, a tablet computer, a smart watch, a smart bracelet, or other type of terminal. The storage system 10 is capable of storing accessed media assets, such as web pages, electronic books, audiovisual files, and the like. The server 20 is configured to implement interaction with the terminal device 30, and obtain media resources from the storage system and return the media resources to the terminal device 30.
In implementation, the server 20 performs a parsing operation on the live playback video segment stored in the storage system 10, parses video frame images of the display object from the live playback video segment, and integrates the video frame images to obtain the display video containing the display object.
The server 20 obtains the audio-video resource of the display video from the display video, and obtains the advertisement material corresponding to the audio-video resource by adopting the voice signal recognition technology. The server 20 performs special effects processing on the advertisement materials, and synthesizes the advertisement materials after the special effects processing into the display video. The server 20 transmits the synthesized presentation video to the terminal device 30 through the network 40.
In some possible embodiments, the terminal device 30 may switch the short video for the user or present comment content of the short video for the user based on the user operation.
In some possible embodiments, the terminal device 30, after receiving the presentation video sent by the server 20, can re-edit the special effects processing for the advertisement materials in the presentation video.
The terminal devices 30 of the present application (e.g., between 30_1 and 30_2 or 30_n) may also communicate with each other via the network 40. Network 40 may be a broad network for information transfer and may include one or more communication networks such as a wireless communication network, the internet, a private network, a local area network, a metropolitan area network, a wide area network, or a cellular data network.
In the description of the present application, only a single server or terminal device is described in detail, but it should be understood by those skilled in the art that the single server 20, terminal device 30 and storage system 10 are shown to illustrate that the aspects of the present application relate to the operation of the terminal device, server and storage system. The detailed description of a single terminal device and a single server and storage system is for at least convenience of explanation and does not imply that there are limitations on the number, type, location, etc. of terminal devices and servers. It should be noted that the underlying concepts of the exemplary embodiments of this application are not altered if additional modules are added to or individual modules are removed from the illustrated environment. In addition, although a bi-directional arrow from the storage system 10 to the server 20 is shown in fig. 1 for convenience of explanation, it will be understood by those skilled in the art that the above-described data transmission and reception may be realized through the network 40.
Server 20 may be a single server, a server cluster formed by multiple servers, or a cloud computing center; for example, an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial-intelligence platforms.
In order to facilitate understanding of the video processing method provided by the present application, a live playback video of a commodity recommendation type will be described below as an example. It should be understood that the selection of the live playback video of the commodity recommendation type is only for the convenience of understanding the scheme provided by the present application, and the video content of the live playback video is not limited, and the live playback video of the game type, the science popularization type, the beauty makeup type and the like are all suitable for the scheme.
Fig. 2a is a schematic flow chart of a video processing method according to an embodiment of the present application, including:
step 201: and analyzing a key frame of the display object from a live playback video segment of the display object, and generating a display video of the display object based on the key frame.
A live playback video of the commodity recommendation type is either a long video whose content recommends multiple commodities, or a short clip a user recorded for a particular commodity during the live broadcast. In either case, the live playback video contains content that introduces recommended commodities during the broadcast, so the video content containing a commodity recommendation introduction can be used as the display video.
When obtaining a display video from a live playback video, the number of display objects contained in the playback video must first be determined. In a commodity-recommendation live playback video, the display objects are the commodities recommended and introduced in the video. When executing step 201, feature recognition may be performed frame by frame on the playback video's frames to obtain the video segment in which each commodity appears. Each commodity's segment is then cut out of the live playback video, and each cut segment is a display video.
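As an illustration of the frame-by-frame approach just described, the sketch below groups simulated per-frame recognition labels into per-commodity segments; the label sequence, frame rate, and commodity names are assumptions for demonstration only.

```python
from itertools import groupby

FPS = 25.0
# Simulated per-frame recognition output; a real detector would produce this.
frame_labels = ["lipstick"] * 100 + [None] * 20 + ["face cream"] * 150

def segments_from_labels(labels, fps=FPS):
    segments, idx = [], 0
    for label, run in groupby(labels):
        n = sum(1 for _ in run)
        if label is not None:  # None means no commodity visible in the frame
            segments.append((label, idx / fps, (idx + n) / fps))
        idx += n
    return segments

print(segments_from_labels(frame_labels))
# [('lipstick', 0.0, 4.0), ('face cream', 4.8, 10.8)]
```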
In some possible embodiments, the audio/video resources of the live playback video are analyzed to obtain the time periods in the playback video during which voice introducing a recommended commodity occurs. The live playback video is then cut according to these time periods, and the videos obtained by cutting are the display videos.
The inventors have found that most live playback videos in the related art carry node tags containing a brief description of the video content within each node. From the description in each node tag, the video content played in each node of the live playback video can be quickly understood, as shown in fig. 2b. On this basis, when determining the commodities contained in a live playback video, the commodities and the video segment corresponding to each commodity can be determined from the node tags, and the display videos are then obtained by cutting the live playback video accordingly.
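A hedged sketch of tag-based cutting follows, assuming the node-tag timestamps have already been read from the playback video and that the ffmpeg command-line tool is available; the file names and times are placeholders.

```python
import subprocess

# Node-tag times are placeholders; real values would be read from the
# live playback video's node labels.
node_tags = [("lipstick", "00:00:00", "00:00:58"),
             ("face_cream", "00:00:58", "00:02:10")]

for name, start, end in node_tags:
    subprocess.run(
        ["ffmpeg", "-y", "-i", "playback.mp4",
         "-ss", start, "-to", end, "-c", "copy", f"{name}.mp4"],
        check=True,
    )
```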
A display video is an advertisement form with both visual and audio appeal; its duration should not be too long, so that viewers can easily remember it. On this basis, a display video obtained in the above way may contain invalid content unsuitable for a display video, such as frames in which the commodity to be displayed is not in the picture. The invalid content needs to be removed to ensure the display video has a short duration and a good display effect. In implementation, feature recognition can be performed on the video frames of the display video frame by frame, the identified invalid content removed, and the video segments before and after the invalid content spliced so that the spliced content is complete and smooth, with the duration of the display video controlled within a preset length, for example, within 1 minute.
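The trimming-and-splicing step might look like the following sketch, assuming the moviepy library is available; the valid spans are placeholders that would in practice come from the frame-by-frame feature recognition described above.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

MAX_DURATION_S = 60.0
# Spans (in seconds) where the commodity is actually on screen; in practice
# these come from frame-by-frame feature recognition.
valid_spans = [(0.0, 20.0), (35.0, 70.0)]

clip = VideoFileClip("segment.mp4")
parts = [clip.subclip(a, b) for a, b in valid_spans]
display = concatenate_videoclips(parts)      # splice around invalid content
if display.duration > MAX_DURATION_S:        # keep within the preset length
    display = display.subclip(0, MAX_DURATION_S)
display.write_videofile("display_video.mp4")
```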
It should be noted that the operation of removing invalid content from the display video may be performed after the display video is obtained from the live playback video, or the display video may be parsed from the live playback video after invalid content has been removed from the playback video. The present application does not limit when the invalid-content removal is performed, as long as the resulting display video contains no invalid content.
In view of the related art, the design of the display video requires a designer to review a large amount of advertisement material, and there are cases where the advertisement material used is used by other display videos. According to the application, the advertisement materials are extracted based on the video content of the display video, so that the time cost consumed by searching the advertisement materials is reduced, and the advertisement materials extracted based on the video content of the display video can effectively avoid the problem that the advertisement materials are used by other display videos. The nature of the display video is to better introduce the commodity, and the high-quality advertising materials can attract more viewers to pay attention to the commodity for purchase. In extracting advertising material based on the video content of the presentation video itself, step 202 is performed: and acquiring advertisement materials from the audio and video resources of the display video.
Performing step 202 includes step A1: determining, based on the text information corresponding to the voice signal of the display video, each key text contained in the display video and the text type corresponding to each key text; and step A2: determining, based on the voice signal of the display video, each key text contained in the display video and the text type corresponding to each key text.
When executing step A1, the text content (i.e., text information) corresponding to the video's voice signal can be obtained using voice-signal recognition. The text information contains words used to introduce commodities, build atmosphere, and interact with the audience. In the display video, these words play an important role in helping the audience understand the commodities and in arousing their interest, and with special effect processing they can attract the audience even more effectively. These words can therefore be screened out of the text information as key text.
In the present application, a large number of live playback videos are analyzed to obtain the key texts that commonly appear in them. These key texts are classified based on their function in the live video, yielding the common key texts in live playback videos and the text types corresponding to them. A neural network model is then trained on this experimental data (i.e., the common key texts and their corresponding text types) to construct a preset keyword set from which key text and its text type can be extracted automatically from the text information.
When classifying key texts, words such as "good score" and "second killing" that convey commodity information can be grouped into an information-point class; words such as "put on shelf" and "thank you for your support" used for interacting with the audience during the broadcast into an interaction class; words such as "wild" and "new" applicable to the service industry, and "moisturizing" and "hydrating" applicable to the cosmetics industry, into a commodity-introduction class; special-number words describing commodity data, such as "3 jin" and "1 L", into a special-number class; words such as "pay attention to me" and "follow the live stream" that call for action into an action-call class; and words used to render atmosphere, such as "ou", into an interjection class.
In some possible embodiments, a matching operation is performed between the text information of the display video and the preset keyword set; the key text in the text information that matches the keyword set is identified, and the type of that keyword in the preset keyword set is determined as the text type of the key text.
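A minimal sketch of this matching operation is shown below; the category vocabulary simply mirrors the example words listed above and is not an exhaustive keyword set.

```python
# Illustrative keyword-set matching for key-text extraction; the categories
# and words mirror the examples given in this description.
KEYWORD_SET = {
    "information point": ["good score", "second killing"],
    "interaction": ["put on shelf", "thank you for your support"],
    "action call": ["pay attention to me", "follow the live stream"],
}

def extract_key_texts(text_info: str):
    hits = []
    for text_type, words in KEYWORD_SET.items():
        for w in words:
            if w in text_info:
                hits.append((w, text_type))
    return hits

print(extract_key_texts("new arrival, pay attention to me, second killing starts"))
# [('second killing', 'information point'), ('pay attention to me', 'action call')]
```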
Before step A2 is executed, the present application analyzes a large number of live playback videos, acquires voice signals from their audio/video resources, extracts features from the voice signals, and recognizes the words the anchor stresses in the voice signal. These emphasized words are the key text.
Furthermore, the emphasized words an anchor utters differ across scenes. For example, an anchor may stress a commodity's advantages (e.g., "really good!") and may use repeated stress to arouse the audience's interest (e.g., "last 50! last 50!"). On this basis, emphasized words can be divided into two types, singly emphasized words and repeatedly emphasized words, and different special effect processing applied to each so that the display video has a better display effect.
In some possible embodiments, the speech signal representing the emphasized language is obtained from a large number of live playback videos, and experimental data (i.e., the speech signal of the large number of live playback videos) is subjected to neural network model training, so as to construct an emphasized word recognition network capable of automatically extracting the emphasized word used as the key text and the type of the emphasized word from the speech signal.
In some possible embodiments, the recognition operation is performed on the voice signal of the display video by using the emphasized word recognition network, so as to obtain the emphasized word and the type corresponding to the emphasized word in the voice signal. And taking the emphasized word as a key text, and taking the type corresponding to the emphasized word as the text type of the key text.
Further, the key texts acquired in the two ways A1 and A2 may overlap; for example, the text type of "buy it, buy it" could be either the action-call type or the repeated-emphasis type. Each key text contained in the display video and its corresponding text type can therefore be determined based on both the voice signal of the display video and the text information corresponding to that voice signal.
In implementation, a voice signal recognition technology can be adopted to obtain text information corresponding to a voice signal of the display video. And matching the text information with a preset key word set to obtain a key text in the text information. And acquiring a voice signal corresponding to the key text acquired from the text information, and identifying the emphasized word and the type corresponding to the emphasized word in the voice signal by adopting an emphasized word identification network. And taking the type corresponding to the emphasized word as the text type of the key text.
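The combined flow can be sketched as follows; both the speech-recognition function and the emphasized-word network are hypothetical stubs returning canned values, purely to show how the two signals are joined.

```python
def asr(audio):
    # Stand-in for a speech-recognition model returning timed text spans.
    return [("last 50", 12.0, 13.1), ("welcome everyone", 0.0, 1.5)]

def emphasis_type(audio, start_s, end_s):
    # Stand-in for the emphasized-word recognition network.
    return "repeated emphasis"

def key_texts_with_types(audio, keyword_set):
    results = []
    for text, start_s, end_s in asr(audio):
        if text in keyword_set:  # match against the preset keyword set
            results.append((text, emphasis_type(audio, start_s, end_s), start_s))
    return results

print(key_texts_with_types(audio=None, keyword_set={"last 50"}))
# [('last 50', 'repeated emphasis', 12.0)]
```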
In addition, when key text is acquired in the above manner, the time point of the key text in the display video needs to be determined. In implementation, the display video can be screened frame by frame based on the voice content corresponding to the key text, and the video frame in which the key text appears recorded. The video frame at that time point can then be retrieved based on the time point corresponding to the key text, and the key text, after special effect processing, added to that frame.
After the advertisement material is obtained in step 202, step 203 is executed: and synthesizing the advertisement materials into the display video after special effect processing. In implementation, specific effect processing is needed to be carried out on the text types of the key texts so as to ensure the display effect of the display video.
When performing special effect processing on key text whose text type is the action-call type (such as "pay attention to me"), special effects such as the "lift (sign-holding) effect" and "cartoon character effect" can be set for the key text. According to the video time point corresponding to the key text, the video frame in which the anchor speaks the corresponding voice is found, and the key text, after special effect processing, is placed at a prominent position in that frame according to actual requirements, so as to distinguish it from the common captions and attract the audience's attention. The specific effect of the lift special effect can be seen in fig. 2c.
In addition, when processing action-call texts, special scene effects such as a "firework effect" or an "approach effect" can be added to the video frame corresponding to the key text. In fig. 2d, the firework scene effect is applied to the video frame corresponding to the key text "pay attention to me", and in fig. 2e, the carriage-approach scene effect is applied to the frame corresponding to "pay attention to me"; such effects help render the atmosphere.
When applying special effects to key text whose text type is emphasis, the voice intonation category of the keyword can be identified through voice-signal recognition. Key texts whose intonation is short, heavy, and fast (such as "Tianna", i.e., "oh my!"), key texts delivered word by word with a pause after each word (such as "really good taste"), and key texts delivered in a drawn-out tone (such as "you did not mishear") are distinguished, and special effects are added to each accordingly.
When the voice intonation is expressed as a short, heavy and fast key text, in special effect processing, a special effect of a sticker such as an explosion sticker, a bubble sticker, a flashing sticker and the like can be added to the key text, and a video frame picture when a host talks about the key text is found according to a video time point corresponding to the key text.
The special effect sticker containing the key text can be placed at a remarkable position of the video frame picture so as to achieve the purpose of attracting the attention of audiences. The special effect of the explosive sticker may be as shown in fig. 2f and the special effect of the bubble sticker may be as shown in fig. 2 g.
For key text delivered word by word with pauses, the special effect processing mode can be set according to the length of the key text. When the key text is short, such as "yes" or "really good", the video frame in which the anchor utters it can be found from the corresponding video time point. Human-body recognition locates the anchor's position in the frame, face key-point detection locates the anchor's mouth, the key text is bolded, and it is set to fly out of the anchor's mouth word by word; the specific effect can be seen in fig. 2h. When the key text is longer, such as "lastly, sending a red packet", after acquiring the corresponding video frame, the anchor's body outline is obtained through human-body recognition, the key text is bolded, and it is set to loop around above the outline of the anchor figure; the specific effect can be seen in fig. 2i.
For key text delivered in a drawn-out tone (such as "yes, you did not mishear"), special effects such as a special font or bold enlargement can be set during special effect processing, together with a display direction, and the video frame in which the anchor utters the key text is found from the corresponding video time point. As the anchor speaks the corresponding voice, the key text drifts horizontally across the live frame in the specified display direction; the specific effect can be seen in fig. 2j.
When performing special effect processing on key text whose text type is repeated emphasis (such as "last 50! last 50!"), a caption-wall special effect can be added. In implementation, the video frame in which the anchor utters the key text is found from the corresponding video time point. Because semantic segmentation can perform semantic understanding for each pixel in a video frame, pixels with the same semantics are segmented into the same part; semantic segmentation can therefore be applied to the frame to segment out the anchor's portrait. After the portrait is segmented, the key text with the caption-wall effect is set to drift horizontally across the frame, and the anchor's portrait is added back into the frame as the foreground, ensuring that the caption wall does not block the anchor's image during playback.
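The foreground-compositing idea behind the caption wall can be sketched with a mask, as below; the person mask is faked as a fixed rectangle standing in for a real semantic-segmentation model, and the frame contents are synthetic.

```python
import numpy as np

H, W = 720, 1280

def person_mask(frame):
    # Stand-in for a semantic-segmentation model: pretend the anchor
    # occupies a fixed rectangle in the frame.
    m = np.zeros((H, W), dtype=bool)
    m[100:700, 400:880] = True
    return m

frame = np.full((H, W, 3), 200, dtype=np.uint8)   # fake video frame
caption_layer = frame.copy()
caption_layer[300:340, :, :] = 255                # caption strip drifting across

# Person pixels stay on top, so the caption wall never blocks the anchor.
mask = person_mask(frame)[..., None]              # (H, W, 1) for broadcasting
out = np.where(mask, frame, caption_layer)
print(out[320, 640], out[320, 100])  # anchor pixel stays 200, caption pixel is 255
```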
It should be understood that the special effect processing for the key text in the present application is only an example, and the implementation can set corresponding special effect processing modes for the key text of different text types according to practical situations, which is not limited by the present application.
In some possible embodiments, the caption-wall special effect may be set according to the length and repetition count of the key text, forming a caption wall of n or more rows from the key text's content; the value of n can be set based on the actual effect, for example, text repeated n times forms n rows. An implementation can be seen in fig. 2k.
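A tiny sketch of building the n-row caption wall from the repetition count, under the assumption stated above that text repeated n times forms n rows:

```python
def caption_wall(key_text: str, repeats: int, per_row: int = 4) -> list:
    rows = max(repeats, 1)  # assumed rule: text repeated n times forms n rows
    return [" ".join([key_text] * per_row) for _ in range(rows)]

for row in caption_wall("last 50!", repeats=3):
    print(row)
```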
To suit more application scenarios, special effects can also be added according to the video content of the display video. In implementation, the facial expression of each face in the display video can be identified frame by frame through facial-feature recognition, and special effects such as laughing, crying, or blushing added to the person in the frame based on that expression, according to actual demand. In fig. 2l, a laughing effect is intelligently applied to the anchor's face according to the anchor's current laughing expression; in fig. 2m, a crying effect is intelligently applied according to the current crying expression in the frame.
In some possible embodiments, a point in time when the anchor details the merchandise is identified from an audio-video resource that presents the video, and a video frame of the anchor when the merchandise is detailed is determined based on the point in time. And adding a commodity detail picture corresponding to the explanation commodity in the video frame picture. The commodity detail picture can comprise the appearance of the commodity and the key word introduction of the commodity. The specific effect may be as shown in fig. 2 n.
In addition, the text information of the display video other than the key text can be treated as common caption text. During special effect processing, the common caption text can be displayed as captions in the video frame corresponding to it, according to its video time point. A preset length is set adaptively for each caption line, and when the common caption text reaches the preset length it wraps to a new line. In this way, the captions are displayed in synchronization with the corresponding voice, with adaptive line wrapping. As shown in fig. 2o, when the common caption text exceeds the preset length, it automatically wraps.
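The adaptive wrapping rule can be sketched as below. Wrapping is per character, which suits Chinese captions; an English caption renderer would typically wrap on word boundaries instead. The preset length is an assumed value.

```python
def wrap_caption(text: str, preset_len: int = 16) -> list:
    lines, line = [], ""
    for ch in text:
        line += ch
        if len(line) >= preset_len:  # reached the preset per-line length
            lines.append(line)
            line = ""
    if line:
        lines.append(line)
    return lines

for line in wrap_caption("this lipstick is really moisturizing and long lasting"):
    print(line)
```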
In some possible embodiments, displaying all contents in the text information in a video frame picture corresponding to the text information in a subtitle mode, adaptively setting a preset length of each line of subtitle, and when the current display length of the text information reaches the preset length, displaying the text information in a line-feed mode.
Note that in step 203, when compositing the special-effect-processed advertisement material into the display video, the positions of the people and commodities in each video frame must be determined frame by frame using semantic segmentation, and those positions marked. During compositing, special effects must be added only outside the marked regions. When a special effect moves (such as the "caption wall" drifting across the frame), semantic segmentation is needed to segment the people and commodities out of the frame and use them as the foreground. This processing prevents the special-effect-processed advertisement material from blocking the people and commodities in the picture during display.
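A minimal sketch of the placement check follows; the marked boxes are placeholders standing in for the per-frame person and commodity positions produced by semantic segmentation.

```python
def overlaps(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

marked = [(400, 100, 880, 700),   # anchor's marked region
          (900, 400, 1100, 600)]  # commodity's marked region

def can_place(sticker_box):
    # A static special effect may only go where it overlaps no marked region.
    return not any(overlaps(sticker_box, m) for m in marked)

print(can_place((50, 50, 350, 200)))     # True: clear of person and commodity
print(can_place((850, 450, 1000, 550)))  # False: overlaps the commodity box
```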
Since the display effect of a display video processed automatically in this way may deviate from the user's expectations, the method may be configured, in order to improve user satisfaction, so that after step 203 (performing special effect processing on the advertisement materials and synthesizing them into the display video) an editing interface for the display video is provided to the user. The editing page provides a user-defined editing function, and the display video is edited based on the user's editing operations in the editing page.
In some possible embodiments, the special effect processing manners for the key text and the subtitle text in step 203 are integrated into a special effect material library. The key text and the common subtitle text determined from the text information of the display video are displayed on an editing interface for the key text and an editing interface for the common subtitle text, respectively. Both editing interfaces can invoke the special effect processing manners in the special effect material library, so that the user can apply special effect processing to each key text and each common subtitle text.
In some possible embodiments, the interface provided for the user to edit the common subtitle text may include a subtitle text selection area, a content editing area, a first special effect adding area, and a subtitle presentation area, as shown in fig. 3a. The subtitle text selection area contains each common subtitle text in the display video, and the user can custom-edit any common subtitle text by clicking it in this area. For example, after the user clicks the common subtitle text "good taste" in the subtitle text selection area, the text content of that subtitle is displayed in the content editing area, where the user can edit, i.e. customize, it.
The user can also adjust the font size, font type, and font color of the subtitle text in the first special effect adding area. If the user has selected a common subtitle text in the subtitle text selection area, the special effect set in the first special effect adding area is applied only to the selected subtitle text; if no common subtitle text is selected, the special effect is applied to all common subtitle texts. After the user chooses a special effect setting for the common subtitle text, the subtitle presentation area automatically displays the subtitle text after special effect processing.
In some possible embodiments, the interface provided for the user to edit the key text may include a key text selection area, a second special effect adding area, and a special effect display area, as shown in fig. 3b. The key text selection area contains each key text in the display video, and the user can custom-edit a key text by clicking it in this area. After the user clicks a key text in the key text selection area, the special effect processing manners in the special effect material library that correspond to the key text are displayed in the second special effect adding area. For example, after the user selects the key text "pay attention to me" in the key text selection area, the user may add special effects such as a "lifting effect" or a "firework effect" to it. When the user selects a special effect in the second special effect adding area, the display effect of the key text after the selected special effect processing is automatically shown in the special effect display area.
In this way, users are provided with the special effect processing manners for the advertisement materials, can conveniently produce display videos according to their own preferences, and user satisfaction is improved.
Based on the same inventive concept, the present application also provides a video processing apparatus 400, as shown in fig. 4, comprising:
a display video parsing module 401 configured to acquire a target video whose video content contains a plurality of display objects;
a material acquisition module 402 configured to perform slicing processing on the target video to obtain video segments of different display objects;
a display video synthesis module 403 configured to, for a target display object among the different display objects, parse key frames of the target display object from the video segment of the target display object, and generate a display video of the target display object based on the key frames.
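As an illustration only, a naive key-frame picker based on frame differencing is sketched below; the patent does not specify the analysis used, so the threshold and the differencing criterion are assumptions.

```python
import cv2
import numpy as np

def extract_key_frames(segment_path, diff_thresh=30.0):
    """Keep frames that differ strongly (mean absolute pixel difference)
    from the previously kept frame, a stand-in for whatever key-frame
    analysis the display video synthesis module actually performs."""
    cap = cv2.VideoCapture(segment_path)
    key_frames, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is None or np.abs(gray - prev).mean() > diff_thresh:
            key_frames.append(frame)
            prev = gray
    cap.release()
    return key_frames
```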
In some possible embodiments, to perform the slicing processing on the target video and obtain video segments of different display objects, the material acquisition module is configured to:
detect the content of the target video and identify the video segments of the different display objects.
In some possible embodiments, if the target video includes display object tags, then to perform the slicing processing on the target video and obtain video segments of different display objects, the material acquisition module is configured to:
identify the display object tags in the target video, and perform slicing processing on the target video based on the display object tags to obtain the video segments of the different display objects; a display object tag represents the position of the video content of a display object in the target video.
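A sketch of tag-based slicing follows, under the hypothetical assumption that the tags have already been resolved to (object id, start, end) timestamps; it uses the moviepy 1.x API.

```python
from moviepy.editor import VideoFileClip

def slice_by_tags(video_path, tags):
    """Cut the target video into one sub-clip per display object.

    `tags` is an assumed list of (object_id, start_s, end_s) tuples marking
    where each display object's content sits in the target video.
    """
    clip = VideoFileClip(video_path)
    return {obj: clip.subclip(start, end) for obj, start, end in tags}

# segments = slice_by_tags("live_session.mp4",
#                          [("hand_cream", 0, 95), ("lipstick", 95, 240)])
```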
In some possible embodiments, after the display video of the target display object is generated based on the key frames, the display video synthesis module further includes:
a special effect processing unit configured to acquire advertisement materials from the audio-video resource associated with the key frames;
and perform special effect processing on the advertisement materials and synthesize the processed advertisement materials into the display video.
In some possible embodiments, to acquire the advertisement materials from the audio-video resource associated with the key frames, the special effect processing unit is configured to:
identify the key text in the audio-video resource, and take the key text, the text type of the key text, and the common text as the advertisement materials;
and to perform the special effect processing on the advertisement materials, the special effect processing unit is configured to:
perform special effect processing on the key text according to the text type of the key text, and display the common text as subtitles in synchronization with the image content of the display video.
In some possible embodiments, to identify the key text in the audio-video resource and the text type of the key text, the special effect processing unit is configured to:
extract text information from the voice information of the display video;
match the text information against a preset key word set, and take the text content in the text information that matches the preset key word set as the key text;
and take the text type corresponding to the matched preset key word as the text type of the key text.
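A minimal sketch of this keyword-matching step; the key word set and its category names are illustrative assumptions (the key text "pay attention to me" is the example used earlier in the description).

```python
# Assumed preset key word set: each key word maps to a text type.
PRESET_KEYWORDS = {
    "pay attention to me": "call_to_action",
    "discount": "price_emphasis",
    "limited time": "urgency",
}

def match_key_text(transcript):
    """Return (key_text, text_type) pairs found in the ASR transcript;
    everything that does not match would be treated as common text."""
    return [(kw, t) for kw, t in PRESET_KEYWORDS.items() if kw in transcript]
```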
In some possible embodiments, to identify the key text in the audio-video resource and the text type of the key text, the special effect processing unit is configured to:
perform a recognition operation on the voice information of the display video to obtain the emphasized words in the voice signal as the key text;
and determine the text type of the key text based on the intonation type of the emphasized words.
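One way such emphasis detection might look is sketched below, flagging words whose mean pitch and energy exceed the utterance average; this is an assumption, not the patent's algorithm. Word timestamps are assumed to come from an ASR pass, and the 1.2x thresholds are arbitrary.

```python
import librosa
import numpy as np

def find_emphasized_words(audio_path, words):
    """Flag words spoken with above-average pitch and energy.

    `words` is an assumed [(word, start_s, end_s), ...] list from ASR.
    """
    y, sr = librosa.load(audio_path, sr=None)
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)   # per-frame pitch estimate
    rms = librosa.feature.rms(y=y)[0]               # per-frame energy
    hop = 512  # default hop length shared by both features above
    emphasized = []
    for word, t0, t1 in words:
        i0 = int(t0 * sr / hop)
        i1 = max(int(t1 * sr / hop), i0 + 1)
        if (np.mean(f0[i0:i1]) > 1.2 * np.mean(f0)
                and rms[i0:i1].mean() > 1.2 * rms.mean()):
            emphasized.append(word)
    return emphasized
```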
In some possible embodiments, to identify the key text in the audio-video resource and the text type of the key text, the special effect processing unit is configured to:
extract text information from the voice information;
match the text information against the preset key word set, and take the text content in the text information that matches the preset key word set as the key text;
and perform a recognition operation on the voice information of the display video using a voice signal recognition technique, to obtain the intonation type of the key text as the text type of the key text.
In some possible embodiments, the display video synthesis module further comprises:
an expression special effect unit configured to identify the expression of a target portrait in the display video;
and add an expression special effect to the target portrait based on the expression of the target portrait.
In some possible embodiments, the display video synthesis module further comprises:
an information adding unit configured to identify, based on the audio-video resource of the display video, the start position in the display video at which the target display object is explained;
and add key information of the target display object to the video frame image corresponding to the start position, the key information including an appearance picture of the target display object and/or text description information of the target display object.
In some possible embodiments, the apparatus further comprises:
a video editing module configured to display an editing interface for the display video;
and edit the display video in response to an editing operation in the editing interface.
Having introduced the apparatus provided by the present application, an electronic device 130 according to this embodiment of the present application is described below with reference to fig. 5. The electronic device 130 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 5, the electronic device 130 takes the form of a general-purpose electronic device. Its components may include, but are not limited to: at least one processor 131, at least one memory 132, and a bus 133 connecting the various system components (including the memory 132 and the processor 131).
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus architectures.
Memory 132 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., a keyboard, a pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices.
Such communication may occur through an input/output (I/O) interface 135. Also, the electronic device 130 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter 136. As shown, the network adapter 136 communicates with the other modules of the electronic device 130 over the bus 133. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 130, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In some possible embodiments, aspects of a video processing method provided by the present application may also be implemented in the form of a program product comprising program code for causing a computer device to carry out the steps of a video processing method according to the various exemplary embodiments of the application as described in the present specification, when the program product is run on a computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), optical fiber, portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for video processing of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device, partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (18)

1. A method of video processing, the method comprising:
acquiring a target video whose video content contains a plurality of display objects;
performing slicing processing on the target video to obtain video segments of different display objects;
for a target display object among the different display objects, parsing key frames of the target display object from the video segment of the target display object, and generating a display video of the target display object based on the key frames;
performing a recognition operation on the voice information of the display video, taking the emphasized words in the voice information as key text, and taking the voice information not taken as the key text as common text;
determining the text type of the key text based on the intonation type of the emphasized words; performing special effect processing on advertisement materials, and synthesizing the processed advertisement materials into the display video; wherein the advertisement materials comprise the key text, the text type of the key text, and the common text;
wherein the performing special effect processing on the advertisement materials comprises:
performing special effect processing on the key text according to the text type of the key text, and displaying the common text as subtitles in synchronization with the image content of the display video.
2. The method according to claim 1, wherein the performing slicing processing on the target video to obtain video segments of different display objects comprises:
detecting the content of the target video and identifying the video segments of the different display objects.
3. The method according to claim 1, wherein, if the target video includes a display object tag, the performing slicing processing on the target video to obtain video segments of different display objects comprises:
identifying the display object tag in the target video, and performing slicing processing on the target video based on the display object tag to obtain video segments of different display objects; the display object tag is used for representing the position of the video content of the display object in the target video.
4. The method of claim 1, wherein, after the generating the display video of the target display object based on the key frames, the method further comprises:
extracting text information from the voice information of the display video;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as the key text;
and taking the text type corresponding to the matched preset key word as the text type of the key text.
5. The method of claim 1, wherein, after the generating the display video of the target display object based on the key frames, the method further comprises:
extracting text information from the voice information;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as a key text;
and carrying out recognition operation on the voice information of the display video by adopting a voice signal recognition technology to obtain the intonation type of the key text as the text type of the key text.
6. The method of claim 1, wherein, after the synthesizing the processed advertisement materials into the display video, the method further comprises:
identifying the expression of a target portrait in the display video;
and adding an expression special effect to the target portrait based on the expression of the target portrait.
7. The method of claim 1, wherein, after the synthesizing the processed advertisement materials into the display video, the method further comprises:
based on the audio-video resource of the display video, identifying the start position in the display video at which the target display object is explained;
and adding key information of the target display object to the video frame image corresponding to the start position, the key information including an appearance picture of the target display object and/or text description information of the target display object.
8. The method of any of claims 1-7, wherein, after the synthesizing the processed advertisement materials into the display video, the method further comprises:
displaying an editing interface for the display video;
and responding to the editing operation in the editing interface, and editing the display video.
9. A video processing apparatus, the apparatus comprising:
a display video parsing module configured to acquire a target video whose video content contains a plurality of display objects;
a material acquisition module configured to perform slicing processing on the target video to obtain video segments of different display objects;
a display video synthesis module configured to, for a target display object among the different display objects, parse key frames of the target display object from the video segment of the target display object, and generate a display video of the target display object based on the key frames;
the display video synthesis module further comprises a special effect processing unit, wherein the special effect processing unit is configured to perform a recognition operation on the voice information of the display video, take the emphasized words in the voice information as key text, and take the voice information not taken as the key text as common text;
determine the text type of the key text based on the intonation type of the emphasized words; and perform special effect processing on advertisement materials and synthesize the processed advertisement materials into the display video; wherein the advertisement materials comprise the key text, the text type of the key text, and the common text;
wherein, to perform the special effect processing on the advertisement materials, the special effect processing unit is configured to:
perform special effect processing on the key text according to the text type of the key text, and display the common text as subtitles in synchronization with the image content of the display video.
10. The apparatus of claim 9, wherein, to perform the slicing processing on the target video and obtain video segments of different display objects, the material acquisition module is configured to:
detect the content of the target video and identify the video segments of the different display objects.
11. The apparatus of claim 9, wherein, if the target video includes a display object tag, then to perform the slicing processing on the target video and obtain video segments of different display objects, the material acquisition module is configured to:
identifying the display object tag in the target video, and performing slicing processing on the target video based on the display object tag to obtain video segments of different display objects; the display object tag is used for representing the position of the video content of the display object in the target video.
12. The apparatus of claim 9, wherein the special effects processing unit is further configured to:
extracting text information from the voice information of the display video;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as the key text;
and taking the text type corresponding to the matched preset key word as the text type of the key text.
13. The apparatus of claim 9, wherein the special effects processing unit is further configured to:
extracting text information from the voice information;
matching the text information with a preset key word set, and taking text content matched with the preset key word set in the text information as a key text;
and carrying out recognition operation on the voice information of the display video by adopting a voice signal recognition technology to obtain the intonation type of the key text as the text type of the key text.
14. The apparatus of claim 9, wherein the display video synthesis module further comprises:
an expression special effect unit configured to identify the expression of a target portrait in the display video;
and add an expression special effect to the target portrait based on the expression of the target portrait.
15. The apparatus of claim 9, wherein the display video synthesis module further comprises:
an information adding unit configured to identify, based on the audio-video resource of the display video, the start position in the display video at which the target display object is explained;
and add key information of the target display object to the video frame image corresponding to the start position, the key information including an appearance picture of the target display object and/or text description information of the target display object.
16. The apparatus according to any one of claims 9-15, wherein the apparatus further comprises:
a video editing module configured to display an editing interface for the display video;
and responding to the editing operation in the editing interface, and editing the display video.
17. An electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A computer storage medium, characterized in that the computer storage medium stores a computer program for causing a computer to perform the method according to any one of claims 1-8.
CN202110558410.8A 2021-05-21 2021-05-21 Video processing method and related device Active CN113312516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110558410.8A CN113312516B (en) 2021-05-21 2021-05-21 Video processing method and related device

Publications (2)

Publication Number Publication Date
CN113312516A CN113312516A (en) 2021-08-27
CN113312516B true CN113312516B (en) 2023-11-21

Family

ID=77374152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110558410.8A Active CN113312516B (en) 2021-05-21 2021-05-21 Video processing method and related device

Country Status (1)

Country Link
CN (1) CN113312516B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118535750A (en) * 2024-07-24 2024-08-23 浙江鸟潮供应链管理有限公司 Multimedia data processing method, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106161873A (en) * 2015-04-28 2016-11-23 天脉聚源(北京)科技有限公司 A kind of video information extracts method for pushing and system
CN106162328A (en) * 2015-04-28 2016-11-23 天脉聚源(北京)科技有限公司 A kind of video synchronizing information methods of exhibiting and system
CN112040263A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN112714349A (en) * 2019-10-24 2021-04-27 阿里巴巴集团控股有限公司 Data processing method, commodity display method and video playing method

Also Published As

Publication number Publication date
CN113312516A (en) 2021-08-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant