CN115022732A

CN115022732A - Video generation method, device, equipment and medium

Info

Publication number: CN115022732A
Application number: CN202210583689.XA
Authority: CN
Inventors: 贺欣; 谢佳雯; 陈建宇; 吴春松; 刘延朋; 常小军; 熊成; 刘成; 赵翊腾; 姜永刚; 李金�; 陈炳辉; 包季真; 黄博翔
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-09-06
Anticipated expiration: 2042-05-25
Also published as: CN115022732B

Abstract

The application provides a video generation method, a device, equipment and a medium, wherein the video generation method comprises the following steps: responding to a video generation request of a client, and acquiring an original video related to a recommended object; performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video; and processing the original video according to the image identification information and the text segment to obtain at least one target video, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video. Therefore, the target video is automatically extracted based on the original video, the original video is not required to be manually edited into one or more target videos by a user, the video generation efficiency is improved, the video generation cost is reduced, and meanwhile, the image identification information and the text segment obtained by multi-mode feature identification are utilized in the extraction process of the target video, so that the video generation quality is ensured.

Description

Video generation method, device, equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a video generation method, apparatus, device, and medium.

Background

In recent years, with the rapid development of mobile internet technology and related infrastructure, mobile internet users are becoming more and more accustomed to watching short videos, and the application programs related to the short videos occupy most of the surfing time of the mobile internet users. Under the background, the business related to the short video under the scene of the e-market is rapidly developed, and the short video brings a large amount of free goods carrying flow to merchants on line.

In the related technology, a merchant uses an original video with a long shooting time to manually perform complex editing operation on the original video with the shooting time of several hours in an editing tool, and the process consumes a large amount of time of editing personnel, so that the production efficiency of the short video is low, the production cost is high, and further part of merchants cannot develop short video operation business or reduce the short video operation business due to cost limitation.

Therefore, how to produce high-quality short videos with high efficiency and low cost is a problem to be solved.

Disclosure of Invention

The application provides a video generation method, a video generation device, video generation equipment and a video generation medium, which are used for solving the problem of how to efficiently produce high-quality short videos at low cost.

In a first aspect, an embodiment of the present application provides a video generation method, applied to a server, including: responding to a video generation request of a client, and acquiring an original video related to a recommended object; performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video; and processing the original video according to the image identification information and the text segment to obtain at least one target video, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video.

In a second aspect, an embodiment of the present application provides a video generation method, applied to a client, including: responding to the interactive operation of a user for an original video related to a recommended object, and sending a video generation request to a server to request video generation based on the original video; and receiving at least one target video returned by the server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by processing the original video based on the image identification information of the image frames in the original video and the text segment corresponding to the voice in the original video.

In a third aspect, an embodiment of the present application provides a video generating apparatus, including: the system comprises an acquisition unit, a recommendation unit and a recommendation unit, wherein the acquisition unit is used for responding to a video generation request of a client and acquiring an original video related to a recommended object; the recognition unit is used for performing multi-mode feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video; and the extraction unit is used for processing the original video according to the image identification information and the text segment to obtain at least one target video, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video.

In a fourth aspect, an embodiment of the present application provides a video generating apparatus, including: a sending unit, configured to send a video generation request to a server to request video generation based on an original video in response to an interactive operation of a user on the original video related to a recommendation object; and the receiving unit is used for receiving at least one target video returned by the server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by processing the original video based on the image identification information of the image frames in the original video and the text segment corresponding to the voice in the original video.

In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the electronic device to perform the video generation method provided by the first aspect and/or the second aspect of the present application.

In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a video generation method provided in the first aspect and/or the second aspect of the present application.

In a seventh aspect, an embodiment of the present application provides a computer program product, where the computer program product includes: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the video generation method provided by the first aspect and/or the second aspect of the present application.

According to the technical scheme, in the embodiment of the application, multi-modal feature recognition is carried out on the original video related to the recommended object to obtain the image recognition information of the image frames in the original video and the text segments corresponding to the voice in the original video, the original video is processed according to the image recognition information and the text segments to obtain at least one target video, and the target video is one video segment in the original video or a combination of a plurality of video segments in the original video. Therefore, the embodiment of the application realizes automatic extraction of the target video, namely, automatic extraction of the short video, improves the generation efficiency of the short video, reduces the generation cost of the short video, and can extract the short video containing effective content from the original video based on the image information and the text information obtained by multi-modal feature recognition, thereby improving the quality of the short video.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.

Fig. 1 is a scene schematic diagram of a video generation method according to an embodiment of the present application;

fig. 2 is a first flowchart of a video generation method according to an embodiment of the present disclosure;

fig. 3 is a schematic flowchart of a video generation method according to an embodiment of the present application;

fig. 4 is a schematic flowchart illustrating a video generation method according to an embodiment of the present application;

fig. 5 is a block diagram of a video generating apparatus 50 according to an embodiment of the present application;

fig. 6 is a block diagram of a video generating apparatus 60 according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "comprises," "comprising," and "having," and any variations thereof, in the description and claims of this application and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.

First, some terms related to the embodiments of the present application are explained:

short video: for videos with video duration less than a duration threshold (e.g., 5 minutes, 10 minutes), the duration threshold specified for short videos may be different for different short video applications. In the E-commerce field, commodities are introduced to a user quickly through short videos, so that the time of the user is saved, and the characteristics of the commodities can be highlighted.

In the related art, a short video is generated by manually editing an original video as long as several hours, resulting in low production efficiency and high production cost of the short video. If the original video is simply divided into a plurality of short videos, the quality of the short videos cannot be guaranteed.

In order to solve the above problem, embodiments of the present application provide a video generation method, apparatus, device, and medium. In the embodiment of the application, multi-modal feature recognition is carried out on an original video related to a recommended object to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video, and the original video is processed based on the image recognition information of the image frames in the original video and the text segments corresponding to the voices in the original video to obtain at least one target video. Therefore, the target video with shorter time duration is automatically extracted from the original video, namely, the automatic extraction of the short video is realized, manual editing is not needed, the generation efficiency of the short video is improved, and the generation cost of the short video is reduced; in addition, based on image information related to image frames in the original video and text information related to voice in the original video, short video containing effective content can be extracted from the original video, and the quality of the short video is improved. Therefore, the embodiment of the application effectively solves the problem of how to generate high-quality short video with high efficiency and low cost.

Optionally, the recommendation object includes a commodity object. Therefore, the target video related to the commodity object is automatically extracted from the original video related to the commodity object, the generation efficiency of the short video of the commodity object and the quality of the short video of the commodity object are improved, and the generation cost of the short video of the commodity object is reduced. Therefore, by means of the method and the device, on one hand, a merchant can be helped to quickly produce the short videos of the commodities with goods in an almost zero-cost mode, sales of the merchant are increased, on the other hand, supply of the short videos is increased, and watching requirements of consumers on the short videos can be better met.

Optionally, the original video related to the recommended object includes a live video introducing the recommended object. Therefore, the target video of the recommended object is automatically extracted based on the live video, namely the characteristics of the recommended object are introduced by the live video, the short video of the recommended object is automatically generated based on the live video, a user does not need to shoot video materials specially for the short video, the generation efficiency of the short video is effectively improved, the quality of the short video is ensured, and the generation cost of the short video is reduced. In particular, in the case where the recommended object includes a commodity object, the original video related to the commodity object includes a live video that introduces the commodity object.

Fig. 1 is a schematic view of an application scenario of a video generation method provided in an embodiment of the present application. As shown in fig. 1, the apparatus performing the video generating method is a video generating apparatus, and the video generating apparatus may be connected to a client.

A client may be any computing device with certain data processing capabilities. At this time, the basic structure of the client may include: at least one processor. The number of processors depends on the configuration and type of the client. The client may also include a memory, which may be volatile, such as RAM, or non-volatile, such as Read-Only memory (ROM), flash memory, etc., or may include both types. The memory typically stores an Operating System (OS), one or more application programs, and may also store program data and the like. In addition to the processing unit and the memory, the client includes some basic configurations, such as a network card chip, an IO bus, a display component, and some peripheral devices. Alternatively, some peripheral devices may include, for example, a keyboard, mouse, stylus, printer, etc. Other peripheral devices are well known in the art and will not be described in detail herein. Alternatively, the client may be a pc (personal computer) terminal, a handheld terminal (e.g., a smart phone, a tablet computer), or the like.

The video generation apparatus may be a device that provides video processing services in a network virtual environment. Optionally, the video generation apparatus may be an apparatus deployed with a model for recognizing multimodal features, and in the video generation apparatus, image frames in an original video and voices in the video may be recognized based on the deployed model, and then the original video is processed based on a recognition result to obtain at least one target video.

In physical implementation, the video generating apparatus may be any device capable of providing a computing service, responding to a service request, and performing processing, for example: can be cluster servers, regular servers, cloud hosts, virtual centers, and the like. The video generating device mainly comprises a processor, a hard disk, a memory, a system bus and the like, and is similar to a general computer framework.

The client can be in network connection with the video generation device, and the network connection can be wireless or wired. If the client is communicatively connected to the video generating apparatus, the network format of the mobile network may be any one of 2G (gsm), 2.5G (gprs), 3G (WCDMA, TD-SCDMA, CDMA2000, UTMS), 4G (LTE), 4G + (LTE +), WiMax, 5G, and the like.

In this embodiment of the application, the client may send a video generation request to the video generation device to request the video generation device to extract a video segment based on the original video, so as to obtain at least one target video. Optionally, the video generation apparatus may return a message that the target video extraction is completed to the client, and may also return at least one target video extracted from the original video to the client.

Preferably, in an application scenario, the client sends a video generation request to the video generation device to request the video generation device to generate a short video of the commodity object based on a live video of the commodity object; the video generation device responds to the video generation request, obtains live videos of the commodity objects, and extracts at least one short video from the live videos in a multi-mode feature recognition mode. Then, optionally, the video generation apparatus returns a message that the short video extraction is completed to the client, or returns at least one short video extracted from the live video to the client.

The above is only an exemplary application scenario. Besides videos of commodity objects, the embodiment of the application can be applied to video extraction of other recommended objects in the internet, such as video extraction for recommending television series, video extraction for recommending books, and the like.

The technical solution of the present application will be described in detail by specific embodiments with reference to the accompanying drawings. It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 2 is a first flowchart of a video generation method provided in an embodiment of the present application, where the method is applicable to a server. As shown in fig. 2, a video generation method provided in the embodiment of the present application includes:

s201, in response to a video generation request of a client, obtaining an original video related to a recommended object.

The video generation request is used for requesting to extract at least one target video from the original video related to the recommended object. Accordingly, the video generation request may include video information (such as a video name, a video storage address, a video capturing time, etc.) of the original video related to the recommended object and/or the original video related to the recommended object, so that the server may accurately obtain the original video based on the video generation request.

The original video related to the recommended object refers to the video content of the original video related to the recommended object, and particularly, the original video related to the recommended object may include the original video showing, introducing and/or commenting the recommended object. As described in the foregoing, optionally, the original video related to the recommended object includes a live video of the recommended object, and therefore, when the recommended object is a commodity object, the original video related to the recommended object may include a live video of the commodity object.

In this embodiment, the client may send a video generation request to the server in response to the user's interactive operation. After receiving a video generation request from a client, a server may obtain an original video related to a recommended object from the client, obtain an original video related to the recommended object from another device (e.g., a storage device), and obtain an original video related to the recommended object that is locally stored by the server in response to the video generation request. Wherein, the original video related to the recommended object can be one or more.

In one possible implementation, the client sends a video generation request to the server in response to the end of the original video shot associated with the recommended object. Therefore, the target video is timely extracted from the original video after the original video shooting is finished, the video generation efficiency is improved, and the user experience is improved. For example, a user live broadcasts and records live videos on a client, clicks to finish recording after live broadcasting is finished, the client sends a video generation request to a server in response to interactive operation of clicking to finish recording, and the server automatically extracts at least one target video from the recorded live videos at the back end.

In a possible implementation manner, the server obtains an original video related to the recommended object from the client, may obtain the original video related to the recommended object from the video generation request, and the server may also return the video acquisition request to the client in response to the video generation request, so as to obtain the original video related to the recommended object that is returned by the client in response to the video acquisition request.

S202, performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video.

Where multi-modal features refer to features of multiple forms (such as images, text, speech, etc.).

The image recognition information of the image frame may include object information of a target object recognized in the image frame, the target object may include a recommended object and/or a person, the object information of the recommended object may include an image position of the recommended object and an initial category of the target object, and the object information of the person may include one or more of an image position of the person, a gender of the person, and a face image of the person.

For example, the image position and the initial category of the recommended object are identified in some image frames, the image position of the person is identified in other image frames, and the image position of the recommended object, the initial category of the recommended object, and the image position of the person are identified in still other image frames.

Alternatively, in the case where the original video related to the target object is a live video of the commodity object, the image recognition information of the image frame may include object information of the commodity object in the image frame and/or object information of a person in the image frame. The object information of the commodity object may include an initial category (such as jacket, trousers, shoes) of the commodity object and an image position of the commodity object, among others.

The original video can include multiple voice segments, one or more text segments can be recognized from the multiple voice segments, and the text segments are recorded with characters described by the voice.

In this embodiment, different feature recognition models may be used for features of different modalities. Aiming at an original video, an image recognition model can be adopted to recognize image frames in the original video to obtain image recognition information of the image frames in the original video; the speech recognition model can be adopted to recognize the speech in the original video to obtain the text segment corresponding to the speech in the original video. Here, the specific model structure of the image recognition model and the voice recognition model is not limited.

S203, processing the original video according to the image identification information of the image frames in the original video and the text segments corresponding to the voice in the original video to obtain at least one target video.

The target video is one video segment in the original video or a combination of a plurality of video segments in the original video. When the target video is multiple, all the target videos may be one video segment in the original video, or all the target videos may be a combination of multiple video segments in the original video, or a part of the target videos may be one video segment in the original video, and another part of the target videos may be a combination of multiple video segments in the original video.

In this embodiment, in the original video, not all video contents are related to the recommended object, and considering that the duration of the target video is shorter than that of the original video, the video contents unrelated to the recommended object in the original video may be regarded as invalid contents, and the video contents related to the recommended object may be regarded as valid contents, and the video quality of the target video may be ensured by determining the valid contents in the original video. In the original video, the image identification information of the image frame and the text segment corresponding to the voice in the original video can both reflect the video content of the original video, so that the effective content in the original video can be identified based on the image identification information of the image frame in the original video and the text segment corresponding to the voice in the original video, and the original video is processed based on the effective content in the original video to obtain at least one target video. In the process of processing the original video, video clip extraction can be performed on the original video based on the effective content of the original video, and finally, at least one target video is obtained based on the video clip extracted from the original video.

In one possible implementation, the effective content in the original video may include effective explanation information, i.e., effective explanation information for the recommended object. For example, in a live video of a commodity object, an explanation of characteristics such as advantages and materials of the commodity object belongs to effective explanation information, and an interaction with a bullet screen does not belong to the effective explanation information. Therefore, the candidate text segment containing the effective explanation information in the text segment can be determined based on the image identification information of the image frame in the original video and the text segment corresponding to the voice in the original video, and the original video is processed based on the candidate text segment to obtain the target video. Therefore, by determining the candidate documents containing the effective explanation information, the accuracy of identifying the effective contents in the original video is improved, the accuracy of extracting the target video from the original video is further improved, and the quality of the target video is ensured.

In this implementation, the image identification information of the image frame in the original video includes the initial category of the recommended object and the image position of the recommended object, so that it can be determined whether the text segment includes the explanation information related to the recommended object based on the image identification information of the image frame in the original video, and if so, it can be determined that the text segment belongs to a candidate text segment including valid explanation information. For example, the image recognition information of the image frame includes an initial type of a recommended object as a jacket, and it is determined whether the text segment includes explanation information related to the jacket, for example, the size, color, material, and the like of the jacket. Therefore, the accuracy of screening the candidate text segments in the text segments is improved, and the quality of the target video is further improved.

In yet another possible implementation manner, candidate image frames containing valid image content (for example, image frames of a recommendation object are identified) may be determined based on image identification information of image frames in the original video, and candidate text segments containing valid explanation content may be determined based on text segments corresponding to voices in the original video. And combining the candidate image frames containing the effective image content and the candidate text segments containing the effective explanation content to extract at least one target video from the original video. Therefore, the quality of the target video is improved by combining the screening of the image frame and the screening of the candidate text segment. In the process of extracting at least one target video from an original video by combining a candidate image frame containing effective image content and a candidate text segment containing effective explanation content, at least one target video can be obtained by clipping in the original video based on time information corresponding to the candidate image frame and time information corresponding to the candidate text segment.

In the embodiment of the application, in response to a video generation request of a client, multi-modal feature recognition is performed on an original video related to a recommended object to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video; at least one target video is extracted from the original video by identifying valid content in the original video based on the image identification information and the text segment. Therefore, the target video is automatically extracted from the original video, particularly the short video is automatically extracted from the live video, the generation efficiency of the short video is effectively improved, the generation cost of the short video is reduced, the quality of the short video is improved, and the high-quality short video is generated efficiently at low cost.

Fig. 3 is a schematic flowchart of a video generation method according to an embodiment of the present application. As shown in fig. 3, a video generation method provided in the embodiment of the present application includes:

s301, in response to a video generation request of a client, acquiring an original video related to a recommended object.

S302, performing multi-mode feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video.

The implementation principle and the technical effect of S301 to S302 may refer to the foregoing embodiments, and are not described again.

S303, according to the image identification information of the image frame in the original video, carrying out category prediction on the recommended object in the original video to obtain the target category to which the recommended object in the original video belongs.

In this embodiment, in the multi-modal feature recognition, the initial category of the recommendation object may be recognized in the image frame, but in an actual scene, the recommendation objects are often further classified in detail according to the style to which the recommendation objects belong and the group to which the recommendation objects are applicable, and when explaining the recommendation objects, the explanation words applicable to the recommendation objects of different classifications are different. Taking a commodity object as an example, in the multi-mode feature recognition, the recommended object in the image frame is identified as a jacket, on the e-commerce platform, the jacket can be further divided into different categories, such as women's clothing, men's clothing, children's garments, T-shirts, long sleeves, academia, professional clothing and the like, and different categories are suitable for different explanation words. Therefore, after multi-modal feature recognition, the category of the recommended object needs to be further predicted based on the image recognition information of the image frame to obtain the target category of the recommended object, so that the accuracy of judging the subsequent effective explanation content is improved according to the target category of the recommended object.

In this embodiment, after the image identification information of the image frame in the original video is obtained, since the image identification information includes the initial category of the recommended object and the image position of the recommended object, the category of the recommended object appearing in the original video may be predicted based on the initial category of the recommended object and the image position of the recommended object in the image identification information, so as to obtain the target category of the recommended object in the original video. In one mode, the category range of the recommended object may be obtained based on the initial category of the recommended object in the image recognition information, and for example, if the initial category is a jacket, the category range including a woman's jacket, a man's jacket, or the like may be specified; then, the recommended object can be further identified in the category range based on the image position of the recommended object in the image identification information, so that the target category of the recommended object is obtained, and the accuracy of the target category is improved.

In a possible implementation manner, the object information of the person in the image identification information may be used to assist the prediction of the category of the recommended object in the original video, so as to further improve the accuracy of the target category. For example, if the commodity object is a jacket worn by a person, the gender, image position, face image, etc. of the person may be used to assist in determining whether the jacket is a woman jacket, a man jacket, a child jacket, etc.

In yet another possible implementation manner, a text segment corresponding to the speech in the original video may be used to assist the category prediction of the recommended object in the original video, so as to further improve the accuracy of the target category. The method can be used for carrying out keyword identification in the text segment, the keywords are related to the category of the recommended object, and the category of the recommended object in the original video is predicted by combining the keywords identified from the text segment overlapped with the time information of the image frame and the image identification information of the image frame, so that the target category to which the recommended object in the original video belongs is obtained. For example, if the recommended object is a commodity object, "college style" or "girl" is recognized in the text fragment, and if the initial category of the commodity object in the image recognition information of the image frame is "shoe," the target category of the commodity object in the image frame may be determined to be the college style woman's shoe.

In yet another possible implementation, a knowledge graph is constructed in advance, and the knowledge graph includes images of recommended objects in at least one category. At this time, S303 includes: and carrying out image matching on the image of the recommended object under at least one category in the knowledge graph and the image identification information of the image frame in the original video, and determining the target category of the recommended object in the original video according to the image matching result. Therefore, the accuracy of the target category is improved based on the pre-constructed knowledge graph and the image matching mode.

Wherein, in the knowledge-graph, a category may correspond to one or more images of the recommended object. The image identification information of the image frame may be embodied in an image form, for example, an image position of the recommendation object, an initial category of the recommendation object, an image position of the person, a gender of the person, and the like are marked on the image frame.

In the implementation mode, the image of the recommended object under at least one category in the knowledge graph is matched with the image identification information of the image frame in the original video to obtain the similarity between the image of the recommended object under at least one category and the image identification information of the image frame. Then, among the images of the recommended object under at least one category, an image with the highest similarity with the image identification information of the image frame is determined, and the target category of the recommended object on the image frame is determined to be the category to which the image with the highest similarity belongs.

In the case that there are many types of recommended objects in the knowledge map, if the images of the recommended objects under the respective types are matched with the image identification information of the image frames one by one, the efficiency of the type prediction will be low, and in order to improve the efficiency of the type prediction, it is further optional that S303 may adopt the following manner: acquiring images of the recommended objects under the candidate categories in a knowledge graph according to the initial categories of the recommended objects in the image identification information of the image frames; and performing image matching on the image of the recommended object under the candidate category and the image identification information of the image frame, and determining the target category according to the image matching result.

Wherein the candidate categories are associated with initial categories, one initial category may correspond to one or more candidate categories. Further, the candidate category may belong to the initial category, in other words, the candidate category may be a sub-category of the initial category. For example, when the initial category is "shoes", the candidate categories may include "high heels", "cloth shoes", "sneakers", "leather shoes", and the like.

In this alternative, at least one candidate category may be determined from the categories of the recommended object included in the knowledge map according to the initial category of the recommended object in the image identification information of the image frame, and an image of the recommended object under the candidate category may be acquired in the knowledge map. Then, the image of the recommended object in the candidate category may be image-matched with the image identification information of the image frame, so as to obtain the similarity between the image of the recommended object in the candidate category and the image identification information of the image frame. Then, among the images of the recommended object under the candidate category, an image with the highest similarity to the image identification information of the image frame may be determined, and the target category of the recommended object on the image frame may be determined to be the category to which the image with the highest similarity belongs. Therefore, based on the initial category of the recommended object, the efficiency and accuracy of determining the target category of the recommended object are improved.

Optionally, in the case that the recommended object is a commodity object, the knowledge graph is a commodity knowledge graph. An image of a commodity object under at least one category may be included in the commodity knowledge map. Items such as men's clothing, women's clothing, household goods, home appliances, etc. may also be further subdivided and are not described in detail herein.

S304, according to the text segment corresponding to the voice in the original video and the target category of the recommended object in the original video, identifying effective explanation information of the text segment corresponding to the voice in the original video to obtain a candidate text segment containing the effective explanation information.

In this embodiment, the recommended objects of different categories have different characteristics, that is, different attributes, for example, the attributes of the garment may include "upper body slimming", "all cotton material", "silk material", and the like, and the attributes of the shoe may include "shoe size is accurate", "shock absorption and buffering performance is good", "shoe sole antiskid", and the like. After the text segment corresponding to the voice in the original video and the target type of the recommended object in the original video are obtained, the text segment corresponding to the voice in the original video can be identified by the effective explanation information based on the attribute corresponding to the target type of the recommended object in the original video, so that the text segment containing the effective explanation information is obtained, and the candidate text segment is determined to be the text segment containing the effective explanation information.

In one possible implementation manner, a knowledge graph is constructed in advance, and the knowledge graph comprises attributes of recommended objects in at least one category. The knowledge graph may be the knowledge graph in the foregoing embodiment, and in this case, the knowledge graph includes images and attributes of the recommended object in at least one category. In the case that the knowledge graph is a commodity knowledge graph, the attributes of the recommended object may further include a selling point of the commodity object. Taking the clothing commodity object as an example, the commodity knowledge graph may include attributes of the commodity object such as commodity material, style, color, style, and the like, and may further include selling points to be introduced for the commodity object during live broadcasting.

Based on the knowledge-graph including attributes of the recommended objects under at least one category, S304 includes: in a knowledge graph, acquiring attributes of recommendation objects under a target category; and performing text matching on the text segment corresponding to the voice in the original video and the attribute of the recommendation object in the target category to obtain a text segment containing the attribute of the recommendation object in the target category, and determining the candidate text segment as the text segment containing the attribute of the recommendation object in the target category. Therefore, the accuracy of screening candidate text segments in the text segments is improved by combining a knowledge graph and a text matching mode, and the quality of a target video is further improved.

In the implementation mode, after the target category of the recommended object in the original video is determined, the attribute of the recommended object under the target category is obtained from the knowledge graph; performing text matching on the text segment corresponding to the voice in the original video and the attribute of the recommendation object in the target category, and determining whether the text segment corresponding to the voice in the original video contains the attribute of the recommendation object in the target category; if the text segment corresponding to the voice in the original video contains the attribute of the recommendation object in the target category, the text segment can be regarded as the text segment containing the effective explanation information, and the text segment is determined to be the candidate text segment.

S305, processing the original video according to the candidate text segments to obtain a target video.

The time information may include a start time and a duration of the text segment, or the time information may include a duration and an end time of the text segment. Because the text segment corresponds to the voice in the original video, the time information of the voice in the original video can be marked to the corresponding text segment in the process of identifying the voice in the original video to obtain the text segment.

In this embodiment, after the candidate text segment is determined, video segment extraction may be performed on the original video based on the time information labeled on the candidate text segment, and the target video is obtained based on the extracted video segment. In one mode, video segments located between the time information marked by the candidate text segments can be extracted from the original video, so as to obtain the target video. Since the candidate text segment contains valid explanation content, the target video contains valid video content.

In one possible implementation, S305 includes: screening the candidate text segments according to the content quality requirement; and processing the original video according to the screened candidate text segments to obtain the target video. Therefore, after the candidate text segments are obtained, the candidate text segments are further screened to be inferior, namely the candidate text segments which do not meet the content quality requirement are removed, and the quality of the target video is further improved.

The content quality requirement can be embodied as words or sentences which are preset to influence the introduction effect of the recommendation object. In the case that the recommended object is a commodity object, the content instruction requirement may be embodied as words or sentences affecting the delivery effect, such as words or sentences related to the price of the commodity, the interaction between the anchor and the audience in the live broadcast process, the interaction between the anchor and the guests in the live broadcast process, and the like.

In this implementation, the words or sentences included in the content quality requirement may be matched with the candidate text segments, and it is determined whether the words or sentences included in the content quality requirement appear in the candidate text segments. And if the candidate text segment has words or sentences contained in the content quality requirement, determining that the candidate text segment does not meet the content quality requirement, otherwise determining that the candidate text segment meets the content quality requirement. And deleting the candidate text segments which do not meet the content quality requirement from the candidate text segments, thus obtaining the screened candidate text segments and realizing the screening and the deterioration of the candidate text segments. The video clip extraction can be carried out on the original video according to the time information marked on the screened candidate text clip, and the target video is obtained based on the extracted video clip.

In one possible implementation, S305 includes: and processing the original video according to the image identification information of the image frame in the original video and the time information marked on the candidate text segment to obtain the target video.

In this implementation manner, if the target video is extracted from the original video by solely depending on the time information marked on the candidate text segment, the target video may have a too short time, for example, the total time of the time information marked on one candidate text segment is less than 1 minute. In order to avoid the situation, video segments corresponding to the candidate text segments can be extracted from the original video based on the time information marked on the candidate text segments, and the video segments corresponding to the candidate text segments are combined into at least one target video based on the image identification information of the image frames in the original video. Thereby, the quality of the target video is further improved. Wherein the image recognition information may be used to determine the associated video segments during the merging of the video segments. The associated video segment may include at least one of: adjacent video clips, video clips with similar styles and video clips describing the same recommended object.

Thus, further optionally, S305 comprises: extracting a video clip corresponding to the candidate text clip from the original video according to the time information marked on the candidate text clip; and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain the target video. Therefore, the same target video describes the same recommended object, and the quality of the target video is improved.

In this optional manner, image matching may be performed on image frames in two video clips to obtain similarity of the image frames in the two video clips, and if the similarity of the image frames in the two video clips is greater than a similarity threshold, it is determined that the two video clips describe the same recommended object, and the two video clips are merged into the same video to obtain a new video clip. Therefore, under the condition of more video clips, the merging operation can be carried out for multiple times, and multiple target videos can be finally obtained. It should be noted that, when merging videos, the video segments may be merged according to a time sequence based on the time information marked on the merged video segments, so as to further improve the quality of the target video.

Further alternatively, the video clips participating in the merging may be video clips filtered based on content quality requirements. Therefore, the video clips meeting the quality requirements are combined into the target video based on the image identification information of the image frames in the video clips meeting the quality requirements, and the quality of the target video is improved.

In the embodiment of the application, multi-modal feature recognition is performed on an original video related to a recommended object to obtain image recognition information of an image frame in the original video and a text segment corresponding to voice in the original video, a target category to which the recommended object belongs in the original video is predicted based on the image recognition information of the image frame in the original video, and a candidate text segment containing effective explanation information is recognized and obtained in the text segment based on the target category of the recommended object. And then, extracting the target video from the original video based on the candidate text segments. Therefore, in the process of realizing automatic generation of the target video, the image information and the text information of the original video are fully utilized, and the quality of the target video is improved. Therefore, the generation efficiency and the quality of the short video are effectively improved, and the generation cost of the short video is reduced.

Optionally, in the process of extracting the target video, the text segment may also be used to generate a subtitle of the target video. In the process of extracting the target video from the original video according to the candidate text segments, the subtitles of the target video can be generated based on the candidate text segments, and the subtitles of the target video can be merged into the target video. Therefore, the complete target video with the subtitles is generated, and the quality of the target video is improved.

Optionally, after the target video is extracted, the server may return a message that the video extraction is completed to the client, or may return the target video like the client. Therefore, the client can know the completion of video extraction in time, and the client can acquire the target video in time. After obtaining the target video, the client can directly publish the target video, and also can perform optimization processing such as further selection and editing on the target video, and publish the target video after the optimization processing. Therefore, the time for extracting the target video from the original video is effectively saved, and the workload of editing personnel is reduced.

Optionally, based on any one of the foregoing embodiments, the image position of the person and the face image in the image identification information of the image frame may be used as an aid to generate a cover page of the target video, so as to improve the quality and integrity of the target video. In one mode, after a target video is extracted, based on image identification information of image frames in the target video, image frames with a face image centered and a human body centered are selected from the image frames of the target video, and the image frames are determined to be cover images of the target video, or the image frames are cut according to a preset size or a preset proportion, and the cut image frames are determined to be the cover images of the target video.

Fig. 4 is a third schematic flowchart of a video generation method provided in the embodiment of the present application, where the method is applied to a client. As shown in fig. 4, the video generation method includes:

s401, responding to the interactive operation of the user on the original video related to the recommended object, sending a video generation request to a server to request for video generation based on the original video.

The original video related to the recommended object may refer to the foregoing embodiments, and is not described again.

The interactive operation may be an interactive operation of a user on a display window of the original video, an interactive operation of a user on a shooting window of the original video, or an input operation of a user for inputting the original video in a video input area.

In this embodiment, when detecting that a user performs an interactive operation on an original video related to a recommended object, a client generates a video generation request and sends the video generation request to a server to request the server to perform video generation based on the original video, and in particular, requests the server to generate a short video based on the original video.

For example, taking an original video as a live video as an example, at a client, a user shoots the live video, clicks a key for ending live recording after the live video is ended, the client generates a video generation request and sends the video generation request to a server when detecting that the user clicks the key or receives a message for ending live video recording, and the server processes the live video at a back end after receiving the video generation request so as to process the live video into one or more short videos.

S402, receiving at least one target video returned by the server, wherein the target video is one video clip in the original video or a combination of a plurality of video clips in the original video.

The target video is obtained by processing the original video based on the image identification information of the image frames in the original video and the text segments corresponding to the voices in the original video, and the specific processing process can refer to any one of the foregoing embodiments and is not repeated in this embodiment.

In this embodiment, after the server generates at least one target video based on the target video, the server may send the generated at least one target video to the client, and after the client receives the at least one target video returned by the server, the client may send a prompt message to the user to prompt the user that the at least one target video has been generated based on the original video, and the client may also display the target video. The user can make further beautification (such as further manual clipping) on the target video on the client, and can also directly release the target video.

In the embodiment of the application, the client side responds to the interactive operation of a user for the original video related to the recommended object, and sends a video generation request to the server so as to process the original video in the background to obtain at least one target video. And the client receives at least one target video returned by the server. Therefore, a user can obtain a target video extracted from an original video only by simple interactive operation, especially obtain a short video extracted from a live video, so that the short video generation efficiency is effectively improved, and the user experience is effectively improved.

Fig. 5 is a block diagram illustrating a configuration of a video generation apparatus 50 according to an embodiment of the present application, where the video generation apparatus 50 is applied to a server. As shown in fig. 5, a video generation apparatus 50 according to an embodiment of the present application includes: an acquisition unit 51, a recognition unit 52 and an extraction unit 53, wherein:

an obtaining unit 51, configured to obtain an original video related to a recommended object in response to a video generation request of a client;

the recognition unit 52 is configured to perform multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video;

and the extracting unit 53 is configured to process the original video according to the image identification information and the text segment to obtain at least one target video, where the target video is one video segment in the original video or a combination of multiple video segments in the original video.

In an embodiment of the present application, the extracting unit 53 is specifically configured to: according to the image identification information, carrying out category prediction on a recommended object in the original video to obtain a target category to which the recommended object in the original video belongs; according to the text segment and the target category, identifying effective explanation information of the text segment to obtain a candidate text segment containing the effective explanation information; and processing the original video according to the candidate text segments to obtain the target video.

In an embodiment of the present application, a knowledge graph is pre-constructed, where the knowledge graph includes an image of a recommended object in at least one category, and in a process of performing category prediction on the recommended object in an original video according to image identification information to obtain a target category to which the recommended object in the original video belongs, the extraction unit 53 is specifically configured to: and carrying out image matching on the image of the recommended object under at least one category in the knowledge graph and the image identification information, and determining the target category according to the image matching result.

In an embodiment of the present application, the knowledge graph further includes attributes of recommended objects in at least one category, and in the process of identifying effective explanation information for a text segment according to the text segment and a target category to obtain a candidate text segment containing the effective explanation information, the extracting unit 53 is specifically configured to: in a knowledge graph, acquiring attributes of recommendation objects under a target category; and performing text matching on the text segment and the attribute of the recommendation object in the target category to obtain a text segment containing the attribute of the recommendation object in the target category, and determining the candidate text segment as the text segment containing the attribute of the recommendation object in the target category.

In an embodiment of the present application, in the process of processing an original video according to a candidate text segment to obtain a target video, the extracting unit 53 is specifically configured to: screening the candidate text segments according to the content quality requirement; and processing the original video according to the screened candidate text segments to obtain the target video.

In an embodiment of the application, the candidate text segment is labeled with time information, and in the process of processing the original video according to the candidate text segment to obtain the target video, the extracting unit 53 is specifically configured to: and processing the original video according to the image identification information and the time information marked on the candidate text segment to obtain the target video.

In an embodiment of the present application, in the process of processing the original video according to the image identification information and the time information labeled on the candidate text segment to obtain the target video, the extracting unit 53 is specifically configured to: extracting a video clip corresponding to the candidate text clip from the original video according to the time information marked on the candidate text clip; and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain the target video.

The video generation apparatus provided in the embodiment of the present application is configured to implement the technical solution in any one of the method embodiments executed on the server, and the implementation principle and the technical effect are similar, which are not described herein again.

Optionally, the technical solution provided in the embodiment of the present application may be implemented on a cloud server.

Fig. 6 is a block diagram illustrating a structure of a video generating apparatus 60 according to an embodiment of the present application, where the video generating apparatus 60 is applied to a client. As shown in fig. 6, a video generating apparatus 60 according to an embodiment of the present application includes: a transmitting unit 61 and a receiving unit 62, wherein:

a sending unit 61, configured to send a video generation request to a server to request video generation based on an original video in response to an interactive operation of a user on the original video related to a recommendation object;

and the receiving unit 62 is configured to receive at least one target video returned by the server, where the target video is one video segment in the original video or a combination of multiple video segments in the original video, and the target video is obtained by processing the original video based on the image recognition information of the image frames in the original video and the text segment corresponding to the voice in the original video.

The video generation apparatus provided in the embodiment of the present application is configured to execute the technical solution in any of the above method embodiments executed on the client, and the implementation principle and the technical effect are similar, which are not described herein again.

Optionally, the technical solution provided in the embodiment of the present application may be implemented on a terminal.

Fig. 7 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application. The cloud server is used for operating a video generation method, and is used for executing any one of the method embodiments, automatically extracting at least one target video from an original video and ensuring the quality of the target video. As shown in fig. 7, the cloud server includes: a memory 73 and a processor 74.

A memory 73 for storing computer programs and may be configured to store other various data to support operations on the cloud server. The Storage 73 may be an Object Storage Service (OSS).

The memory 73 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

A processor 74, coupled to the memory 73, for executing the computer program in the memory 73, so as to execute the video generation method provided by any of the foregoing embodiments

Further, as shown in fig. 7, the cloud server further includes: firewall 71, load balancer 72, communications component 75, power component 76, and other components. Only some of the components are schematically shown in fig. 7, and the cloud server is not meant to include only the components shown in fig. 7.

Correspondingly, the embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the electronic device to perform the steps of the above-described method embodiments.

Accordingly, the present application also provides a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the steps in the above method embodiments.

Accordingly, the present application also provides a computer program product, which includes a computer program/instruction, when executed by a processor, causes the processor to implement the steps in the above method embodiments.

The communication component 75 of fig. 7 described above is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component 75 is located may access a wireless network based on a communication standard, such as WiFi, a mobile communication network like 2G, 3G, 4G/LTE, 5G, or a combination thereof. In an exemplary embodiment, the communication component 75 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 75 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply module 76 of fig. 7 described above provides power to the various components of the device in which the power supply module 76 is located. The power components 76 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power components 76 are located.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims

1. A video generation method applied to a server is characterized by comprising the following steps:

responding to a video generation request of a client, and acquiring an original video related to a recommended object;

performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video;

and processing the original video according to the image identification information and the text segment to obtain at least one target video, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video.

2. The video generation method of claim 1, wherein the processing the original video according to the image identification information and the text segment to obtain at least one target video comprises:

according to the image identification information, carrying out category prediction on the recommended object in the original video to obtain a target category to which the recommended object in the original video belongs;

according to the text segment and the target category, identifying effective explanation information of the text segment to obtain a candidate text segment containing the effective explanation information;

and processing the original video according to the candidate text segments to obtain the target video.

3. The video generation method according to claim 2, wherein a knowledge graph is pre-constructed, the knowledge graph includes images of recommended objects in at least one category, and the predicting of the category of the recommended object in the original video according to the image identification information to obtain the target category to which the recommended object in the original video belongs includes:

and carrying out image matching on the image of the recommended object under at least one category in the knowledge graph and the image identification information, and determining the target category according to an image matching result.

4. The video generation method according to claim 3, wherein the knowledge graph further includes attributes of recommended objects in at least one category, and the identifying of effective explanation information for the text segment according to the text segment and the target category to obtain a candidate text segment containing effective explanation information includes:

in the knowledge graph, acquiring attributes of recommended objects under the target category;

and performing text matching on the text segment and the attribute of the recommendation object in the target category to obtain a text segment containing the attribute of the recommendation object in the target category, and determining the candidate text segment as the text segment containing the attribute of the recommendation object in the target category.

5. The video generation method according to any one of claims 2 to 4, wherein the processing the original video according to the candidate text segment to obtain the target video comprises:

screening the candidate text segments according to the content quality requirement;

and processing the original video according to the screened candidate text segments to obtain the target video.

6. The video generation method according to any one of claims 2 to 4, wherein the candidate text segment is labeled with time information, and the processing the original video according to the candidate text segment to obtain the target video includes:

and processing the original video according to the image identification information and the time information marked on the candidate text segment to obtain the target video.

7. The video generation method according to any one of claim 6, wherein the processing the original video according to the image identification information and the time information labeled on the candidate text segment to obtain the target video comprises:

extracting a video segment corresponding to the candidate text segment from the original video according to the time information marked on the candidate text segment;

and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain the target video.

8. A video generation method is applied to a client and is characterized by comprising the following steps:

responding to the interactive operation of a user for an original video related to a recommended object, and sending a video generation request to a server to request for video generation based on the original video;

receiving at least one target video returned by a server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by processing the original video based on image identification information of image frames in the original video and text segments corresponding to voices in the original video.

9. A video generation device applied to a server is characterized by comprising:

the system comprises an acquisition unit, a recommendation unit and a recommendation unit, wherein the acquisition unit is used for responding to a video generation request of a client and acquiring an original video related to a recommended object;

the recognition unit is used for performing multi-mode feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to voices in the original video;

and the extracting unit is used for processing the original video according to the image identification information and the text segment to obtain at least one target video, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video.

10. A video generation device applied to a client side comprises:

the device comprises a sending unit, a recommendation unit and a recommendation unit, wherein the sending unit is used for responding to the interactive operation of a user on an original video related to a recommended object and sending a video generation request to a server so as to request the video generation based on the original video;

the receiving unit is used for receiving at least one target video returned by the server, the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by processing the original video based on image identification information of image frames in the original video and text segments corresponding to voices in the original video.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,

the memory stores instructions executable by the at least one processor to enable the electronic device to perform the video generation method of any of claims 1 to 8.

12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the video generation method of any one of claims 1 to 8.