CN115022732B - Video generation method, device, equipment and medium - Google Patents

Video generation method, device, equipment and medium

Info

Publication number
CN115022732B
CN115022732B (application CN202210583689.XA)
Authority
CN
China
Prior art keywords
video
original video
target
original
image
Prior art date
Legal status
Active
Application number
CN202210583689.XA
Other languages
Chinese (zh)
Other versions
CN115022732A (en)
Inventor
贺欣
谢佳雯
陈建宇
吴春松
刘延朋
常小军
熊成
刘成
赵翊腾
姜永刚
李金�
陈炳辉
包季真
黄博翔
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210583689.XA
Publication of CN115022732A
Application granted
Publication of CN115022732B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668 Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video generation method, apparatus, device, and medium. The video generation method includes: in response to a video generation request from a client, acquiring an original video related to a recommended object; performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video; and processing the original video according to the image recognition information and the text segments to obtain at least one target video, where a target video is a single video clip from the original video or a combination of multiple video clips from it. Target videos are thus extracted automatically from the original video, without a user manually clipping it into one or more target videos, which improves video generation efficiency and reduces video generation cost; at the same time, because the extraction uses the image recognition information and text segments obtained by multi-modal feature recognition, video generation quality is ensured.

Description

Video generation method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for generating video.
Background
In recent years, with the rapid development of mobile internet technology and related infrastructure, mobile internet users have become increasingly accustomed to watching short videos, and short-video applications occupy most of their online time. Against this background, short-video business in e-commerce has grown rapidly, and short videos bring merchants a large amount of free product-promotion traffic online.
In the related art, a merchant starts from an original video that is often several hours long and manually performs complex editing operations on it in an editing tool. This process consumes a great deal of editors' time, so short videos are produced inefficiently and at high cost; as a result, some merchants cannot afford to start, or must scale back, short-video operations.
Therefore, how to produce high-quality short videos efficiently and at low cost is a problem that currently needs to be solved.
Disclosure of Invention
The application provides a video generation method, apparatus, device, and medium to solve the problem of producing high-quality short videos efficiently and at low cost.
In a first aspect, an embodiment of the present application provides a video generation method applied to a server, including: in response to a video generation request from a client, acquiring an original video related to a recommended object; performing multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video; and processing the original video according to the image recognition information and the text segments to obtain at least one target video, where the target video is a single video clip from the original video or a combination of multiple video clips from the original video.
In a second aspect, an embodiment of the present application provides a video generation method applied to a client, including: in response to a user's interactive operation on an original video related to a recommended object, sending a video generation request to a server to request video generation based on the original video; and receiving at least one target video returned by the server, where the target video is a single video clip from the original video or a combination of multiple video clips from the original video, and is obtained by processing the original video based on image recognition information of image frames in the original video and text segments corresponding to speech in the original video.
In a third aspect, an embodiment of the present application provides a video generation apparatus, including: an acquisition unit, configured to acquire an original video related to a recommended object in response to a video generation request from a client; a recognition unit, configured to perform multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video; and an extraction unit, configured to process the original video according to the image recognition information and the text segments to obtain at least one target video, where the target video is a single video clip from the original video or a combination of multiple video clips from the original video.
In a fourth aspect, an embodiment of the present application provides a video generation apparatus, including: a sending unit, configured to send a video generation request to a server in response to a user's interactive operation on an original video related to a recommended object, to request video generation based on the original video; and a receiving unit, configured to receive at least one target video returned by the server, where the target video is a single video clip from the original video or a combination of multiple video clips from the original video, and is obtained by processing the original video based on image recognition information of image frames in the original video and text segments corresponding to speech in the original video.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; where the memory stores instructions executable by the at least one processor to enable the electronic device to perform the video generation method provided in the first and/or second aspect of the application.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video generation method provided in the first and/or second aspect of the present application.
In a seventh aspect, an embodiment of the present application provides a computer program product, including a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the electronic device to perform the video generation method provided in the first and/or second aspect of the application.
As can be seen from the above technical solutions, in the embodiments of the present application, multi-modal feature recognition is performed on an original video related to a recommended object to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video, and the original video is processed according to the image recognition information and the text segments to obtain at least one target video, where a target video is a single video clip from the original video or a combination of multiple video clips from it. The embodiments thus achieve automatic extraction of target videos, that is, automatic extraction of short videos, improving short-video generation efficiency and reducing short-video generation cost; moreover, based on the image information and text information obtained by multi-modal feature recognition, short videos containing valid content can be extracted from the original video, improving short-video quality.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of a video generation method according to an embodiment of the present application;
fig. 2 is a first flowchart of a video generation method according to an embodiment of the present application;
fig. 3 is a second flowchart of a video generation method according to an embodiment of the present application;
fig. 4 is a third flowchart of a video generation method according to an embodiment of the present application;
fig. 5 is a block diagram of a video generation apparatus 50 according to an embodiment of the present application;
fig. 6 is a block diagram of a video generation apparatus 60 according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application.
Specific embodiments of the present application are shown in the above drawings and are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the concepts of the application to those skilled in the art by reference to specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are plainly only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of the application and in the above drawings are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such a process, method, article, or apparatus.
First, some terms involved in the embodiments of the present application are explained:
Short video: a video whose duration is below a duration threshold (e.g., 5 minutes or 10 minutes); the threshold may differ from one short-video application to another. In the field of e-commerce, short videos introduce goods to users quickly, saving users' time while highlighting the goods' characteristics.
In the related art, a short video is generated by manually editing an original video, which can take up to several hours, so short videos are produced inefficiently and at high cost. If the original video is instead simply divided into several short videos, their quality cannot be guaranteed.
To solve the above problems, embodiments of the present application provide a video generation method, apparatus, device, and medium. In the embodiments, multi-modal feature recognition is performed on an original video related to a recommended object to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video, and the original video is processed based on both to obtain at least one target video. A target video of shorter duration is thus extracted automatically from the original video; in other words, short videos are extracted automatically, without manual editing, which improves short-video generation efficiency and reduces generation cost. Moreover, based on the image information of the frames and the text information of the speech, short videos containing valid content can be extracted from the original video, improving short-video quality. The embodiments therefore effectively solve the problem of generating high-quality short videos efficiently and at low cost.
Optionally, the recommended object includes a merchandise object. Target videos related to the merchandise object are then extracted automatically from original videos related to it, which improves both the generation efficiency and the quality of the merchandise object's short videos while reducing their generation cost. With the embodiments of the application, merchants can therefore quickly produce product-promoting short videos at nearly zero cost and increase store sales, and the larger supply of short videos better meets consumers' viewing demand.
Optionally, the original video related to the recommended object includes a live video introducing the recommended object. The recommended object's target videos are then extracted automatically from the live video: the live video already introduces the object's characteristics, and short videos are generated from it automatically, so the user does not need to shoot dedicated footage for the short videos. This effectively improves short-video generation efficiency, ensures short-video quality, and reduces generation cost. In particular, when the recommended object includes a merchandise object, the original video includes a live video introducing the merchandise object.
Fig. 1 is an application scenario schematic diagram of a video generating method according to an embodiment of the present application. As shown in fig. 1, the apparatus performing the video generating method is a video generating apparatus, and the video generating apparatus may be connected to a client.
A client may be any computing device with some data processing capability. Its basic structure may include at least one processor; the number of processors depends on the client's configuration and type. The client may also include memory, which may be volatile (such as RAM), non-volatile (such as read-only memory (ROM) or flash memory), or both; the memory typically stores an operating system (OS), one or more application programs, and program data. Besides the processing unit and the memory, the client includes basic components such as a network card chip, an IO bus, a display component, and peripheral devices; optionally, the peripheral devices may include a keyboard, a mouse, a stylus, and a printer, and other peripheral devices well known in the art are not described here. Optionally, the client may be a PC (personal computer) terminal or a handheld terminal (e.g., a smartphone or tablet computer).
The video generating apparatus may be a device that provides video processing services in a network virtual environment. Optionally, it may be a device on which models for multi-modal feature recognition are deployed; in the video generating apparatus, the image frames of the original video and the speech in the video can be recognized based on the deployed models, and the original video can then be processed based on the recognition results to obtain at least one target video.
In physical implementation, the video generating apparatus may be any device capable of providing computing services, responding to service requests, and performing processing, for example a cluster server, a conventional server, a cloud host, or a virtual center. It mainly comprises a processor, a hard disk, memory, a system bus, and the like, similar to a general computer architecture.
The client may have a network connection to the video generating apparatus, which may be wireless or wired. If the client communicates with the video generating apparatus over a mobile network, the network standard may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, 5G, and the like.
In the embodiment of the application, the client can send a video generation request to the video generating apparatus to request that it extract video clips from the original video and obtain at least one target video. Optionally, the video generating apparatus may return to the client a message that the target videos have been extracted, or may return to the client at least one target video extracted from the original video.
Preferably, in one application scenario, the client sends a video generation request to the video generating apparatus to request that it generate short videos of a merchandise object based on the merchandise object's live video; the video generating apparatus, in response to the request, acquires the live video and extracts at least one short video from it by multi-modal feature recognition. Afterwards, optionally, the video generating apparatus returns to the client a message that short-video extraction is complete, and may also return at least one short video extracted from the live video.
The above is just one exemplary application scenario. Besides videos of merchandise objects, the embodiments of the application can be applied to video extraction for other recommended objects on the Internet, such as video extraction for recommending television series or books.
The technical scheme of the application is described in detail below through specific embodiments with reference to the accompanying drawings. It should be noted that the following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a flowchart of a video generation method according to an embodiment of the present application; the method may be applied to a server. As shown in fig. 2, the video generation method provided by the embodiment of the present application includes:
S201: in response to a video generation request from a client, acquire an original video related to a recommended object.
The video generation request is used to request extraction of at least one target video from the original video related to the recommended object. Accordingly, the request may include video information of the original video (such as its name, storage address, or capture time) and/or the original video itself, so that the server can accurately obtain the original video from the request.
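For concreteness, the following is a minimal sketch of what such a video generation request might look like if carried as JSON over HTTP. The endpoint URL, the field names, and the use of the requests library are illustrative assumptions; the embodiment only requires that the request carry enough video information for the server to locate the original video.

```python
# Hypothetical request payload; the endpoint URL and every field name
# are illustrative, not part of the patented method.
import requests

request_body = {
    "video_name": "live_20220527_store123.mp4",     # video name
    "video_url": "oss://bucket/live/store123.mp4",  # video storage address
    "capture_time": "2022-05-27T20:00:00+08:00",    # video capture time
    "recommended_object_id": "item-98765",          # the recommended (merchandise) object
}
resp = requests.post("https://video-gen.example.com/v1/generate", json=request_body)
print(resp.status_code)
```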
An original video being related to the recommended object means that its content concerns the recommended object; in particular, the original video may show, introduce, and/or comment on the recommended object. As described above, it may optionally include a live video of the recommended object, so when the recommended object is a merchandise object, the original video may include a live video of the merchandise object.
In this embodiment, the client may send the video generation request to the server in response to a user's interactive operation. After receiving the request, the server may, in response to it, obtain the original video from the client, from another device (for example, a storage device), or from the server's own local storage. There may be one or more original videos related to the recommended object.
In one possible implementation, the client sends the video generation request to the server in response to the end of shooting of the original video related to the recommended object, so that target videos are extracted promptly once shooting ends, improving video generation efficiency and user experience. For example, a user live-streams and records the live video on a client and clicks to end recording when the stream finishes; the client, in response to this click, sends a video generation request to the server, and the server automatically extracts at least one target video from the recorded live video at the back end.
In one possible implementation, the server obtains the original video from the client: it may take the original video from the video generation request itself, or it may respond to the request by returning a video acquisition request to the client and receive the original video that the client returns in response.
S202: perform multi-modal feature recognition on the original video to obtain image recognition information of image frames in the original video and text segments corresponding to speech in the original video.
Multi-modal features are features in multiple modalities (e.g., image, text, speech).
The image recognition information of an image frame may include object information of target objects recognized in the frame. A target object may be a recommended object and/or a person; the object information of a recommended object may include its image position and initial category, and the object information of a person may include one or more of the person's image position, gender, and face image.
For example, the image position and initial category of the recommended object are recognized in some image frames, the image position of a person is recognized in others, and the recommended object's image position and initial category together with a person's image position are recognized in still others.
Alternatively, when the original video related to the recommended object is a live video of a merchandise object, the image recognition information of an image frame may include object information of the merchandise object and/or object information of a person in the frame. The object information of the merchandise object may include its initial category (such as coat, pants, or shoes) and its image position.
The original video may include multiple segments of speech, from which one or more text segments may be recognized; each text segment records the text spoken in the corresponding speech.
In this embodiment, different feature recognition models may be used for features of different modalities: an image recognition model recognizes the image frames of the original video, yielding their image recognition information, and a speech recognition model recognizes the speech, yielding the corresponding text segments. The specific structures of the image recognition model and the speech recognition model are not limited here.
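As an illustration of this step, the following is a minimal sketch assuming OpenCV for frame sampling; detect_objects and transcribe are hypothetical placeholders standing in for the image recognition model and the speech recognition model, whose architectures the embodiment deliberately leaves open.

```python
import cv2
from dataclasses import dataclass

@dataclass
class FrameInfo:
    time_sec: float
    objects: list   # e.g. [{"category": "coat", "bbox": (x, y, w, h)}]

@dataclass
class TextSegment:
    start: float    # seconds
    end: float
    text: str

def detect_objects(frame):
    # placeholder: plug in the deployed image recognition model here
    return []

def transcribe(path):
    # placeholder: plug in the speech recognition model here; expected
    # to yield (start_sec, end_sec, text) tuples with time information
    return []

def recognize_video(path: str, sample_fps: float = 1.0):
    """Multi-modal feature recognition: per-frame image recognition
    information plus timestamped text segments for the speech."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, round(fps / sample_fps))  # sample roughly one frame per second
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(FrameInfo(idx / fps, detect_objects(frame)))
        idx += 1
    cap.release()
    segments = [TextSegment(s, e, t) for (s, e, t) in transcribe(path)]
    return frames, segments
```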
S203: process the original video according to the image recognition information of the image frames and the text segments corresponding to the speech, to obtain at least one target video.
A target video is a single video clip from the original video or a combination of multiple video clips from it. When there are multiple target videos, all of them may be single clips, all may be combinations of multiple clips, or some may be single clips while the rest are combinations.
In this embodiment, not all content in the original video is related to the recommended object. Since a target video is shorter than the original video, content unrelated to the recommended object can be regarded as invalid and content related to it as valid; judging which content is valid then ensures the video quality of the target videos. Because the image recognition information of the frames and the text segments corresponding to the speech reflect the original video's content, the valid content can be identified from them, and the original video processed on that basis to obtain at least one target video. During this processing, video clips can be extracted from the original video based on its valid content, and at least one target video finally obtained from the extracted clips.
In one possible implementation, the valid content in the original video may include valid explanation information, i.e., valid explanation information about the recommended object. For example, in a live video of a merchandise object, explanations of the object's advantages, materials, and other characteristics are valid explanation information, while interactions with the comment stream are not. Candidate text segments containing valid explanation information can therefore be determined among the text segments based on the image recognition information of the frames and the text segments corresponding to the speech, and the original video processed based on the candidate segments to obtain the target videos. Determining the candidate text segments improves the accuracy of identifying valid content, which improves the accuracy of extracting target videos and ensures their quality.
In this implementation, since the image recognition information of a frame includes the recommended object's initial category and image position, it can be used to judge whether a text segment contains explanation information related to the recommended object; if so, the segment is a candidate text segment containing valid explanation information. For example, if the initial category in the image recognition information is "coat", it is judged whether the text segment contains explanation information related to the coat, such as its size, color, or texture; if so, the segment is a candidate text segment. This improves the accuracy of screening candidate text segments among the text segments and thus the quality of the target videos.
In yet another possible implementation, candidate image frames containing valid image content (such as frames in which the recommended object is recognized) may be determined from the image recognition information of the frames, and candidate text segments containing valid explanation content determined from the text segments corresponding to the speech. At least one target video is then extracted from the original video by combining the candidate image frames and the candidate text segments, so that frame screening and text screening jointly improve target-video quality. In this combination, the target videos can be clipped from the original video based on the time information of the candidate image frames and the time information of the candidate text segments, as sketched below.
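One way to realize this combination is sketched below, under the assumption that candidate frames are reduced to timestamps and candidate text segments to (start, end) spans: keep only the spans near a candidate frame and merge spans separated by a small gap. The padding and gap values are illustrative.

```python
def clip_spans(candidate_frame_times, candidate_segment_spans, pad=1.0, gap=2.0):
    """candidate_frame_times: seconds of frames with valid image content;
    candidate_segment_spans: (start, end) spans of valid text segments."""
    spans = []
    for start, end in sorted(candidate_segment_spans):
        # keep a span only if some candidate frame falls inside it (padded)
        if any(start - pad <= t <= end + pad for t in candidate_frame_times):
            spans.append([start, end])
    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] <= gap:  # close enough: merge
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(span) for span in merged]

# clip_spans([12.0, 30.5], [(10, 25), (26, 40), (300, 310)]) -> [(10, 40)]
# the last span is dropped because no candidate frame falls inside it
```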
In the embodiment of the application, in response to a video generation request from a client, multi-modal feature recognition is performed on the original video related to the recommended object, yielding the image recognition information of its image frames and the text segments corresponding to its speech; valid content is identified from the image recognition information and the text segments, and at least one target video is extracted from the original video. Target videos are thus extracted automatically from the original video, and in particular short videos are extracted automatically from live video, which effectively improves short-video generation efficiency, reduces generation cost, and improves short-video quality; that is, high-quality short videos are generated efficiently and at low cost.
Fig. 3 is a second flowchart of a video generating method according to an embodiment of the present application. As shown in fig. 3, the video generating method provided by the embodiment of the present application includes:
s301, responding to a video generation request of a client, and acquiring an original video related to a recommended object.
S302, multi-mode feature recognition is carried out on the original video, and image recognition information of image frames in the original video and text fragments corresponding to voices in the original video are obtained.
The implementation principles and technical effects of S301 to S302 may refer to the foregoing embodiments, and are not repeated.
S303, carrying out category prediction on the recommended objects in the original video according to the image identification information of the image frames in the original video to obtain target categories to which the recommended objects in the original video belong.
In this embodiment, multi-modal feature recognition can recognize the recommended object's initial category in an image frame; in actual scenarios, however, recommended objects are often classified more finely by style and by the group they suit, and objects of different classifications call for different explanation wording. Taking a merchandise object as an example, multi-modal feature recognition may recognize the object in a frame as a coat, while an e-commerce platform further divides coats into categories such as women's wear, men's wear, children's wear, T-shirts, long sleeves, college style, and business wear, each suited to different explanation wording. After multi-modal feature recognition, therefore, the recommended object's category needs to be predicted further from the image recognition information of the frames, yielding its target category and improving the accuracy of judging valid explanation content.
In this embodiment, after the image recognition information of the frames is obtained, since it includes the recommended object's initial category and image position, category prediction can be performed on the recommended object appearing in the original video based on both, yielding its target category. In one approach, a category range is first obtained from the initial category in the image recognition information; for example, if the initial category is "coat", a range including women's coats, men's coats, and so on can be determined. The recommended object is then further recognized within that range based on its image position, yielding the target category and improving its accuracy.
In one possible implementation, the object information of persons in the image recognition information may assist the category prediction of the recommended object, further improving target-category accuracy. For example, if the merchandise object is a coat worn by a person, the person's gender, image position, face image, and so on can help determine whether it is a women's, men's, or children's coat.
In yet another possible implementation, the text segments corresponding to the speech may assist the category prediction, further improving target-category accuracy. Keywords related to the recommended object's category can be recognized in the text segments; category prediction then combines the keywords recognized in text segments whose time information overlaps that of a frame with the frame's image recognition information, yielding the target category. For example, if the recommended object is a merchandise object, the text segment contains "college style" and "girl", and the initial category in the frame's image recognition information is "shoe", the target category may be determined as college-style women's shoes.
In yet another possible implementation, a knowledge graph is constructed in advance that includes images of recommended objects under at least one category. In that case, S303 includes: performing image matching between the images of recommended objects under at least one category in the knowledge graph and the image recognition information of the frames, and determining the recommended object's target category from the matching result. The pre-built knowledge graph and image matching thus improve target-category accuracy.
In the knowledge graph, one category may correspond to one or more images of recommended objects. The image recognition information of a frame may be embodied as an image, for example the frame annotated with the recommended object's image position and initial category, a person's image position and gender, and so on.
In this implementation, image matching between the images of recommended objects under at least one category and the image recognition information of a frame yields their similarities. The image most similar to the frame's image recognition information is then found among those images, and the recommended object's target category in that frame is the category to which the most similar image belongs.
When the knowledge graph contains many recommended objects, matching the images under every category one by one against the image recognition information of a frame makes category prediction inefficient. To improve efficiency, S303 may instead proceed as follows: obtain from the knowledge graph the images of recommended objects under candidate categories, according to the initial category in the frame's image recognition information; then match those images against the frame's image recognition information and determine the target category from the matching result.
The candidate categories are associated with initial categories, and one initial category may correspond to one or more candidate categories. Further, a candidate category may belong to the initial category; in other words, it may be a sub-category of the initial category. For example, when the initial category is "shoe", the candidate categories may include "high-heeled shoes", "cloth shoes", "athletic shoes", "leather shoes", and the like.
In this alternative, at least one candidate category is determined, according to the initial category in the frame's image recognition information, from the categories of recommended objects covered by the knowledge graph, and the images of recommended objects under the candidate categories are retrieved from the graph. Those images are matched against the frame's image recognition information to obtain their similarities; the most similar image is found among them, and the recommended object's target category in the frame is the category of that image. Restricting the search by initial category thus improves both the efficiency and the accuracy of determining the target category, as the sketch below illustrates.
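A minimal sketch of this matching, assuming each category in the knowledge graph stores its reference images as feature vectors and that frames are embedded into the same vector space; the graph layout and the use of cosine similarity are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_target_category(frame_vec, graph, initial_category):
    """Restrict matching to sub-categories of the initial category,
    e.g. "shoe" -> ["high-heeled shoes", "athletic shoes", ...]."""
    best_cat, best_sim = None, -1.0
    for cat in graph["children"].get(initial_category, []):
        for ref_vec in graph["images"].get(cat, []):  # reference images per category
            sim = cosine(frame_vec, ref_vec)
            if sim > best_sim:
                best_cat, best_sim = cat, sim
    return best_cat, best_sim   # category of the most similar image
```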
Optionally, when the recommended object is a merchandise object, the knowledge graph is a merchandise knowledge graph, which may include images of merchandise objects under at least one category. Categories such as men's wear, women's wear, household items, and home appliances may be further subdivided and are not detailed here.
S304: according to the text segments corresponding to the speech in the original video and the target category of the recommended object, identify among the text segments the candidate text segments containing valid explanation information.
In this embodiment, recommended objects of different categories have different characteristics, that is, different attributes; for example, a garment's attributes may include "slimming to wear", "all-cotton material", or "silk material", while a shoe's attributes may include "true to size", "good shock absorption and cushioning", or "anti-slip sole". After the text segments and the recommended object's target category are obtained, the text segments can be screened against the attributes corresponding to the target category: the segments containing valid explanation information are identified and determined to be the candidate text segments.
In one possible implementation, a knowledge graph is constructed in advance that includes the attributes of recommended objects under at least one category; it may be the knowledge graph of the foregoing embodiments, containing both the images and the attributes of recommended objects under at least one category. When the knowledge graph is a merchandise knowledge graph, the attributes may further include the merchandise object's selling points. Taking clothing as an example, the merchandise knowledge graph may include attributes such as the merchandise object's material, style, color, and cut, and may also include the selling points to be introduced for it during a live stream.
With the knowledge graph including attributes of recommended objects under at least one category, S304 includes: obtaining from the knowledge graph the attributes of recommended objects under the target category; performing text matching between the text segments and those attributes to obtain the segments containing them; and determining those segments to be the candidate text segments. Combining the knowledge graph with text matching improves the accuracy of screening candidate text segments among the text segments and thus target-video quality.
In this implementation, after the recommended object's target category is determined, its attributes under the target category are retrieved from the knowledge graph; each text segment is matched against those attributes to determine whether it contains any attribute of the recommended object under the target category; if it does, the segment is considered to contain valid explanation information and is determined to be a candidate text segment.
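A sketch of this attribute matching follows; plain substring matching is a simplifying assumption (a real system might use fuzzy or embedding-based matching), and the attribute examples are illustrative.

```python
def select_candidate_segments(segments, target_category, graph):
    """Keep segments that mention any attribute the knowledge graph
    stores for the target category (valid explanation information)."""
    attributes = graph["attributes"].get(target_category, [])
    return [seg for seg in segments           # seg has .text, .start, .end
            if any(attr in seg.text for attr in attributes)]

# e.g. with attributes ["all cotton", "slimming"] for "women's coat",
# the segment "this coat is all cotton and very slimming" is kept
```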
S305: process the original video according to the candidate text segments to obtain the target videos.
The text segments corresponding to the speech may be annotated with time information, which may include a start time and an end time, a start time and a duration, or a duration and an end time. Because each text segment corresponds to speech in the original video, the speech's time information can be copied onto the corresponding segment while the speech is being recognized.
In this embodiment, after the candidate text segments are determined, video clips can be extracted from the original video based on the time information annotated on them, and the target videos obtained from the extracted clips. In one approach, the clip between a candidate segment's annotated times is extracted from the original video as a target video; since the candidate segment contains valid explanation content, the target video contains valid video content.
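For example, a clip between a candidate segment's start and end times could be cut with ffmpeg using stream copy (no re-encoding); the tool choice and output naming are illustrative, as the embodiment does not prescribe a cutting method.

```python
import subprocess

def cut_clip(src: str, start: float, end: float, out: str) -> None:
    # copy the streams between start and end (seconds) without re-encoding
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
         "-c", "copy", out],
        check=True,
    )

# cut_clip("live.mp4", 10.0, 40.0, "clip_001.mp4")
```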
In one possible implementation, S305 includes: screening the candidate text segments against a content quality requirement; and processing the original video according to the screened candidate segments to obtain the target videos. After the candidate segments are obtained, they are thus filtered further, removing those that fail the content quality requirement and further improving target-video quality.
The content quality requirement can be embodied as preset words or sentences that harm the introduction of the recommended object. When the recommended object is a merchandise object, the requirement can be embodied as preset words or sentences that harm the product presentation, such as those concerning the product's price, the host's interaction with the audience during the live stream, or the host's interaction with guests during the live stream.
In this implementation, the words or sentences in the content quality requirement are matched against each candidate text segment to determine whether any of them appears in it. If so, the segment fails the content quality requirement; otherwise it passes. The failing segments are deleted from the candidates, yielding the screened candidate segments, i.e., the candidates with low-quality segments removed. Video clips are then extracted from the original video according to the time information on the screened segments, and the target videos obtained from those clips.
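A sketch of this screen, assuming the content quality requirement is a simple block list of words and sentences; the list entries are illustrative examples only.

```python
BLOCKED = ["price", "thanks for the follow", "welcome to the stream"]  # illustrative

def passes_quality(text: str, blocked=BLOCKED) -> bool:
    # a candidate text segment fails if any blocked word or sentence appears
    return not any(term in text for term in blocked)

# screened = [seg for seg in candidate_segments if passes_quality(seg.text)]
```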
In one possible implementation, S305 includes: processing the original video according to the image recognition information of the frames and the time information annotated on the candidate text segments, to obtain the target videos.
In this implementation, if target videos were extracted solely from the time information on the candidate text segments, a target video might be too short; for example, the total span annotated on one candidate segment may be under one minute. To avoid this, the clips corresponding to the candidate segments can be extracted using their time information and then merged into at least one target video based on the image recognition information of the frames, further improving target-video quality. The image recognition information serves to find associated clips during merging; associated clips may include at least one of: adjacent clips, clips of similar style, and clips describing the same recommended object.
Thus, further optionally, S305 includes: extracting, according to the time information annotated on the candidate text segments, the clips corresponding to them from the original video; and merging the clips that describe the same recommended object into the same video according to the image recognition information of the frames in the clips, obtaining the target videos. Each target video then describes one recommended object, which improves target-video quality.
In this optional manner, image matching may be performed pairwise on the image frames of the video clips to obtain the similarity between the image frames of each pair of clips. If the similarity between the image frames of a pair of clips is greater than a similarity threshold, the pair is determined to describe the same recommended object and is merged into the same video, yielding a new video clip. When there are many video clips, the merging operation may be performed multiple times, finally producing multiple target videos. It should be noted that, during merging, the video clips may be ordered chronologically based on the time information marked on them, which further improves the quality of the target video.
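A minimal sketch of the pairwise merge decision follows. Approximating frame similarity with an OpenCV colour-histogram comparison, sampling one representative frame per clip, and the threshold value are all assumptions of the example, not requirements of this embodiment; any image-matching model could stand in.

```python
import cv2

SIMILARITY_THRESHOLD = 0.8  # assumed value; tune on real data

def representative_frame(clip_path):
    """Grab the middle frame of a clip as its representative image."""
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, total // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"cannot read a frame from {clip_path}")
    return frame

def describe_same_object(clip_a, clip_b):
    """Return True if two clips likely show the same recommended object."""
    hists = []
    for clip in (clip_a, clip_b):
        frame = representative_frame(clip)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hists.append(cv2.normalize(hist, hist).flatten())
    score = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return score > SIMILARITY_THRESHOLD
```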
Further optionally, the video clips involved in the merging may be the video clips screened based on the content quality requirement. In this way, clips that meet the quality requirement are merged into the target video based on the image identification information of their image frames, improving the quality of the target video.
In the embodiment of the application, multi-mode feature recognition is performed on the original video related to the recommended object to obtain the image identification information of the image frames in the original video and the text segments corresponding to the voice in the original video; the target category to which the recommended object in the original video belongs is predicted based on the image identification information; candidate text segments containing effective explanation information are identified among the text segments based on the target category; and the target video is extracted from the original video based on the candidate text segments. In the process of automatically generating the target video, the image information and the text information of the original video are thus fully utilized, which improves the quality of the target video, effectively improves the generation efficiency and quality of short videos, and reduces the generation cost of short videos.
Optionally, in the process of extracting the target video, the text segments may also be used to generate subtitles for the target video. While extracting the target video from the original video according to the candidate text segments, subtitles for the target video may be generated based on the candidate text segments and merged into the target video. A complete, subtitled target video is thus generated, improving the quality of the target video.
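For illustration, the subtitles can be serialized from the candidate text segments in the standard SRT format; the assumption that segment times have already been rebased to the target video's own timeline belongs to the example, not to the embodiment.

```python
def to_srt(segments):
    """Serialize segments ({"text", "start", "end"}, seconds) as SRT."""
    def fmt(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, 1):
        lines += [str(i),
                  f"{fmt(seg['start'])} --> {fmt(seg['end'])}",
                  seg["text"],
                  ""]
    return "\n".join(lines)

# ffmpeg can then burn the subtitles into the target video, e.g.:
#   ffmpeg -i target.mp4 -vf subtitles=target.srt target_with_subs.mp4
```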
Optionally, after extracting the target video, the server may return a message to the client indicating that video extraction is complete, or may return the target video itself to the client. The client thus learns in time that video extraction is complete and can obtain the target video promptly. After obtaining the target video, the client may publish it directly, or may first perform optimization such as further selection and clipping and publish the optimized target video. The embodiment of the application thereby effectively saves the time required to extract the target video from the original video and reduces the workload of editing personnel.
Optionally, based on any of the foregoing embodiments, the image positions of the person and of the face image in the image identification information of the image frames may be used to assist in generating a cover image for the target video, improving its quality and completeness. In one mode, after the target video is extracted, an image frame in which the face image and the human body are centered is selected from the image frames of the target video based on their image identification information and taken as the cover image of the target video; alternatively, that frame is cropped to a preset size or preset proportion, and the cropped frame is taken as the cover image.
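The cover-image selection can be sketched as below. The per-frame record carrying face and body bounding boxes, and the scoring rule (sum of the normalized distances of the box centres from the frame centre), are one illustrative reading of "face image centered and human body centered".

```python
def centre_offset(box, frame_w, frame_h):
    """Normalized distance of a box centre from the frame centre.

    box is (x, y, w, h) in pixels.
    """
    cx = box[0] + box[2] / 2
    cy = box[1] + box[3] / 2
    return abs(cx / frame_w - 0.5) + abs(cy / frame_h - 0.5)

def pick_cover_frame(frames):
    """Pick the frame whose face and body are most centred.

    frames: list of dicts with keys "face_box", "body_box", "width",
    "height" (an assumed shape of the image identification information).
    Raises ValueError if no frame carries both boxes.
    """
    return min(
        (f for f in frames if f["face_box"] and f["body_box"]),
        key=lambda f: centre_offset(f["face_box"], f["width"], f["height"])
                    + centre_offset(f["body_box"], f["width"], f["height"]),
    )
```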
Fig. 4 is a flowchart illustrating a video generation method according to an embodiment of the present application, where the method is applied to a client. As shown in fig. 4, the video generation method includes:
S401, in response to an interactive operation of the user on the original video related to the recommended object, sending a video generation request to a server to request video generation based on the original video.
For the original video related to the recommended object, reference may be made to the foregoing embodiments; details are not repeated here.
The interactive operation may be an interactive operation performed by a user on a display window of the original video, an interactive operation performed by a user on a shooting window of the original video, or an input operation performed by a user in a video input area for inputting the original video.
In this embodiment, upon detecting an interactive operation of the user on the original video related to the recommended object, the client generates a video generation request and sends it to the server to request that the server generate a video, in particular a short video, based on the original video.
For example, take the original video to be a live video recorded on the client. After the live broadcast ends, the user clicks a button to finish live video recording. When the client detects that the user has clicked this button, or when the client receives a message indicating that live video recording has finished, the client generates a video generation request and sends it to the server; after receiving the request, the server processes the live video in the background into one or more short videos.
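A hedged sketch of the client-side request, using the Python requests library, is given below; the endpoint path, payload fields, and returned task identifier are hypothetical, since the embodiment only requires that a video generation request reach the server.

```python
import requests

def request_video_generation(server_url, live_session_id):
    """Ask the server to cut short videos from a finished live stream."""
    resp = requests.post(
        f"{server_url}/video-generation",      # hypothetical endpoint
        json={"source": "live", "session_id": live_session_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. a task id the client can poll for target videos
```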
S402, receiving at least one target video returned by the server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video.
The target video is obtained by processing the original video based on the image identification information of the image frames in the original video and the text segments corresponding to the voice in the original video; for the specific processing procedure, reference may be made to any of the foregoing embodiments, and details are not repeated here.
In this embodiment, after generating at least one target video based on the original video, the server may send the generated target video(s) to the client. After receiving the at least one target video returned by the server, the client may present a prompt message informing the user that at least one target video has been generated based on the original video, and may also display the target video. The user can further beautify the target video on the client (for example, by further manual editing) or publish it directly.
In the embodiment of the application, the client, in response to the user's interactive operation on the original video related to the recommended object, sends a video generation request to the server so that the original video is processed in the background to obtain at least one target video, and then receives the at least one target video returned by the server. A user can therefore obtain the target video extracted from the original video, in particular a short video extracted from a live video, with only a simple interactive operation, which effectively improves both the generation efficiency of short videos and the user experience.
Fig. 5 is a block diagram of a video generating apparatus 50 according to an embodiment of the present application, where the video generating apparatus 50 is applied to a server. As shown in fig. 5, a video generating apparatus 50 provided in an embodiment of the present application includes: an acquisition unit 51, an identification unit 52, and an extraction unit 53, wherein:
the acquisition unit 51 is configured to acquire, in response to a video generation request of a client, an original video related to a recommended object;
the identification unit 52 is configured to perform multi-mode feature recognition on the original video to obtain image identification information of the image frames in the original video and the text segments corresponding to the voice in the original video;
the extraction unit 53 is configured to process the original video according to the image identification information and the text segments to obtain at least one target video, where the target video is one video segment in the original video or a combination of multiple video segments in the original video.
In one embodiment of the application, the extraction unit 53 is specifically configured to: according to the image identification information, carrying out category prediction on the recommended objects in the original video to obtain target categories to which the recommended objects in the original video belong; according to the text fragments and the target categories, identifying the text fragments with effective explanation information to obtain candidate text fragments containing the effective explanation information; and processing the original video according to the candidate text fragments to obtain a target video.
In one embodiment of the present application, a knowledge graph is pre-constructed, the knowledge graph includes at least one image of a recommended object under a category, and in a process of performing category prediction on the recommended object in an original video according to image identification information to obtain a target category to which the recommended object in the original video belongs, the extraction unit 53 is specifically configured to: and performing image matching on the image of the recommended object under at least one category in the knowledge graph and the image identification information, and determining a target category according to an image matching result.
In one embodiment of the present application, the knowledge graph further includes attributes of recommended objects under at least one category, and the extraction unit 53 is specifically configured to: acquiring the attributes of the recommended object under the target category in the knowledge graph; and performing text matching between the text segments and the attributes of the recommended object under the target category to obtain text segments containing those attributes, and determining the candidate text segments to be the text segments containing the attributes of the recommended object under the target category.
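As a non-limiting sketch of this attribute-matching step, the fragment below keeps text segments that mention at least one attribute of the recommended object under the target category; the knowledge-graph layout (a mapping from category to attribute strings) and the example categories are assumptions of the illustration.

```python
# Hypothetical knowledge-graph excerpt: category -> attribute strings.
KNOWLEDGE_GRAPH = {
    "down jacket": ["fill power", "duck down", "windproof", "detachable hood"],
    "sneaker": ["cushioning", "outsole", "breathable mesh"],
}

def candidate_segments(text_segments, target_category):
    """Keep text segments whose text mentions at least one attribute of the
    recommended object under the target category."""
    attributes = KNOWLEDGE_GRAPH.get(target_category, [])
    return [seg for seg in text_segments
            if any(attr in seg["text"].lower() for attr in attributes)]
```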
In one embodiment of the present application, in the process of processing the original video according to the candidate text segments to obtain the target video, the extraction unit 53 is specifically configured to: screening the candidate text segments according to the content quality requirement; and processing the original video according to the screened candidate text segments to obtain the target video.
In one embodiment of the present application, the candidate text segments are marked with time information, and the extraction unit 53 is specifically configured to: processing the original video according to the image identification information and the time information marked on the candidate text segments to obtain the target video.
In one embodiment of the present application, in the process of processing the original video according to the image identification information and the time information marked on the candidate text segments to obtain the target video, the extraction unit 53 is specifically configured to: extracting the video clips corresponding to the candidate text segments from the original video according to the time information marked on the candidate text segments; and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain the target video.
The video generating apparatus provided in the embodiment of the present application is configured to execute the technical scheme in any of the foregoing embodiments of the method executed on the server, and its implementation principle and technical effects are similar, and are not described herein again.
Optionally, the technical scheme provided by the embodiment of the application can be implemented on a cloud server.
Fig. 6 is a block diagram of a video generating apparatus 60 according to an embodiment of the present application, where the video generating apparatus 60 is applied to a client. As shown in fig. 6, a video generating apparatus 60 provided in an embodiment of the present application includes: a transmitting unit 61 and a receiving unit 62, wherein:
a transmitting unit 61 for transmitting a video generation request to the server to request video generation based on the original video in response to an interactive operation of the user with respect to the original video related to the recommended object;
the receiving unit 62 is configured to receive at least one target video returned by the server, where the target video is one video segment in the original video or a combination of multiple video segments in the original video, and the target video is obtained by processing the original video based on image identification information of an image frame in the original video and a text segment corresponding to voice in the original video.
The video generating apparatus provided in the embodiment of the present application is configured to execute the technical scheme in any of the foregoing embodiments of the method executed on the client, and its implementation principle and technical effects are similar, and are not described herein again.
Optionally, the technical solution provided by the embodiment of the present application may be implemented on a terminal.
Fig. 7 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application. The cloud server is used for running a video generation method and executing any of the method embodiments, so that at least one target video is automatically extracted from the original video, and the quality of the target video is ensured. As shown in fig. 7, the cloud server includes: a memory 73 and a processor 74.
Memory 73 is used to store computer programs and may be configured to store various other data to support operations on the cloud server. The memory 73 may be an object store (Object Storage Service, OSS).
The memory 73 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
A processor 74, coupled to the memory 73, is configured to execute the computer program in the memory 73 so as to perform the video generation method provided by any of the foregoing embodiments.
Further, as shown in fig. 7, the cloud server further includes: firewall 71, load balancer 72, communication component 75, power component 76, and other components. Only some components are schematically shown in fig. 7, which does not mean that the cloud server only includes the components shown in fig. 7.
Correspondingly, the embodiment of the application also provides electronic equipment, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the electronic device to perform the steps of the method embodiments described above.
Accordingly, embodiments of the present application also provide a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to carry out the steps of the above-described method embodiments.
Accordingly, embodiments of the present application also provide a computer program product comprising a computer program/instructions which, when executed by a processor, cause the processor to carry out the steps of the above-described method embodiments.
The communication component 75 of fig. 7 described above is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component 75 is located may access a wireless network based on a communication standard, such as WiFi, or a 2G, 3G, 4G/LTE, or 5G mobile communication network, or a combination thereof. In one exemplary embodiment, the communication component 75 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 75 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply assembly 76 of fig. 7 provides power to the various components of the device in which the power supply assembly 76 is located. The power supply components 76 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the devices in which the power supply components 76 are located.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (9)

1. A video generation method applied to a server, comprising:
responding to a video generation request of a client, and acquiring an original video related to a recommended object;
performing multi-mode feature recognition on the original video to obtain image recognition information of image frames in the original video and text fragments corresponding to voices in the original video;
according to the image identification information, carrying out category prediction on the recommended objects in the original video to obtain target categories to which the recommended objects in the original video belong;
identifying the text segments according to the attribute corresponding to the target category to which the recommended object in the original video belongs, so as to obtain candidate text segments containing effective explanation information, wherein the candidate text segments are marked with time information;
extracting video clips corresponding to the candidate text segments from the original video according to the time information marked on the candidate text segments;
and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain at least one target video, wherein the target video is one video clip in the original video or a combination of a plurality of video clips in the original video.
2. The method of claim 1, wherein a knowledge graph is pre-constructed, the knowledge graph includes at least one image of a recommended object under a category, the category prediction is performed on the recommended object in the original video according to the image identification information, and a target category to which the recommended object in the original video belongs is obtained, and the method comprises:
and performing image matching on the image of the recommended object under at least one category in the knowledge graph and the image identification information, and determining the target category according to an image matching result.
3. The method of claim 2, wherein the knowledge graph further includes attributes of recommended objects under at least one category, and the identifying the text segment according to the text segment and the target category to obtain a candidate text segment including effective explanation information includes:
acquiring the attribute of the recommended object under the target category in the knowledge graph;
and carrying out text matching on the text fragments and the attributes of the recommended objects under the target category to obtain text fragments containing the attributes of the recommended objects under the target category, and determining the candidate text fragments as text fragments containing the attributes of the recommended objects under the target category.
4. A video generation method according to any one of claims 1 to 3, wherein said processing the original video according to the candidate text segment to obtain the target video comprises:
screening the candidate text fragments according to content quality requirements;
and processing the original video according to the screened candidate text fragments to obtain the target video.
5. A video generation method applied to a client, comprising:
in response to the interactive operation of the user on the original video related to the recommended object, sending a video generation request to a server to request video generation based on the original video;
receiving at least one target video returned by a server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by merging video segments describing the same recommended object according to image identification information of image frames in the video segments, after the video segments corresponding to the candidate text segments are extracted from the original video according to time information marked on the candidate text segments; the candidate text segments are text segments containing effective explanation information, obtained according to the attribute corresponding to the target category to which the recommended object in the original video belongs, after the target category is obtained by carrying out category prediction on the recommended object in the original video according to the image identification information of the image frames in the original video.
6. A video generating apparatus applied to a server, comprising:
the acquisition unit is used for responding to the video generation request of the client and acquiring the original video related to the recommended object;
the recognition unit is used for carrying out multi-mode feature recognition on the original video to obtain image recognition information of image frames in the original video and text fragments corresponding to voices in the original video;
the extraction unit is used for carrying out category prediction on the recommended objects in the original video according to the image identification information to obtain target categories to which the recommended objects in the original video belong;
identifying the text segments according to the attribute corresponding to the target category to which the recommended object in the original video belongs, so as to obtain candidate text segments containing effective explanation information, wherein the candidate text segments are marked with time information;
extracting video clips corresponding to the candidate text clips from the original video according to the time information marked on the candidate text clips;
and merging the video clips describing the same recommended object into the same video according to the image identification information of the image frames in the video clips to obtain at least one target video, wherein the target video is one video clip in the original video or a combination of a plurality of video clips in the original video.
7. A video generating apparatus, applied to a client, comprising:
a transmitting unit configured to transmit a video generation request to a server to request video generation based on an original video related to a recommended object in response to an interactive operation of a user with respect to the original video;
the receiving unit is used for receiving at least one target video returned by the server, wherein the target video is one video segment in the original video or a combination of a plurality of video segments in the original video, and the target video is obtained by merging video segments describing the same recommended object according to image identification information of image frames in the video segments, after the video segments corresponding to the candidate text segments are extracted from the original video according to time information marked on the candidate text segments; the candidate text segments are text segments containing effective explanation information, obtained according to the attribute corresponding to the target category to which the recommended object in the original video belongs, after the target category is obtained by carrying out category prediction on the recommended object in the original video according to the image identification information of the image frames in the original video.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the electronic device to perform the video generation method of any one of claims 1 to 5.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the video generation method of any one of claims 1 to 5.
CN202210583689.XA 2022-05-25 2022-05-25 Video generation method, device, equipment and medium Active CN115022732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210583689.XA CN115022732B (en) 2022-05-25 2022-05-25 Video generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115022732A CN115022732A (en) 2022-09-06
CN115022732B true CN115022732B (en) 2023-11-03

Family

ID=83070823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210583689.XA Active CN115022732B (en) 2022-05-25 2022-05-25 Video generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115022732B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117880556A (en) * 2023-12-14 2024-04-12 北京联广通网络科技有限公司 Short video acceleration system and method of OSS cloud cluster

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746874B (en) * 2020-05-27 2024-04-05 百度在线网络技术(北京)有限公司 Voice package recommendation method, device, equipment and storage medium

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015066891A1 (en) * 2013-11-08 2015-05-14 Google Inc. Systems and methods for extracting and generating images for display content
JP2016167027A (en) * 2015-03-10 2016-09-15 株式会社プロフィールド Information processing device, information processing method, and program
CN110147699A (en) * 2018-04-12 2019-08-20 北京大学 A kind of image-recognizing method, device and relevant device
CN108833973A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 Extracting method, device and the computer equipment of video features
WO2020253657A1 (en) * 2019-06-17 2020-12-24 腾讯科技(深圳)有限公司 Video clip positioning method and apparatus, computer device, and storage medium
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
CN111901627A (en) * 2020-05-28 2020-11-06 北京大米科技有限公司 Video processing method and device, storage medium and electronic equipment
CN111988658A (en) * 2020-08-28 2020-11-24 网易(杭州)网络有限公司 Video generation method and device
CN111988663A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method, device and equipment for positioning video playing node and storage medium
CN112135201A (en) * 2020-08-29 2020-12-25 北京市商汤科技开发有限公司 Video production method and related device
CN114443938A (en) * 2020-11-02 2022-05-06 阿里巴巴集团控股有限公司 Multimedia information processing method and device, storage medium and processor
CN113297891A (en) * 2020-11-13 2021-08-24 阿里巴巴集团控股有限公司 Video information processing method and device and electronic equipment
CN113392687A (en) * 2020-11-27 2021-09-14 腾讯科技(北京)有限公司 Video title generation method and device, computer equipment and storage medium
CN112738556A (en) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and device
CN113838460A (en) * 2020-12-31 2021-12-24 京东科技控股股份有限公司 Video voice recognition method, device, equipment and storage medium
CN113824972A (en) * 2021-05-31 2021-12-21 腾讯科技(深圳)有限公司 Live video processing method, device and equipment and computer readable storage medium
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium
CN113852858A (en) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 Video processing method and electronic equipment
CN114245203A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Script-based video editing method, device, equipment and medium
CN114218488A (en) * 2021-12-16 2022-03-22 中国建设银行股份有限公司 Information recommendation method and device based on multi-modal feature fusion and processor

Also Published As

Publication number Publication date
CN115022732A (en) 2022-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant