CN113709529B - Video synthesis method, device, electronic equipment and computer readable medium

Video synthesis method, device, electronic equipment and computer readable medium

Info

Publication number
CN113709529B
CN113709529B (application CN202110396622.0A)
Authority
CN
China
Prior art keywords
video
target
keyword
behavior data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110396622.0A
Other languages
Chinese (zh)
Other versions
CN113709529A
Inventor
陈姿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110396622.0A
Publication of CN113709529A
Application granted
Publication of CN113709529B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N 21/2387 Stream processing in response to a playback request from an end-user, e.g. for trick-play
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N 21/4415 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/466 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N 21/4662 Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms

Abstract

Embodiments of the present disclosure provide a video compositing method, apparatus, electronic device, and computer-readable medium. The method comprises the following steps: acquiring a video composition request of a first video, wherein the video composition request comprises a target keyword; determining at least one target behavior data matching the target keyword; determining, in the first video, a plurality of target video segments that match the target behavior data; and splicing the target video segments to obtain a second video. The video synthesis method of the embodiments of the disclosure can be deployed in a cloud server for parallel computation. The video synthesis method, the video synthesis device, the electronic equipment and the computer readable medium can realize rapid and automatic generation of videos, save manpower and material resources, and improve efficiency.

Description

Video synthesis method, device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the technical field of video synthesis, and in particular to a video synthesis method, a video synthesis apparatus, an electronic device, and a computer-readable medium.
Background
Network technology is now used widely in everyday life. Because short videos suit the fragmented viewing time of users, both the number of short-video viewers and the viewing frequency have grown greatly. For a complete film or television work, a highlight-collection short video can be generated for users to watch by cutting out the most engaging clips.
Because new film and television works are released continuously, short-video producers usually have to watch a complete work before intercepting its highlight segments to produce short videos. This takes a great deal of time, the labor cost is high, and the selection is strongly subjective, so it cannot satisfy the demands of all kinds of users.
Accordingly, there is a need for a new video compositing method, apparatus, electronic device, and computer readable medium.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the disclosure provides a video synthesis method, a video synthesis device, electronic equipment and a computer readable medium, so that automatic generation of videos is realized at least to a certain extent, manpower and material resources are saved, and efficiency is improved.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
The embodiment of the disclosure provides a video synthesis method, which comprises the following steps: acquiring a video composition request of a first video, wherein the video composition request comprises a target keyword; determining at least one target behavioral data matching the target keyword; determining a plurality of target video segments in the first video, which are matched with the target behavior data; and splicing the target video clips to obtain a second video.
The embodiment of the disclosure provides a video synthesizing device, which comprises: the request receiving module is configured to acquire a video composition request of a first video, wherein the video composition request comprises a target keyword; a behavior matching module configured to determine at least one target behavior data matching the target keyword; a behavior recognition module configured to determine a plurality of target video clips in the first video that match the target behavior data; and the video synthesis module is configured to splice the plurality of target video clips to obtain a second video.
In one exemplary embodiment of the present disclosure, the behavior recognition module includes: the video segment identification unit is configured to process the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video and the matching degree of each video segment to be sequenced and the target behavior data; the sorting value determining unit is configured to determine a target sorting value of each sorted video fragment according to the matching degree of each video fragment to be sorted; and the video segment sorting unit is configured to sort the plurality of video segments to be sorted according to the target sorting value, determine m video segments to be sorted which are ranked m before as a plurality of target video segments matched with the target behavior data, and m is an integer greater than 0.
In one exemplary embodiment of the present disclosure, the ranking value determining unit includes: the pop-up comment number subunit is configured to obtain the pop-up comment number of each video clip to be sequenced; the playing heat subunit is configured to obtain the playing heat of each video segment to be sequenced according to the pop-up comment number of each video segment to be sequenced; and the sequencing value determining subunit is configured to perform weighted calculation on the matching degree of the plurality of video segments to be sequenced and the playing heat of the video segments to be sequenced to obtain target sequencing values of the plurality of video segments to be sequenced.
In one exemplary embodiment of the present disclosure, the target behavior data includes expression data; wherein the video clip identifying unit includes: the face detection subunit is configured to perform face detection on the video frames in the first video to obtain faces to be identified; the feature extraction subunit is configured to perform feature extraction on the face to be identified to obtain the feature of the face to be identified; the expression similarity calculation subunit is configured to process the face features to be identified and the expression data through an expression identification model to obtain expression similarity of the face features to be identified and the expression data; the first video segment identification subunit is configured to determine a video frame where the face feature to be identified with the expression similarity greater than the expression similarity threshold is located as the plurality of video segments to be sequenced.
In one exemplary embodiment of the present disclosure, the target behavior data includes action data; the video clip identifying unit includes: a person detection subunit configured to perform person detection on video frames in the first video, and obtain a plurality of video frame sequences according to a detection result; the motion characteristic recognition subunit is configured to process the plurality of video frame sequences and the motion data through a motion recognition model to obtain motion characteristics of each video frame sequence; a motion similarity calculation subunit configured to determine motion characteristics of each video frame sequence and motion similarity of the motion data; and the second video segment identification subunit is configured to determine a video frame sequence with the action similarity larger than the action similarity threshold as the plurality of video segments to be sequenced.
In an exemplary embodiment of the present disclosure, the video compositing apparatus further includes: the keyword acquisition module is configured to acquire the occurrence times of keyword information of the first video, wherein the keyword information comprises one or more of pop-up comment information, static comment information and tag information; the keyword ordering module is configured to order the keyword information according to the occurrence times; a keyword classification module configured to classify the keyword information of the k before ranking to divide the keyword information of the k before ranking into at least one keyword set, k being an integer greater than 0; and the behavior matching module is configured to determine behavior data corresponding to the first video according to each keyword set.
In one exemplary embodiment of the present disclosure, the behavior matching module includes: a keyword similarity unit configured to calculate keyword similarity of the target keyword and each keyword information in the keyword set; a keyword set determining unit configured to determine a keyword set having the greatest keyword similarity as a target keyword set corresponding to the first video; and the behavior matching unit is configured to determine the behavior data corresponding to the target keyword set as a plurality of target behavior data matched with the target keyword.
In one exemplary embodiment of the present disclosure, the keyword classification module includes: a word vector representation unit configured to obtain keyword vectors of the keyword information of the top k of the rank; and the keyword classification unit is configured to divide the keyword information with the similarity between the keyword vectors smaller than a word vector similarity threshold value into the same set to obtain the at least one keyword set.
In one exemplary embodiment of the present disclosure, the behavior matching module includes: the behavior matching value determining unit is configured to process the keyword information in each keyword set to obtain a behavior matching value of each keyword information and each behavior data to be matched; the keyword and behavior matching unit is configured to determine behavior data to be matched with the maximum behavior matching value as behavior data corresponding to the keyword information; and the behavior matching unit is configured to determine the behavior data corresponding to the keyword information in the keyword set as the behavior data corresponding to the first video.
An embodiment of the present disclosure proposes an electronic device including: at least one processor; and a storage means for storing at least one program which, when executed by the at least one processor, causes the at least one processor to implement the video compositing method as described in the above embodiments.
The embodiments of the present disclosure propose a computer-readable medium, on which a computer program is stored, which when being executed by a processor implements the video compositing method as described in the above embodiments.
In some embodiments of the present disclosure, when a video composition request for a first video is received, at least one target behavior data matching the video composition request is determined based on a target keyword included in the video composition request, and a target video clip matching the target behavior data is intercepted from the first video based on the at least one target behavior data, and a composite second video is obtained by stitching the target video clip. Because the target video segments in the second video are matched with the target behavior data and the target behavior data are matched with the target keywords, the method can ensure that the target video segments highly related to the target keywords are intercepted from the first video based on the target keywords, and the second video highly matched with the target keywords is generated based on the target video segments, so that the rapid and automatic generation of the video is realized, the manpower and material resources are saved, and the efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the video compositing methods or apparatuses of embodiments of the present disclosure may be applied.
Fig. 2 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
Fig. 3 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
Fig. 4 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
Fig. 5 is a timing diagram of a method of acquiring video clips to be ordered according to a video composition method of an embodiment of the present disclosure.
Fig. 6 is a flowchart of a method of acquiring video clips to be ordered according to a video composition method of an embodiment of the present disclosure.
Fig. 7 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
Fig. 8 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
Fig. 9 schematically shows a block diagram of a video compositing apparatus according to an embodiment of the disclosure.
Fig. 10 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Fig. 11 schematically illustrates a presentation schematic of a second video in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in at least one hardware module or integrated circuit, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the video compositing methods or apparatuses of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, portable computers, desktop computers, wearable devices, virtual reality devices, smart homes, etc.
The server 105 may be a server providing various services. For example, the server 105 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In the embodiment of the present disclosure, the terminal device 103 (may also be the terminal device 101 or 102) may upload the video composition request to the server 105. The server 105 may obtain a video composition request for the first video, the video composition request including a target keyword; determining at least one target behavioral data matching the target keyword; determining a plurality of target video segments in the first video, which are matched with the target behavior data; and splicing the target video clips to obtain a second video. And feeds the second video back to the terminal device 103, so that the terminal device 103 can display the second video or perform subsequent operations such as auditing, online, and the like. Fig. 11 schematically illustrates a presentation schematic of a second video in an embodiment of the present disclosure. As shown in fig. 11, a cover of the second video 1110 may be displayed, for example, on a screen of the terminal device 103, and played after receiving a click operation of the second video 1110 by the user.
Fig. 2 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure. The method provided in the embodiments of the present disclosure may be processed by any electronic device having computing processing capability, for example, the server 105 and/or the terminal devices 102 and 103 in the embodiment of fig. 1, and in the following embodiments, the server 105 is taken as an example to illustrate the execution subject, but the present disclosure is not limited thereto.
As shown in fig. 2, the video compositing method provided by the embodiment of the disclosure may include the following steps.
In step S210, a video composition request of the first video is acquired, the video composition request including a target keyword.
In the embodiment of the disclosure, the first video is an original video from which video clips are captured to make a new video, and may be, for example, a complete film or television work. The video composition request may be generated by an operation of a worker responsible for video composition, and the target keyword in the video composition request may be entered by the worker through an input device. Preferably, the video composition request may instead be initiated by an execution statement in a preset program, and the target keyword in the video composition request may be generated by extracting keywords from historical comment information, tag information, and the like of the first video. The historical comment information may be, for example, pop-up comment information during playback of the first video, static comment information, topic comments in a virtual forum, and the like.
In step S220, at least one target behavior data matching the target keyword is determined.
In the disclosed embodiment, the at least one target behavior data that matches the target keyword is a generalization of the behaviors that are likely to occur in the situation indicated by the target keyword. For example, when the target keyword is "sweet", the target behavior data matching the target keyword is, for example, "hug", "shoulder rest", or the like. In a preferred embodiment, the behavior data corresponding to each kind of keyword information may be stored in advance, for example in the form of a mapping table. In this mapping table (hereinafter referred to as the behavior list database), keyword information of the same kind belongs to one keyword set, and each kind of keyword may correspond to one or more pieces of behavior data. After the target keyword is determined, it may be matched against the keyword information in each keyword set, the successfully matched keyword set is determined as the target keyword set, and the at least one behavior data corresponding to that keyword set is determined as the at least one target behavior data matched with the target keyword.
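By way of a purely illustrative, non-limiting sketch of the behavior list database described above (the variable and function names, the example keywords, and the behavior data are assumptions of this description rather than part of the disclosed embodiments), the mapping from keyword sets to behavior data and the matching of a target keyword could look as follows:

```python
# Illustrative sketch only: the keyword sets and behavior data below are assumed examples.
BEHAVIOR_LIST = {
    # keyword set (same-type keyword information)  ->  behavior data
    ("sweet", "sugar-spread", "dog food"): ["hug", "shoulder rest", "smile"],
    ("funny", "joke"): ["laugh", "exaggerated gesture"],
    ("sad", "tearful"): ["cry", "lower head"],
}

def match_target_keyword(target_keyword):
    """Return the behavior data of the keyword set that the target keyword belongs to."""
    for keyword_set, behavior_data in BEHAVIOR_LIST.items():
        if target_keyword in keyword_set:   # exact match; similarity matching is also possible
            return list(behavior_data)
    return []

print(match_target_keyword("sweet"))        # -> ['hug', 'shoulder rest', 'smile']
```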
In step S230, a plurality of target video clips in the first video that match the target behavior data are determined.
In the embodiment of the disclosure, the plurality of target video clips are video clips intercepted in the first video. The video frames in the first video can be classified and identified through a machine learning model, so that the video frames with the matching degree with the target behavior data higher than a certain threshold value in the first video are obtained, and the video frames are integrated into a plurality of target video fragments matched with the target behavior data.
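As a minimal, non-limiting sketch of how per-frame matching degrees could be integrated into video segments (the threshold, frame rate, and function name are assumptions of this description; a deployed system would obtain the scores from the machine learning model mentioned above):

```python
def frames_to_segments(match_scores, threshold=0.8, fps=25.0):
    """Group consecutive video frames whose matching degree with the target
    behavior data exceeds the threshold into (start_second, end_second) segments."""
    segments, start = [], None
    for idx, score in enumerate(match_scores):
        if score > threshold and start is None:
            start = idx                                  # a matching segment begins
        elif score <= threshold and start is not None:
            segments.append((start / fps, idx / fps))    # the segment ends
            start = None
    if start is not None:                                # segment runs to the last frame
        segments.append((start / fps, len(match_scores) / fps))
    return segments

# Frames 2-4 exceed the threshold, so a single segment of 0.08 s - 0.2 s is returned.
print(frames_to_segments([0.1, 0.2, 0.9, 0.95, 0.85, 0.3]))
```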
In step S240, the plurality of target video clips are spliced to obtain a second video.
In the embodiment of the disclosure, the target video segment is a video segment intercepted in the first video, and the synthesized second video can be obtained by splicing a plurality of target video segments.
According to the video synthesis method provided by the embodiment of the disclosure, when a video synthesis request for a first video is received, at least one target behavior data matched with the video synthesis request is determined based on a target keyword included in the video synthesis request, a target video fragment matched with the target behavior data is intercepted from the first video based on the at least one target behavior data, and a synthesized second video is obtained through splicing the target video fragments. Because the target video segments in the second video are matched with the target behavior data and the target behavior data are matched with the target keywords, the method can ensure that the target video segments highly related to the target keywords are intercepted from the first video based on the target keywords, and the second video highly matched with the target keywords is generated based on the target video segments, so that the rapid and automatic generation of the video is realized, the manpower and material resources are saved, and the efficiency is improved.
Fig. 3 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
As shown in fig. 3, the video compositing method based on the above embodiment may further include the following steps.
In step S310, the number of occurrences of keyword information of the first video is obtained, the keyword information including one or more of pop-up comment information, still comment information, and tag information.
In the embodiment of the disclosure, the pop-up comment information may be a bullet screen input in a playing screen by a viewer user in a video playing process. The static comment information can be comment characters input by a viewer user in a static comment area in a video play page. The tag information may be tag information of the first video itself, for example. For example, when the first video is a television show, the tag information of the first video may be, for example, a type of the television show, such as comedy, suspense, action, and the like.
In step S320, the keyword information is ordered according to the number of occurrences.
In the embodiment of the disclosure, the keyword information may be ranked from large to small.
In step S330, the keyword information of the top k, k being an integer greater than 0, is classified to divide the keyword information of the top k into at least one keyword set.
In the embodiment of the disclosure, keyword vectors of the top-k keyword information may be obtained, and keyword information whose keyword vectors satisfy the word-vector similarity threshold (that is, whose vectors are sufficiently close to one another) is divided into the same set, thereby obtaining at least one keyword set. The similarity may be obtained by calculating cosine similarity. Each piece of keyword information is first converted into a keyword vector through an embedded representation; for any two pieces of keyword information, their similarity is then evaluated by calculating the cosine of the angle between their keyword vectors.
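For illustration only, grouping keyword information into keyword sets by the cosine similarity of keyword vectors could be sketched as follows; the two-dimensional toy vectors and the similarity threshold are assumptions, since the disclosure does not fix a particular embedding model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two keyword vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_keywords(keyword_vectors, sim_threshold=0.9):
    """Greedily place each keyword into the first existing keyword set whose
    representative keyword is close enough; otherwise open a new set."""
    keyword_sets = []
    for word, vec in keyword_vectors.items():
        for kw_set in keyword_sets:
            if cosine_similarity(vec, keyword_vectors[kw_set[0]]) >= sim_threshold:
                kw_set.append(word)
                break
        else:
            keyword_sets.append([word])
    return keyword_sets

# Toy 2-D "embeddings"; a real system would use a trained word-embedding model.
vectors = {"sweet": (0.9, 0.1), "sugar-spread": (0.88, 0.15), "funny": (0.1, 0.95)}
print(group_keywords(vectors))              # -> [['sweet', 'sugar-spread'], ['funny']]
```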
In step S340, behavior data corresponding to the first video is determined according to each keyword set.
In embodiments of the present disclosure, each keyword set may correspond to a plurality of behavioral data. Such as the behavior list database referred to in the previous embodiments. For each keyword set, at least one behavior data may be corresponding. For example, the keyword information "sugar-spread", "dog food", "sweet" is of the same type (i.e., the same keyword set), which is recorded as the same type in the behavior list database. Behavior data corresponding to the keyword set is behavior data corresponding to the type in a behavior list database. When determining the behavior data corresponding to the first video according to each keyword set, processing the keyword information in each keyword set (for example, processing through a machine learning model) to obtain a behavior matching value of each keyword information and each behavior data to be matched; determining behavior data to be matched with the maximum behavior matching value as behavior data corresponding to the keyword information; and determining the behavior data corresponding to the keyword information in the keyword set as the behavior data corresponding to the first video.
In step S350, a video composition request of the first video is acquired, the video composition request including a target keyword.
In the embodiment of the disclosure, the target keyword may be generated according to manual input information, and may also be obtained according to interaction data of the audience user of the first video in a specified time range. The interaction data may be, for example, one or more of praise data, concentration, and interaction frequency for comment information (pop-up comment information and/or static comment information). Keywords can be extracted according to the interaction data and ranked according to the interaction frequency, and the keyword with the highest frequency is determined to be the target keyword.
In step S360, keyword similarity of the target keyword and each keyword information in the keyword set is calculated.
In the embodiment of the disclosure, the keyword set may be a keyword set recorded in a behavior list database. Keyword similarity may be characterized by cosine similarity. For example, each keyword information in the target keyword and the keyword set may be expressed in terms of a word vector through embedding, and then cosine values between the word vectors are used as the keyword similarity.
In step S370, the keyword set having the greatest keyword similarity is determined as the target keyword set corresponding to the first video.
In the embodiment of the present disclosure, the keyword similarity of a keyword set may be the average or the sum of the keyword similarities of the individual keyword information in the set, which is not particularly limited in the embodiment of the present disclosure. Among all keyword sets, the keyword set with the maximum keyword similarity is determined as the target keyword set corresponding to the first video.
In step S380, the behavior data corresponding to the target keyword set is determined as a plurality of target behavior data matching the target keywords.
In the embodiment of the disclosure, the behavior data corresponding to the target keyword set can be determined through the behavior list database.
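A minimal sketch of steps S360 to S380, assuming keyword vectors are available and taking the mean keyword similarity as the similarity of a keyword set (the function names are illustrative and not part of the disclosure):

```python
import math

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / den if den else 0.0

def select_target_behaviors(target_vec, keyword_sets, keyword_vectors, behavior_list):
    """Steps S360-S380: pick the keyword set whose mean similarity to the target
    keyword is greatest and return its pre-computed behavior data."""
    scored = []
    for idx, kw_set in enumerate(keyword_sets):
        sims = [_cos(target_vec, keyword_vectors[w]) for w in kw_set]
        scored.append((sum(sims) / len(sims), idx))
    _, best = max(scored)                    # keyword set with the greatest keyword similarity
    return behavior_list[best]               # -> the plurality of target behavior data
```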
In step S390, a plurality of target video clips in the first video that match the target behavior data are determined.
In the embodiment of the disclosure, the video frames in the first video may be processed through a machine learning model to obtain the matching degree of each video frame and the target behavior data, and then the video segment where the video frame with the matching degree greater than the matching degree threshold is located is determined as the target video segment matched with the target behavior data.
In step S395, the plurality of target video clips are spliced to obtain a second video.
Step S395 of the embodiments of the present disclosure may take steps similar to step S240, and is not repeated herein.
In an embodiment of the present disclosure, the execution body of the method may further include a short video module. After the second video is obtained, the second video may be pushed to the short video module according to the video name and the target keyword of the first video. For example, the video name of the first video, the target keyword, and the second video may be packaged according to a predetermined format, and the packaged data may be pushed to the short video module. After the short video module parses the packaged data according to the predetermined format, it displays the video name of the first video and the target keyword, and plays the second video after receiving a click operation of the user on them.
In this embodiment, keyword information having a certain correlation with the first video can be obtained by screening by counting and sorting the keyword information of the first video, and behavior data corresponding to the first video can be determined in advance based on the similarity between the keyword information and the behavior data in each keyword set by classifying the keyword information of the k before ranking and dividing it into at least one keyword set. When a video composition request is received, the similarity between the target keywords included in the video composition request and the keyword information in each keyword set is calculated, the target keyword set most relevant to the target keywords is determined based on the similarity, and then a plurality of target behavior data matched with the target keywords are determined according to the behavior data corresponding to each keyword set obtained in advance, so that the rapid automatic generation of the subsequent videos is realized, manpower and material resources are saved, and the efficiency is improved.
Fig. 4 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
As shown in fig. 4, the video compositing method of an embodiment of the present disclosure may include the following steps.
In step S410, a video composition request of the first video is acquired, the video composition request including a target keyword.
Step S410 of the embodiments of the present disclosure may take steps similar to step S210, and will not be described here.
In step S420, at least one target behavior data matching the target keyword is determined.
Step S420 of the embodiments of the present disclosure may take steps similar to step S220 or steps S360-S380, which are not described herein.
In step S430, the first video and the target behavior data are processed to obtain a plurality of video segments to be ordered in the first video, and a matching degree between each video segment to be ordered and the target behavior data.
In the embodiment of the disclosure, the video frames in the first video can be classified and identified through a machine learning model, so that the video frames with the matching degree with the target behavior data higher than a certain threshold in the first video are obtained and integrated into a plurality of target video fragments matched with the target behavior data.
In step S440, a target ranking value of each ranked video clip is determined according to the matching degree of each video clip to be ranked.
In the embodiment of the present disclosure, the matching degree of each video segment to be sequenced may be used as a target sequencing value of the video segments to be sequenced, or the matching degree may be quantized to obtain the target sequencing value, which is not particularly limited in the present disclosure.
In an exemplary embodiment, a popup comment count for each video clip to be ranked may be obtained; obtaining the playing heat of each video segment to be sequenced according to the pop-up comment number of each video segment to be sequenced; and carrying out weighted calculation on the matching degree of the plurality of video clips to be sequenced and the playing heat of the video clips to be sequenced to obtain target sequencing values of the plurality of video clips to be sequenced.
When the playing heat of the video clips to be ranked is obtained, the number of pop-up comments in each video clip to be ranked may be determined directly as its playing heat. As another example, the number of times each video clip to be ranked has been played on the playing platform may also be considered, and a weighted value (or a mean value or a maximum value) of the play count and the pop-up comment count may be taken as the playing heat of each video clip to be ranked. The foregoing are merely examples, and the specific manner of generating the playing heat is not particularly limited in the present disclosure.
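As a non-limiting illustration, the weighted calculation of a target ranking value from the matching degree and the playing heat might look as follows; the weights and the simple blending of the pop-up comment count with the play count are assumptions of this description:

```python
def target_ranking_value(match_degree, popup_comments, play_count=0,
                         w_match=0.7, w_heat=0.3):
    """Weighted combination of the matching degree and the playing heat of one
    video clip to be ranked; the weights are illustrative assumptions."""
    # Playing heat: the pop-up comment count, optionally blended with the play count.
    play_heat = popup_comments if play_count == 0 else 0.5 * (popup_comments + play_count)
    # In practice the playing heat would typically be normalised before weighting.
    return w_match * match_degree + w_heat * play_heat
```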
In step S450, the video segments to be ranked are ranked according to the target ranking value, and the m top m video segments to be ranked are determined as target video segments matching the target behavior data, where m is an integer greater than 0.
In the embodiment of the disclosure, the value of m may be determined according to practical situations. For example, the value of m may satisfy the following condition: the sum of the durations of the first m video clips to be sequenced is within a preset duration range (e.g., greater than the first duration and less than the second duration).
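A sketch of choosing the top-m clips so that their total duration stays within the preset range; the duration bounds and the dictionary fields are placeholders:

```python
def select_top_clips(clips, min_seconds=60, max_seconds=300):
    """Sort the clips by target ranking value (descending) and keep the first m
    clips whose total duration does not exceed the preset upper bound."""
    clips = sorted(clips, key=lambda c: c["ranking_value"], reverse=True)
    selected, total = [], 0.0
    for clip in clips:
        if total + clip["duration"] > max_seconds:
            break
        selected.append(clip)
        total += clip["duration"]
    # In practice one would also verify that total >= min_seconds before compositing.
    return selected
```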
In the embodiment, the first video is processed based on the target behavior data matched with the target keywords as a guide, so that the video segments to be sequenced in the first video with higher matching degree with the target behavior data can be obtained, matching of the target keywords and the video segments to be sequenced can be achieved, and the second video can be conveniently generated based on the video segments to be sequenced which are highly matched with the target keywords.
In an exemplary embodiment, the target behavior data may include expression data, and in step S430, face detection may be performed on a video frame in the first video to obtain a face to be recognized; extracting features of the face to be identified to obtain features of the face to be identified; processing the facial features to be identified and the expression data through the expression identification model to obtain the expression similarity of the facial features to be identified and the expression data; and determining the video frames in which the facial features to be identified with the expression similarity larger than the expression similarity threshold are located as a plurality of video segments to be sequenced. The expression recognition model may be a neural network model. The expression similarity of the face features to be identified and the expression data represents the similarity degree of the character expression of the face features to be identified and the expression indicated by the expression data.
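A high-level, non-limiting sketch of the expression branch of step S430; the face detector, feature extractor, and expression recognition model are passed in as placeholders for whatever components are actually deployed:

```python
def expression_matched_frames(video_frames, expression_data, detect_faces,
                              extract_features, expression_model, sim_threshold=0.8):
    """Return the indices of video frames whose detected faces are similar enough
    to the expression data; consecutive indices are later merged into clips."""
    matched = []
    for idx, frame in enumerate(video_frames):
        for face in detect_faces(frame):                      # face detection
            features = extract_features(face)                 # face features to be identified
            similarity = expression_model(features, expression_data)
            if similarity > sim_threshold:                    # expression similarity threshold
                matched.append(idx)
                break
    return matched
```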
Before face detection is performed on the video frames in the first video, image preprocessing may also be performed on the video frames of the first video. Fig. 5 is a timing diagram of a method of acquiring video clips to be ranked according to a video composition method of an embodiment of the present disclosure. The method for obtaining the video clips to be ranked shown in fig. 5 may be an expression-clip extraction system based on face recognition, and mainly includes: a database, a feature training module, and an expression recognition module. The database adopts the JAFFE expression database. (The JAFFE database is an open facial expression image database. It contains 10 different Japanese female subjects (KA, KL, KM, KR, MK, NA, NM, TM, UY, YM), each with facial images of 7 different expressions (AN, DI, FE, HA, NE, SA, SU): anger, disgust, fear, happiness, neutral (no expression), sadness, and surprise, with 3-4 sample images per expression, giving 213 expression images in total. The raw images are 256 x 256 pixels.)
The main idea is as follows: a standard template is established for each expression, the facial expression to be detected is matched against the standard expression templates, and the higher the matching score, the closer the face is to that expression. The expressions in the database (disgust, fear, sadness, surprise, happiness) are mostly exaggerated, so a test expression obtains a high matching degree when it is similarly exaggerated. Facial expression recognition is thus performed by detecting faces and matching them against the database.
The flow steps may be as follows.
(1) According to the action initiation request, the expressions to be extracted from a certain movie, such as shy or sad, are specified.
(2) The movie is loaded into the expression extraction system.
(3) Face detection is performed on the database images.
(4) The database images are preprocessed.
(5) Features are extracted from the database images.
(6) Face detection is performed on the video.
(7) The video face pictures are preprocessed.
(8) Features are extracted from the video face pictures.
(9) The trained classifier (namely, the expression recognition model) performs expression recognition and matching.
(10) The result is returned to the system, and the time segments of the movie in which the matching expression appears are stored to a server.
For example, the movie time point at which the matching expression appears, the person (who may be determined by face recognition), and the playing heat may be stored to the server together as a storage key.
In an exemplary embodiment, the target behavior data may include motion data, and in step S430, person detection may be performed on video frames in the first video, and a plurality of video frame sequences may be obtained according to the detection result; processing the plurality of video frame sequences and the motion data through the motion recognition model to obtain motion characteristics of each video frame sequence; determining action characteristics of each video frame sequence and action similarity of action data; and determining the video frame sequence with the motion similarity larger than the motion similarity threshold as a plurality of video segments to be sequenced. In the video compositing method provided by the present disclosure, the target behavior data may include one or both of expression data and motion data, which is not particularly limited by the present disclosure. The motion recognition model may be a neural network model, such as a three-dimensional convolutional deep neural network model, but the technical solution of the present disclosure is not particularly limited thereto. The motion recognition model may be used to recognize a sequence of video frames and output motion features that characterize the motion behavior of a person in the sequence of video frames. The motion characteristics of the video frame sequence and the motion similarity of the motion data are indicative of the degree of similarity of the person's motion in the video frame sequence and the motion indicated by the motion data. The action similarity can be calculated and obtained according to a cosine similarity calculation mode.
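Similarly, a minimal sketch of the action branch of step S430, assuming an action recognition model that maps a video frame sequence to an action feature vector and an embedding of the action data into the same feature space (both are placeholders):

```python
import math

def action_matched_sequences(frame_sequences, action_data, action_model,
                             embed_action, sim_threshold=0.8):
    """Keep the video frame sequences whose action features are similar enough
    to the action data (e.g. 'hug'); similarity is computed as a cosine value."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / den if den else 0.0

    target_feature = embed_action(action_data)         # feature of the behavior to match
    kept = []
    for seq in frame_sequences:
        feature = action_model(seq)                     # e.g. a 3-D convolutional network
        if cos(feature, target_feature) > sim_threshold:
            kept.append(seq)
    return kept
```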
Fig. 6 is a flowchart of a method for acquiring video clips to be sequenced according to a video synthesizing method according to an embodiment of the present disclosure. As shown in fig. 6, the flow of processing the first video by the motion recognition model is as follows.
(1) Data loading. The data loading includes dividing the data set into a training set, a validation set, and a test set, and shuffling the data.
(2) Network construction. The network construction includes network design, network parameter initialization, and a design for preventing overfitting. The network constructed in the disclosed embodiments may be a deep neural network model based on three-dimensional convolution.
(3) Defining the classification function and the loss function (Loss). Action recognition is a multi-class classification task, and the most commonly used classifier is Softmax regression; its principle is to accumulate the features that support each class and then convert these scores into a class decision. The loss describes the classification accuracy of the model on the problem, i.e., the deviation between the classification result and the true value; training continuously reduces this deviation so as to reach a global or local optimum. Common loss functions include the mean square error (MSE), the hinge loss, and the cross-entropy loss. The loss most commonly used with Softmax regression is the cross-entropy, i.e., the negative sum over classes of the one-hot encoded true probability distribution multiplied by the logarithm of the predicted probability distribution (a minimal implementation is sketched after these steps).
(4) Defining the optimizer. The Adam optimization algorithm is used as the default optimization algorithm. Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process and can iteratively update the neural network weights based on the training data.
(5) Training and verification. In this process, data is fed to the model iteratively in units of mini-batches; the gradients are calculated, the learnable parameters are updated, and the current accuracy and loss are returned. Verification is performed on a validation set at intervals to evaluate the prediction performance of the model during the training stage. The data is generally divided by a k-fold method; after each epoch or each mini-batch of training, a forward pass of the network is run on the training set and the validation set, the sample labels of both sets are predicted, and a learning curve is drawn to test the generalization capability of the model. The validation set may overlap with the training data, but the test set must be completely separate. (An outline of this loop is also sketched after these steps.)
(6) Testing. Finally, the results are calculated and recorded. The movie time point at which the matching action appears, the person (who can be determined through face recognition), the action data (such as tumbling), and the playing heat are stored to a server as a storage key. The stored playing heat may later be used in the weighted calculation that yields the target ranking values of the video clips to be ranked.
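For illustration, a framework-independent sketch of the Softmax and cross-entropy loss mentioned in step (3), followed by an outline of the training/validation loop of step (5); the model and optimizer interfaces in the second function are hypothetical placeholders rather than the API of any particular library:

```python
import math

def softmax(logits):
    """Convert class scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]           # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, true_class):
    """Cross-entropy against a one-hot target: minus the log-probability of the true class."""
    return -math.log(probs[true_class] + 1e-12)

probs = softmax([2.0, 0.5, 0.1])
print(cross_entropy(probs, 0))   # small loss when the true class receives a high probability

def train(model, optimizer, train_batches, val_batches, epochs=10):
    """Outline of the training/validation process: iterate over mini-batches,
    update the weights, and validate after each epoch to draw a learning curve."""
    learning_curve = []
    for _ in range(epochs):
        for batch in train_batches:                       # mini-batch iteration
            loss, grads = model.forward_backward(batch)   # hypothetical model interface
            optimizer.step(grads)                         # e.g. an Adam update
        correct = total = 0
        for batch in val_batches:                         # periodic validation
            c, n = model.evaluate(batch)                  # hypothetical model interface
            correct, total = correct + c, total + n
        learning_curve.append(correct / max(total, 1))
    return learning_curve
```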
Fig. 7 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
As shown in fig. 7, the video compositing method of an embodiment of the present disclosure may include the following steps.
In step S710, keyword information of a first video is retrieved.
In the embodiment of the present disclosure, the retrieval of the keyword information may take steps similar to step S310, and will not be described here.
In step S720, a plurality of behavior data corresponding to the keyword information is determined.
In step S730, a video clip corresponding to each behavior data in the first video is identified and obtained.
In step S740, the behavior data matched with the target keyword is determined as target behavior data, a target video segment is determined based on the target behavior data, and the target video segment is spliced to obtain the second video. The video clips meeting the conditions shown in fig. 7 are target video clips obtained by matching target keywords.
In step S750, a video title of the second video is generated from the video title of the first video and the target keyword.
In the embodiment of the disclosure, video content is intercepted automatically based on users' viewing heat and recognition of the video content type, and text understood from the video is used as the title. For example, for a costume romance work (i.e., the first video), a key piece of information is "sweet", and the corresponding behavior data are hugging, leaning on a shoulder, and so on. The target video clips corresponding to the behavior data are searched for in the first video, and the second video is finally output by splicing. Because the video title of the second video is generated from the video title of the first video and the target keyword, the second video can be retrieved by that keyword. The scheme can greatly reduce the operational labor cost of video synthesis.
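A trivial sketch of the title rule described above, concatenating the video title of the first video with the target keyword; the separator and the example title are assumptions of this description:

```python
def second_video_title(first_video_title, target_keyword):
    """Title of the composite (second) video: the first video's title spliced with
    the target keyword, so the clip collection can be retrieved by that keyword."""
    return f"{first_video_title} | {target_keyword}"

print(second_video_title("A Costume Romance", "sweet"))   # -> 'A Costume Romance | sweet'
```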
Fig. 8 schematically illustrates a flow chart of a video compositing method according to an embodiment of the disclosure.
As shown in fig. 8, the video compositing method of the present embodiment may include the following steps.
In step 1, keyword information of the first video, such as funny, tearful, and sweet, is requested according to the pop-up comment information (barrage), the static comment information, and the video tags of the first video.
In step 2, the top k keyword information is returned.
In step 3, semantic understanding is performed on the keyword information, so that keyword information of the same type, such as sugar-spreading, dog food, and sweet, can be merged according to the semantic understanding result. Different types of videos are then synthesized according to the ranking of the different types.
In step 4, a video composition request of the first video is obtained, the video composition request including a target keyword.
In step 5, behavior data including motion data and expression data is determined from the target keyword. For example, the target keyword is sweet, and corresponding behavior data (such as actions: hug, shoulder rest and expression: smile) are matched in the behavior list database.
In step 6, the corresponding video clips are identified and stored by means of the expression recognition model and the action recognition model, the playing heat is calculated from the pop-up comment information, and the video clips are ranked and screened according to the playing heat.
The specific recognition processes of the expression recognition model and the action recognition model in the embodiments of the present disclosure may refer to the foregoing embodiments and are not repeated here.
In step 7 (not shown in fig. 8), the video clips obtained by the screening are spliced to obtain a second video.
The video title of the second video is generated by splicing the video title of the first video with the target keyword.
The following describes embodiments of the apparatus of the present disclosure that may be used to perform the video compositing methods described above of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the video compositing method described above in the present disclosure.
Fig. 9 schematically shows a block diagram of a video compositing apparatus according to an embodiment of the disclosure.
Referring to fig. 9, a video compositing apparatus 900 according to an embodiment of the disclosure may include: a request receiving module 910, a behavior matching module 920, a behavior recognition module 930, and a video compositing module 940.
The request receiving module 910 may be configured to obtain a video composition request of the first video, the video composition request including the target keyword.
The behavior matching module 920 may be configured to determine at least one target behavior data that matches the target keyword.
The behavior recognition module 930 may be configured to determine a plurality of target video clips in the first video that match the target behavior data.
The video composition module 940 may be configured to splice a plurality of target video clips to obtain a second video.
The video composing apparatus provided by the embodiment of the disclosure, when receiving a video composing request for a first video, determines at least one target behavior data matching the target keyword included in the video composing request based on the target keyword, and intercepts a target video clip matching the target behavior data from the first video based on the at least one target behavior data, and obtains a composed second video by splicing the target video clips. Because the target video segments in the second video are matched with the target behavior data and the target behavior data are matched with the target keywords, the method can ensure that the target video segments highly related to the target keywords are intercepted from the first video based on the target keywords, and the second video highly matched with the target keywords is generated based on the target video segments, so that the rapid and automatic generation of the video is realized, the manpower and material resources are saved, and the efficiency is improved.
In an exemplary embodiment, the behavior recognition module 930 may include: the video segment identification unit can be configured to process the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video and the matching degree of each video segment to be sequenced and the target behavior data; the sorting value determining unit can be configured to determine a target sorting value of each sorted video segment according to the matching degree of each video segment to be sorted; a video clip ordering unit configured to order the plurality of video clips to be ordered according to the target ordering value, and determining the m top ranked video clips to be ranked as a plurality of target video clips matched with the target behavior data, wherein m is an integer greater than 0.
In an exemplary embodiment, the sequencing value determining unit may include: a pop-up comment number subunit, configured to obtain the pop-up comment number of each video segment to be sequenced; a playing heat subunit, configured to obtain the playing heat of each video segment to be sequenced according to its pop-up comment number; and a sequencing value determining subunit, configured to perform a weighted calculation on the matching degree and the playing heat of each video segment to be sequenced to obtain the target sequencing values of the plurality of video segments to be sequenced.
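A minimal sketch of this sequencing step follows: each candidate clip's matching degree is combined with a play-heat score derived from its pop-up comment count, and the top m clips are kept. The weights, the normalisation by the maximum comment count, and the data layout are assumptions, not values taken from the disclosure.

def rank_clips(candidates, m, w_match=0.7, w_heat=0.3):
    # candidates: list of dicts with 'clip', 'match' (matching degree in [0, 1])
    # and 'danmaku' (pop-up comment count).
    max_danmaku = max(c["danmaku"] for c in candidates) or 1
    for c in candidates:
        heat = c["danmaku"] / max_danmaku                    # playing heat
        c["score"] = w_match * c["match"] + w_heat * heat    # target sequencing value
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [c["clip"] for c in ranked[:m]]


# Example
clips = [
    {"clip": "c1", "match": 0.92, "danmaku": 40},
    {"clip": "c2", "match": 0.81, "danmaku": 300},
    {"clip": "c3", "match": 0.60, "danmaku": 10},
]
print(rank_clips(clips, m=2))  # -> ['c2', 'c1']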
In an exemplary embodiment, the target behavior data may include expression data, and the video segment identification unit may include: a face detection subunit, configured to perform face detection on the video frames in the first video to obtain faces to be identified; a feature extraction subunit, configured to perform feature extraction on the faces to be identified to obtain face features to be identified; an expression similarity calculation subunit, configured to process the face features to be identified and the expression data through the expression recognition model to obtain the expression similarity between the face features to be identified and the expression data; and a first video segment identification subunit, configured to determine the video frames containing face features to be identified whose expression similarity is greater than an expression similarity threshold as the plurality of video segments to be sequenced.
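The expression branch can be sketched as below: detect a face in each frame, embed it, compare the embedding with an embedding of the target expression, and keep frames whose similarity exceeds a threshold. The detector, the embedder, and the threshold value are stand-ins; the disclosure does not tie the expression recognition model to any particular library.

import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def frames_matching_expression(frames, detect_face, embed_face,
                               expression_embedding, threshold=0.8):
    # Return indices of video frames whose detected face matches the target expression.
    matched = []
    for i, frame in enumerate(frames):
        face = detect_face(frame)          # face to be identified (or None)
        if face is None:
            continue
        feat = embed_face(face)            # face feature to be identified
        if cosine(feat, expression_embedding) > threshold:   # expression similarity
            matched.append(i)
    return matched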
In an exemplary embodiment, the target behavior data may include action data, and the video segment identification unit may include: a person detection subunit, configured to perform person detection on the video frames in the first video and obtain a plurality of video frame sequences according to the detection results; a motion characteristic recognition subunit, configured to process the plurality of video frame sequences and the action data through the action recognition model to obtain motion characteristics of each video frame sequence; a motion similarity calculation subunit, configured to determine the motion similarity between the motion characteristics of each video frame sequence and the action data; and a second video segment identification subunit, configured to determine the video frame sequences whose motion similarity is greater than a motion similarity threshold as the plurality of video segments to be sequenced.
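Analogously, the action branch can be sketched as follows: keep frames that contain a person, group them into short sequences, let a sequence-level model produce one feature per sequence, and keep sequences whose similarity to the target action exceeds a threshold. The fixed window length and the injected detector and model functions are assumptions made for illustration.

import numpy as np


def sequences_matching_action(frames, has_person, sequence_feature,
                              action_embedding, window=16, threshold=0.75):
    # 1. Person detection: indices of frames that contain a person.
    person_idx = [i for i, f in enumerate(frames) if has_person(f)]
    # 2. Build fixed-length video frame sequences from consecutive detections.
    sequences = [person_idx[i:i + window]
                 for i in range(0, len(person_idx) - window + 1, window)]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # 3. Motion characteristics and similarity against the target action data.
    matched = []
    for seq in sequences:
        feat = sequence_feature([frames[i] for i in seq])   # action recognition model
        if cosine(feat, action_embedding) > threshold:      # motion similarity
            matched.append(seq)
    return matched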
In an exemplary embodiment, the video compositing apparatus may further include: a keyword acquisition module, configured to acquire the occurrence counts of the keyword information of the first video, the keyword information including one or more of pop-up comment information, static comment information, and tag information; a keyword ordering module, configured to order the keyword information according to the occurrence counts; a keyword classification module, configured to classify the top k keyword information to divide it into at least one keyword set, where k is an integer greater than 0; and a behavior matching module, configured to determine the behavior data corresponding to the first video according to each keyword set.
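The keyword acquisition and ordering modules amount to counting how often each keyword appears across the pop-up comments, static comments, and tags of the first video and keeping the top k. A minimal sketch follows, with a deliberately naive whitespace tokenizer as a stand-in for a real tokenizer.

from collections import Counter


def top_k_keywords(danmaku, comments, tags, k=20):
    counter = Counter()
    for text in list(danmaku) + list(comments) + list(tags):
        for token in text.split():     # stand-in for a real tokenizer
            counter[token] += 1
    return [word for word, _ in counter.most_common(k)]


# Example
print(top_k_keywords(["so funny", "funny scene"], ["funny moment"], ["comedy"], k=3))
# -> ['funny', 'so', 'scene']  (ties broken by insertion order)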
In an exemplary embodiment, the behavior matching module 920 may include: a keyword similarity unit, configured to calculate the keyword similarity between the target keyword and each piece of keyword information in the keyword set; a keyword set determining unit, configured to determine the keyword set with the greatest keyword similarity as the target keyword set corresponding to the first video; and a behavior matching unit, configured to determine the behavior data corresponding to the target keyword set as the plurality of target behavior data matched with the target keyword.
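At request time, the behavior matching module then only has to resolve the target keyword to the closest keyword set and return the behavior data already attached to that set. A sketch under the assumption that a keyword-similarity function and a set-to-behavior map are available:

def target_behaviors(target_keyword, keyword_sets, behaviors_by_set, similarity):
    # keyword_sets: {set_id: [keywords]}; behaviors_by_set: {set_id: [behavior data]}.
    best_set, best_score = None, float("-inf")
    for set_id, keywords in keyword_sets.items():
        # Score a set by its best keyword similarity to the target keyword.
        score = max(similarity(target_keyword, kw) for kw in keywords)
        if score > best_score:
            best_set, best_score = set_id, score
    return behaviors_by_set[best_set]      # the target behavior data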
In an exemplary embodiment, the keyword classification module may include: a word vector representation unit, configured to obtain the keyword vectors of the top k keyword information; and a keyword classification unit, configured to divide keyword information whose keyword-vector similarity is smaller than a word vector similarity threshold into the same set to obtain the at least one keyword set.
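The grouping itself can be sketched as single-link clustering over keyword vectors. The disclosure phrases the criterion as vector similarity smaller than a threshold, which reads naturally if the measure is a distance (smaller means closer); the sketch therefore uses Euclidean distance, and the embedding function and threshold are assumptions.

import numpy as np


def group_keywords(keywords, embed, max_distance=0.5):
    # embed(kw) is assumed to return a NumPy word vector for the keyword.
    sets = []      # each element: (list of member keywords, list of member vectors)
    for kw in keywords:
        vec = embed(kw)
        for members, vectors in sets:
            if any(np.linalg.norm(vec - v) < max_distance for v in vectors):
                members.append(kw)
                vectors.append(vec)
                break
        else:
            sets.append(([kw], [vec]))
    return [members for members, _ in sets]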
In an exemplary embodiment, the behavior matching module may include: a behavior matching value determining unit, configured to process the keyword information in each keyword set to obtain a behavior matching value between each piece of keyword information and each piece of behavior data to be matched; a keyword and behavior matching unit, configured to determine the behavior data to be matched with the maximum behavior matching value as the behavior data corresponding to the keyword information; and a behavior matching unit, configured to determine the behavior data corresponding to the keyword information in the keyword set as the behavior data corresponding to the first video.
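Attaching behavior data to the first video can then be sketched as scoring every keyword against every candidate behavior, keeping the best-matching behavior per keyword, and pooling the results per keyword set. The scoring function match_value is a stand-in for the model the disclosure leaves unspecified.

def behaviors_for_video(keyword_sets, candidate_behaviors, match_value):
    # keyword_sets: {set_id: [keywords]}; returns {set_id: [behavior data]}.
    result = {}
    for set_id, keywords in keyword_sets.items():
        picked = []
        for kw in keywords:
            # Behavior data with the maximum behavior matching value for this keyword.
            best = max(candidate_behaviors, key=lambda b: match_value(kw, b))
            if best not in picked:
                picked.append(best)
        result[set_id] = picked
    return result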
Fig. 10 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that, the electronic device 1000 shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 10, the electronic apparatus 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When executed by the Central Processing Unit (CPU) 1001, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having at least one wire, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units and/or sub-units referred to in the embodiments of the present disclosure may be implemented in software or hardware, and the described modules and/or units and/or sub-units may be disposed in a processor. Wherein the names of the modules and/or units and/or sub-units do not in some cases constitute a limitation of the modules and/or units and/or sub-units themselves.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiments, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 2, fig. 3, fig. 4, fig. 5, fig. 6, fig. 7, or fig. 8.
It should be noted that although in the above detailed description several modules or units or sub-units of the apparatus for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units or sub-units described above may be embodied in one module or unit or sub-unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units or sub-units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method of video composition, comprising:
acquiring keyword information of a first video, counting and sorting the keyword information of the first video, and classifying the top k keyword information to divide the top k keyword information into at least one keyword set, wherein k is an integer greater than 0;
processing the keyword information in each keyword set to obtain a behavior matching value of each keyword information and each behavior data to be matched;
determining behavior data to be matched with the maximum behavior matching value as behavior data corresponding to the keyword information;
determining behavior data corresponding to the keyword information in the keyword set as behavior data corresponding to the first video;
acquiring a video composition request of the first video, wherein the video composition request comprises a target keyword;
matching the target keyword with the keyword information in each keyword set, determining the successfully matched keyword set as a target keyword set, and determining at least one piece of behavior data corresponding to the successfully matched keyword set as at least one piece of target behavior data matched with the target keyword, wherein the at least one piece of target behavior data matched with the target keyword is a generalization of behaviors that may be performed under the condition indicated by the target keyword;
determining a plurality of target video segments in the first video that are matched with the target behavior data;
and splicing the target video clips to obtain a second video.
2. The method of claim 1, wherein determining a plurality of target video segments in the first video that match the target behavior data comprises:
processing the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video and the matching degree of each video segment to be sequenced and the target behavior data;
determining a target sequencing value of each video segment to be sequenced according to the matching degree of the video segment to be sequenced;
and sequencing the plurality of video segments to be sequenced according to the target sequencing values, and determining the top m video segments to be sequenced as a plurality of target video segments matched with the target behavior data, wherein m is an integer greater than 0.
3. The method of claim 2, wherein determining the target sequencing value of each video segment to be sequenced according to the matching degree of the video segment to be sequenced comprises:
obtaining the pop-up comment number of each video segment to be sequenced;
obtaining the playing heat of each video segment to be sequenced according to the pop-up comment number of each video segment to be sequenced;
and performing a weighted calculation on the matching degree of the plurality of video segments to be sequenced and the playing heat of the video segments to be sequenced to obtain the target sequencing values of the plurality of video segments to be sequenced.
4. The method of claim 2, wherein the target behavior data comprises expression data, and processing the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video comprises:
performing face detection on the video frames in the first video to obtain faces to be recognized;
extracting the characteristics of the face to be identified to obtain the characteristics of the face to be identified;
processing the face features to be recognized and the expression data through an expression recognition model to obtain expression similarity of the face features to be recognized and the expression data;
and determining the video frames in which the facial features to be identified with the expression similarity larger than the expression similarity threshold are located as the plurality of video segments to be sequenced.
5. The method of claim 2, wherein the target behavior data comprises action data, and processing the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video comprises:
performing person detection on the video frames in the first video and obtaining a plurality of video frame sequences according to the detection results;
processing the plurality of video frame sequences and the action data through a motion recognition model to obtain motion characteristics of each video frame sequence;
determining the motion similarity between the motion characteristics of each video frame sequence and the action data;
and determining the video frame sequences with the motion similarity larger than the motion similarity threshold as the plurality of video segments to be sequenced.
6. The method of claim 1, wherein counting and ranking keyword information of the first video comprises:
obtaining the occurrence times of keyword information of the first video, wherein the keyword information comprises one or more of pop-up comment information, static comment information and label information;
and ordering the keyword information according to the occurrence times.
7. The method of claim 6, wherein matching the target keyword with the keyword information in each keyword set, and determining a successfully matched keyword set as a target keyword set, and determining at least one behavior data corresponding to the successfully matched keyword set as at least one target behavior data matched with the target keyword comprises:
calculating the keyword similarity between the target keyword and each piece of keyword information in the keyword set;
determining a keyword set with the maximum keyword similarity as a target keyword set corresponding to the first video;
and determining the behavior data corresponding to the target keyword set as a plurality of target behavior data matched with the target keyword.
8. The method of claim 6, wherein classifying the top k keyword information to divide the top k keyword information into at least one keyword set comprises:
obtaining keyword vectors of the top k keyword information;
and dividing the keyword information with the similarity between the keyword vectors smaller than a word vector similarity threshold into the same set to obtain the at least one keyword set.
9. A video compositing device, wherein the device is configured to obtain keyword information of a first video, and to count and order the keyword information of the first video; the device comprises:
a keyword classification module, configured to classify the top k keyword information to divide the top k keyword information into at least one keyword set, k being an integer greater than 0;
The device is further configured to: process the keyword information in each keyword set to obtain a behavior matching value between each piece of keyword information and each piece of behavior data to be matched; determine the behavior data to be matched with the maximum behavior matching value as the behavior data corresponding to the keyword information; and determine the behavior data corresponding to the keyword information in the keyword set as the behavior data corresponding to the first video;
the request receiving module is configured to acquire a video composition request of the first video, wherein the video composition request comprises a target keyword;
the behavior matching module is configured to match the target keyword with the keyword information in each keyword set, determine the successfully matched keyword set as a target keyword set, and determine at least one piece of behavior data corresponding to the successfully matched keyword set as at least one piece of target behavior data matched with the target keyword, wherein the at least one piece of target behavior data matched with the target keyword is a generalization of behaviors that may be performed under the condition indicated by the target keyword;
a behavior recognition module configured to determine a plurality of target video clips in the first video that match the target behavior data;
And the video synthesis module is configured to splice the plurality of target video clips to obtain a second video.
10. The apparatus of claim 9, wherein the behavior recognition module comprises:
the video segment identification unit is configured to process the first video and the target behavior data to obtain a plurality of video segments to be sequenced in the first video and the matching degree of each video segment to be sequenced and the target behavior data;
the sequencing value determining unit is configured to determine a target sequencing value of each video segment to be sequenced according to the matching degree of the video segment to be sequenced;
and the video segment sequencing unit is configured to sequence the plurality of video segments to be sequenced according to the target sequencing values, and determine the top m video segments to be sequenced as a plurality of target video segments matched with the target behavior data, wherein m is an integer greater than 0.
11. The apparatus of claim 10, wherein the sequencing value determining unit comprises:
the pop-up comment number subunit is configured to obtain the pop-up comment number of each video clip to be sequenced;
the playing heat subunit is configured to obtain the playing heat of each video segment to be sequenced according to the pop-up comment number of each video segment to be sequenced;
And the sequencing value determining subunit is configured to perform weighted calculation on the matching degree of the plurality of video segments to be sequenced and the playing heat of the video segments to be sequenced to obtain target sequencing values of the plurality of video segments to be sequenced.
12. The apparatus of claim 10, wherein the target behavior data comprises expression data, and the video segment identification unit comprises:
the face detection subunit is configured to perform face detection on the video frames in the first video to obtain faces to be identified;
the feature extraction subunit is configured to perform feature extraction on the face to be identified to obtain the feature of the face to be identified;
the expression similarity calculation subunit is configured to process the face features to be identified and the expression data through an expression identification model to obtain expression similarity of the face features to be identified and the expression data;
the first video segment identification subunit is configured to determine the video frames where the face features to be identified whose expression similarity is greater than the expression similarity threshold are located as the plurality of video segments to be sequenced.
13. The apparatus of claim 10, wherein the target behavior data comprises action data, and the video segment identification unit comprises:
A person detection subunit configured to perform person detection on video frames in the first video, and obtain a plurality of video frame sequences according to a detection result;
the motion characteristic recognition subunit is configured to process the plurality of video frame sequences and the action data through a motion recognition model to obtain motion characteristics of each video frame sequence;
a motion similarity calculation subunit, configured to determine the motion similarity between the motion characteristics of each video frame sequence and the action data;
and the second video segment identification subunit is configured to determine the video frame sequences whose motion similarity is greater than the motion similarity threshold as the plurality of video segments to be sequenced.
14. The apparatus as recited in claim 10, further comprising:
the keyword acquisition module is configured to acquire the occurrence times of keyword information of the first video, wherein the keyword information comprises one or more of pop-up comment information, static comment information and tag information;
and the keyword ordering module is configured to order the keyword information according to the occurrence times.
15. The apparatus of claim 14, wherein the behavior matching module comprises:
a keyword similarity unit, configured to calculate the keyword similarity between the target keyword and each piece of keyword information in the keyword set;
a keyword set determining unit configured to determine a keyword set having the greatest keyword similarity as a target keyword set corresponding to the first video;
and the behavior matching unit is configured to determine the behavior data corresponding to the target keyword set as a plurality of target behavior data matched with the target keyword.
16. The apparatus of claim 14, wherein the keyword classification module comprises:
a word vector representation unit, configured to obtain keyword vectors of the top k keyword information;
and the keyword classification unit is configured to divide the keyword information with the similarity between the keyword vectors smaller than a word vector similarity threshold value into the same set to obtain the at least one keyword set.
17. An electronic device, comprising:
at least one processor;
a storage device for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-8.
18. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.
CN202110396622.0A 2021-04-13 2021-04-13 Video synthesis method, device, electronic equipment and computer readable medium Active CN113709529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110396622.0A CN113709529B (en) 2021-04-13 2021-04-13 Video synthesis method, device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110396622.0A CN113709529B (en) 2021-04-13 2021-04-13 Video synthesis method, device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN113709529A CN113709529A (en) 2021-11-26
CN113709529B true CN113709529B (en) 2023-07-14

Family

ID=78648001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110396622.0A Active CN113709529B (en) 2021-04-13 2021-04-13 Video synthesis method, device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN113709529B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110121116A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video generation method and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10387431B2 (en) * 2015-08-24 2019-08-20 Google Llc Video recommendation based on video titles
US10970334B2 (en) * 2017-07-24 2021-04-06 International Business Machines Corporation Navigating video scenes using cognitive insights
CN108491524A (en) * 2018-03-27 2018-09-04 深圳创维-Rgb电子有限公司 Video pushing method, device and computer readable storage medium
CN109241299B (en) * 2018-09-25 2020-01-10 腾讯科技(深圳)有限公司 Multimedia resource searching method, device, storage medium and equipment
CN110619063A (en) * 2019-09-20 2019-12-27 北京字节跳动网络技术有限公司 Video pushing method and device based on video searching and electronic equipment
CN110781347B (en) * 2019-10-23 2023-03-07 腾讯科技(深圳)有限公司 Video processing method, device and equipment and readable storage medium
CN111125435B (en) * 2019-12-17 2023-08-11 北京百度网讯科技有限公司 Video tag determination method and device and computer equipment
CN111222500B (en) * 2020-04-24 2020-08-04 腾讯科技(深圳)有限公司 Label extraction method and device
CN111400553A (en) * 2020-04-26 2020-07-10 Oppo广东移动通信有限公司 Video searching method, video searching device and terminal equipment
CN111935537A (en) * 2020-06-30 2020-11-13 百度在线网络技术(北京)有限公司 Music video generation method and device, electronic equipment and storage medium
CN112004163A (en) * 2020-08-31 2020-11-27 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and storage medium
CN111931500B (en) * 2020-09-21 2023-06-23 北京百度网讯科技有限公司 Search information processing method and device

Also Published As

Publication number Publication date
CN113709529A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN108776676B (en) Information recommendation method and device, computer readable medium and electronic device
CN110582025B (en) Method and apparatus for processing video
US9471936B2 (en) Web identity to social media identity correlation
US20190377955A1 (en) Generating digital video summaries utilizing aesthetics, relevancy, and generative neural networks
CN106326391B (en) Multimedia resource recommendation method and device
CN109325148A (en) The method and apparatus for generating information
CN111798879B (en) Method and apparatus for generating video
CN109117777A (en) The method and apparatus for generating information
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
US20140029801A1 (en) In-Video Product Annotation with Web Information Mining
CN111259192B (en) Audio recommendation method and device
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN111708941A (en) Content recommendation method and device, computer equipment and storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN111444357A (en) Content information determination method and device, computer equipment and storage medium
CN112989209B (en) Content recommendation method, device and storage medium
CN111314732A (en) Method for determining video label, server and storage medium
CN111444387A (en) Video classification method and device, computer equipment and storage medium
CN113010701A (en) Video-centered fused media content recommendation method and device
CN113766330A (en) Method and device for generating recommendation information based on video
CN114154013A (en) Video recommendation method, device, equipment and storage medium
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN115640449A (en) Media object recommendation method and device, computer equipment and storage medium
CN113709529B (en) Video synthesis method, device, electronic equipment and computer readable medium
CN110569447A (en) network resource recommendation method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant