CN112565825B - Video data processing method, device, equipment and medium - Google Patents

Video data processing method, device, equipment and medium

Info

Publication number
CN112565825B
CN112565825B CN202011390109.2A
Authority
CN
China
Prior art keywords
video
target
template
segment
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011390109.2A
Other languages
Chinese (zh)
Other versions
CN112565825A (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011390109.2A
Publication of CN112565825A
Priority to PCT/CN2021/133035 (WO2022116888A1)
Application granted
Publication of CN112565825B
Priority to US17/951,621 (US12094209B2)
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H04N21/2393Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4826End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present application provide a video data processing method, apparatus, device and medium. The method relates to the field of artificial intelligence and includes the following steps: acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain video segments of the video data; determining a video template associated with the target user, and acquiring the template segments mapped from the video template and the template tag sequence corresponding to those template segments; screening, based on the template segments and the template tag sequence, the video segments that satisfy a segment matching condition and using them as video material segments of the target video; and pushing the video data and the video material segments to the application client corresponding to the target user, so that the application client outputs the video data and the video material segments. With the method and apparatus, the intelligence and controllability of short-video generation can be enhanced, and the efficiency of generating short videos can be improved.

Description

Video data processing method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for processing video data.
Background
With the development of multimedia technology, video has become a main carrier through which people obtain information and entertainment in daily life. With the popularity of various video playing platforms, all kinds of short videos (i.e., highlight video compilations) have been derived from them. A short video here refers to video content played on such platforms that is suitable for viewing while on the move or during short leisure breaks.
At present, however, generating a short video requires considerable manpower and effort: materials are usually clipped manually, the video is composited manually, music is dubbed manually, and so on. This means that when short videos are produced through manual interaction it is difficult to guarantee the controllability of the generation result. In addition, because a large amount of labor time is spent synthesizing audio and video throughout the process, the generation efficiency of short videos (i.e., highlight video compilations) is reduced.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a device, and a medium for processing video data, which can enhance the intelligence and controllability of short video generation and can improve the generation efficiency of short videos.
An embodiment of the present application provides a video data processing method, including:
acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain video segments of the video data; one video segment corresponds to one segment attribute tag;
acquiring a user portrait of the target user, determining a video template associated with the target user based on the user portrait, and acquiring the template segments mapped from the video template and the template tag sequence corresponding to the template segments;
screening, based on the template segments and the template tag sequence, the video segments that satisfy a segment matching condition from the video segments, and using the video segments that satisfy the segment matching condition as video material segments of the target video; the target tag sequence formed by the segment attribute tags of the video material segments is the same as the template tag sequence;
and pushing the video data and the video material segments to an application client corresponding to the target user, so that the application client outputs the video data and the video material segments.
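For illustration only, the four steps above can be read as the following server-side control flow. This is a minimal sketch in Python; the patent does not prescribe any concrete API, so every callable here (analyze_video, select_template, match_segments) is a hypothetical placeholder supplied by the caller.

# Minimal sketch of the claimed four-step flow; the callables are placeholders
# supplied by the caller, because the patent does not fix any concrete API.
from typing import Callable, Dict, List, Tuple

def process_play_request(video_data: bytes,
                         user_portrait: Dict,
                         analyze_video: Callable[[bytes], List[Tuple[object, str]]],
                         select_template: Callable[[Dict], object],
                         match_segments: Callable[[List[Tuple[object, str]], object], List[object]]):
    # Step 1: video analysis -> video segments, one attribute tag per segment.
    tagged_segments = analyze_video(video_data)
    # Step 2: choose a video template that fits the user portrait,
    #         together with its template segments and template tag sequence.
    template = select_template(user_portrait)
    # Step 3: keep the video segments whose tag sequence equals the template tag sequence.
    material_segments = match_segments(tagged_segments, template)
    # Step 4: both the original video data and the material segments go back to the client.
    return {"video_data": video_data, "material_segments": material_segments}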
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the segment generation module is used for acquiring video data of a target video requested by a target user and performing video analysis on the video data to obtain video segments of the video data; one video segment corresponds to one segment attribute tag;
the template acquisition module is used for acquiring a user portrait of the target user, determining a video template associated with the target user based on the user portrait, and acquiring the template segments mapped from the video template and the template tag sequence corresponding to the template segments;
the material determining module is used for screening, based on the template segments and the template tag sequence, the video segments that satisfy a segment matching condition from the video segments, and using the video segments that satisfy the segment matching condition as video material segments of the target video; the target tag sequence formed by the segment attribute tags of the video material segments is the same as the template tag sequence;
and the data sending module is used for pushing the video data and the video material segments to an application client corresponding to the target user, so that the application client outputs the video data and the video material segments.
The apparatus further includes:
the request receiving module, used for receiving a video playing request sent by the application client; the video playing request is generated by the application client in response to a playing operation performed by the target user for the target video;
and the data searching module, used for extracting the video identifier of the target video from the video playing request, searching a video service database for the service video data corresponding to the target video based on the video identifier, and using the found service video data as the video data of the target video in the application client.
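As a rough illustration of the request receiving module and the data searching module, the sketch below models the video service database as an in-memory mapping from a video identifier to service video data. The request field name "video_id" and the dictionary-backed database are assumptions made for the example, not details taken from the patent.

# Hypothetical request handling: the field name "video_id" and the dict-backed
# database are assumptions used only to illustrate the two modules.
class VideoNotFound(Exception):
    pass

VIDEO_SERVICE_DB = {
    "video_20b": b"...encoded service video data...",   # placeholder payload
}

def handle_play_request(play_request: dict) -> bytes:
    # Request receiving module: the request was generated by the application
    # client in response to the target user's play operation.
    video_id = play_request["video_id"]          # extract the video identifier
    # Data searching module: look the identifier up in the video service database.
    try:
        return VIDEO_SERVICE_DB[video_id]        # found service video data = video data of the target video
    except KeyError:
        raise VideoNotFound(video_id)

# Example: handle_play_request({"video_id": "video_20b"})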
The segment generation module includes:
a model acquisition unit, used for acquiring video data of a target video requested by a target user and a network identification model associated with the video data;
a shot acquisition unit, used for performing shot segmentation on the video sequence corresponding to the video data through a video segmentation component to obtain the shot segments associated with the video sequence;
a tag determining unit, used for inputting the shot segments into the network identification model so that the network identification model performs attribute analysis on the shot segments to obtain the segment attribute tag corresponding to each shot segment;
and a segment determining unit, used for determining the shot segments carrying segment attribute tags as the video segments of the video data.
The shot acquisition unit includes:
a component acquisition subunit, used for acquiring the video segmentation component that performs shot segmentation on the video sequence of the video data, acquiring, through the video segmentation component, a first video frame serving as a cluster centroid in the video sequence, and creating the shot-cluster information of the shot cluster to which the first video frame belongs;
an image matching subunit, used for determining the video frames other than the first video frame in the video sequence as second video frames, sequentially acquiring the second video frames based on a polling mechanism, and determining the image similarity between each second video frame and the first video frame;
a shot creating subunit, used for assigning a second video frame whose image similarity with the first video frame is greater than or equal to the clustering threshold to the shot cluster to which the first video frame belongs;
a matching completion subunit, used for, if the image similarity between the first video frame and a second video frame is smaller than the clustering threshold, updating the first video frame with that second video frame, creating the shot-cluster information of the new shot cluster to which the updated first video frame belongs, and sequentially performing image similarity matching between the updated first video frame and the not-yet-matched second video frames, until all video frames in the video sequence have completed image similarity matching, so as to obtain the shot-cluster information of the shot cluster to which each video frame in the video sequence belongs;
and a shot determining subunit, used for determining the shot segments associated with the video sequence based on the shot-cluster information of the shot clusters to which the video frames in the video sequence belong.
The network identification model includes at least: a first network model with a first attribute-tag extraction function, a second network model with a second attribute-tag extraction function, and a third network model with a third attribute-tag extraction function;
the tag determining unit includes:
a first analysis subunit, used for inputting the shot segments into the first network model, performing shot-scale (long-shot/close-up) analysis on each of the shot segments through the first network model to obtain the shot-scale label of each shot segment, using the shot-scale labels as the first attribute labels output by the first network model, and using the shot segments carrying the first attribute labels as first-class shot segments;
a face detection subunit, used for inputting the first-class shot segments into the second network model, and performing face detection on each of the first-class shot segments with the second network model to obtain a face detection result;
a second analysis subunit, used for, if the face detection result indicates that the face of a target character exists in the first-class shot segments, taking the shot segments corresponding to the face of the target character as second-class shot segments among the first-class shot segments, determining, through the second network model, the character label to which the target character in the second-class shot segments belongs, and determining the character label of the target character as the second attribute label of the second-class shot segments; the target character is one or more characters in the target video;
a third analysis subunit, used for determining, among the first-class shot segments, the shot segments other than the second-class shot segments as third-class shot segments, inputting the third-class shot segments into the third network model, and performing scene detection on each of the third-class shot segments with the third network model to obtain the third attribute labels of the third-class shot segments;
and a label analysis subunit, used for determining the segment attribute tag corresponding to each shot segment according to the first attribute labels of the first-class shot segments, the second attribute labels of the second-class shot segments, and the third attribute labels of the third-class shot segments.
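The three-model cascade described above (shot-scale labels for all shot segments, character labels where a target character's face is detected, scene labels for the rest) can be sketched as follows. The three classifier callables stand in for the first, second and third network models, whose architectures the patent does not specify; the label names are illustrative.

# Sketch of the three-model cascade; the three callables stand in for the first,
# second and third network models, which the patent does not define concretely.
from typing import Callable, List, Optional, Tuple

def tag_shot_segments(shots: List[object],
                      scale_model: Callable[[object], str],               # e.g. "close-up" / "long-shot"
                      character_model: Callable[[object], Optional[str]], # character label or None
                      scene_model: Callable[[object], str]) -> List[Tuple[object, dict]]:
    tagged = []
    for shot in shots:
        labels = {"scale": scale_model(shot)}        # first attribute label (shot scale)
        character = character_model(shot)            # face detection + character recognition
        if character is not None:
            labels["character"] = character          # second attribute label (character)
        else:
            labels["scene"] = scene_model(shot)      # third attribute label (scene)
        tagged.append((shot, labels))
    return tagged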
The template acquisition module includes:
a behavior extraction unit, used for acquiring a behavior log table of the target user and extracting the behavior data information associated with the target user from the behavior log table;
a behavior analysis unit, used for performing user-portrait analysis on the behavior data information to obtain a user portrait that characterizes the target user, and determining a video template associated with the target user based on the user portrait; the video template carries a template tag sequence formed by the template attribute tags of the template segments; the template segments are obtained by performing video analysis on a template video; the template video is determined from the behavior data information;
and a template analysis unit, used for acquiring the template segments mapped from the video template and the template tag sequence corresponding to the template segments.
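One possible, purely illustrative reading of "determining a video template based on the user portrait" is a tag-overlap score between the portrait's interest tags and each template's theme tags. The scoring rule, field names and example data below are assumptions, not the patent's own mechanism.

# Illustrative template selection: score each template by how many of its theme
# tags appear in the user portrait's interest tags (scoring rule is assumed).
def select_video_template(user_portrait: dict, templates: list) -> dict:
    interest_tags = set(user_portrait.get("interest_tags", []))
    def overlap(template):
        return len(interest_tags & set(template["theme_tags"]))
    return max(templates, key=overlap)

# Hypothetical example data:
portrait = {"interest_tags": ["romance", "close-up", "lead-actress"]}
templates = [
    {"name": "emotion-compilation", "theme_tags": ["romance", "close-up"],
     "tag_sequence": ["close-up", "lead-actress", "long-shot"]},
    {"name": "action-mix-cut", "theme_tags": ["action", "long-shot"],
     "tag_sequence": ["long-shot", "long-shot", "close-up"]},
]
# select_video_template(portrait, templates) -> the "emotion-compilation" template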
The number of template segments is N, where N is a positive integer greater than 1; the template tag sequence includes N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template segment;
the material determining module includes:
a template tag determining unit, used for acquiring a target template segment from the N template segments, determining the sequence position of the target template segment in the template tag sequence as the target sequence position, and determining the template attribute tag corresponding to the target sequence position as the target template attribute tag;
a tag screening unit, used for screening, from the segment attribute tags corresponding to the video segments, the segment attribute tags that match the target template attribute tag, and determining the video segments corresponding to the screened segment attribute tags as candidate video segments;
a segment matching unit, used for performing similarity analysis between each of the candidate video segments and the target template segment to obtain the similarity between each candidate video segment and the target template segment, determining the maximum similarity among those similarities, and determining the candidate video segment corresponding to the maximum similarity as the target candidate video segment matching the target template segment;
and a material generation unit, used for determining, based on the target sequence position of the target template segment in the template tag sequence, the target tag sequence formed by the segment attribute tags corresponding to the target candidate video segments, and determining, according to each target candidate video segment associated with the target tag sequence, the video material segment that satisfies the segment matching condition.
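The tag screening and segment matching units can be pictured as the loop below: for each sequence position, the candidates are the video segments whose attribute tag equals the template attribute tag at that position, and the candidate most similar to the target template segment wins the position. Cosine similarity over per-segment feature vectors is an assumption; the patent only speaks of a similarity analysis without fixing a measure.

# Sketch of the matching step; cosine similarity over per-segment feature
# vectors is an assumption, since the patent only speaks of "similarity analysis".
import numpy as np

def pick_material_segments(video_segments, template):
    """video_segments: list of dicts {"tag": str, "feature": np.ndarray, "clip": object}
    template: dict {"tag_sequence": [str, ...], "segment_features": [np.ndarray, ...]}"""
    chosen = []
    for position, template_tag in enumerate(template["tag_sequence"]):
        # Tag screening: candidates share the attribute tag at this sequence position.
        candidates = [seg for seg in video_segments if seg["tag"] == template_tag]
        if not candidates:
            return None   # no segment sequence can reproduce the template tag sequence
        t_feat = template["segment_features"][position]
        def cosine(seg):
            f = seg["feature"]
            return float(np.dot(f, t_feat) / (np.linalg.norm(f) * np.linalg.norm(t_feat) + 1e-12))
        # Segment matching: the candidate with the highest similarity wins this position.
        chosen.append(max(candidates, key=cosine))
    return chosen   # their tags, in order, equal the template tag sequence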
The material generation unit includes:
a video splicing subunit, used for performing video splicing processing on each target candidate video segment associated with the target tag sequence to obtain the spliced video data associated with the N template segments;
and a material synthesis subunit, used for acquiring the template audio data associated with the N template segments, and performing audio-video combination processing on the template audio data and the spliced video data through an audio-video synthesis component to obtain the video material segment that satisfies the segment matching condition.
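The video splicing and audio-video combination steps could, for example, be realized with a standard tool such as ffmpeg (concat demuxer for the spliced video data, stream mapping to attach the template audio data). The command below is a hedged sketch rather than the patent's audio-video synthesis component, and the file names are placeholders.

# Hedged sketch: concatenate the chosen clips and overlay the template audio with
# ffmpeg (concat demuxer + stream mapping); file names are placeholders.
import os
import subprocess
import tempfile

def splice_with_template_audio(clip_paths, template_audio_path, output_path):
    # Write the concat list expected by ffmpeg's concat demuxer.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{os.path.abspath(path)}'\n")
        list_path = f.name
    try:
        subprocess.run(
            ["ffmpeg", "-y",
             "-f", "concat", "-safe", "0", "-i", list_path,   # spliced video data
             "-i", template_audio_path,                       # template audio data
             "-map", "0:v", "-map", "1:a",                    # video from clips, audio from template
             "-c:v", "copy", "-shortest", output_path],
            check=True)
    finally:
        os.remove(list_path)

# Example: splice_with_template_audio(["s1.mp4", "s2.mp4"], "template.mp3", "material.mp4")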
An embodiment of the present application provides a video data processing method, including:
in response to a playing operation performed by a target user for a target video in an application client, acquiring, from a server, the video data of the target video and the video material segments associated with the target video; the video material segments are obtained by the server screening the video segments of the target video according to the template segments of a video template and the template tag sequence corresponding to the template segments; the video segments are obtained by the server performing video analysis on the video data; the video template is determined by the server based on the user portrait of the target user; the target tag sequence formed by the segment attribute tags of the video material segments is the same as the template tag sequence;
and outputting the video data and the video material segments in an application display interface of the application client.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the data acquisition module is used for, in response to a playing operation performed by a target user for a target video in the application client, acquiring, from the server, the video data of the target video and the video material segments associated with the target video; the video material segments are obtained by the server screening the video segments of the target video according to the template segments of a video template and the template tag sequence corresponding to the template segments; the video segments are obtained by the server performing video analysis on the video data; the video template is determined by the server based on the user portrait of the target user; the target tag sequence formed by the segment attribute tags of the video material segments is the same as the template tag sequence;
and the data output module is used for outputting the video data and the video material segments in an application display interface of the application client.
The data acquisition module includes:
a request sending unit, used for generating, in response to the playing operation performed by the target user for the target video in the application client, a video playing request for requesting to play the target video, and sending the video playing request to the server; the video playing request carries the video identifier of the target video; the video identifier is used to instruct the server to acquire the video data of the target video that the target user requests to play;
and a data receiving unit, used for receiving the video data returned by the server based on the video playing request and the video material segments associated with the target video; the video material segments are obtained by the server performing video analysis and video matching on the video data according to the video template after the server determines the video template from the user portrait of the target user, where the user portrait is determined from the user behavior information of the target user in the application client.
The data output module includes:
a video playing unit, used for determining, in the application display interface of the application client, a video playing interface for playing the video data, and playing the video data in the video playing interface;
and a material output unit, used for playing the video material segments in the application display interface in response to a trigger operation on the application display interface.
In one aspect, the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the method provided by the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method as provided by the embodiments of the present application.
An aspect of an embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided by the embodiment of the application.
In the embodiments of the present application, when the computer device obtains the video data of a video requested by a target user, it can perform video analysis on the video data to obtain one or more video segments of the video data. The video analysis involved in the embodiments of the present application mainly includes video shot segmentation and attribute analysis. Shot segmentation means that the video data can be divided into one or more shot segments, so that the server can further perform attribute analysis on the content of each shot segment to obtain the segment attribute tag of each shot segment; the shot segments carrying segment attribute tags are collectively referred to as the aforementioned video segments, and one video segment corresponds to one segment attribute tag. Therefore, by performing video analysis (for example, shot segmentation and attribute analysis) on the video data, the embodiments of the present application can quickly obtain one or more video segments carrying segment attribute tags. On this basis, when one or more video templates are accurately determined from the user portrait, the video segments can be intelligently screened according to the template tag sequence of each video template, so that video segments whose playing effect is similar to that of the video template can be obtained quickly, and the video material segments (for example, short videos that can be displayed to the target user) can then be synthesized quickly.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for performing data interaction according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a scene for querying video data according to an embodiment of the present application;
fig. 5 is a schematic view of a scene of performing shot segmentation according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of extracting a segment attribute tag according to an embodiment of the present application;
fig. 7 is a schematic view of a scene for acquiring a video template according to an embodiment of the present application;
fig. 8a is a schematic view of a scene for performing video analysis on a template video according to an embodiment of the present application;
fig. 8b is a schematic view of a scene for performing video analysis on a target video according to an embodiment of the present application;
fig. 9 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 10 is a schematic flowchart of generating a video material segment according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of front-end and back-end interaction provided by an embodiment of the present application;
fig. 12a is a schematic view of a scene of an output video material segment according to an embodiment of the present application;
fig. 12b is a schematic view of a scene for updating a video material segment according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;
fig. 16 is a video data processing system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be appreciated that Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement of targets, and to further perform graphic processing so that the result becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Specifically, please refer to fig. 1, where fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a service server 2000 and a user terminal cluster. The user terminal cluster may specifically include one or more user terminals, and here, the number of the user terminals in the user terminal cluster is not limited. As shown in fig. 1, the plurality of user terminals may specifically include a user terminal 3000a, a user terminal 3000b, user terminals 3000c, …, and a user terminal 3000 n. The user terminal 3000a, the user terminal 3000b, the user terminals 3000c, …, and the user terminal 3000n may be directly or indirectly connected to the service server 2000 through wired or wireless communication, so that each user terminal may perform data interaction with the service server 2000 through the network connection.
The service server 2000 shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform.
It should be understood that each user terminal in the user terminal cluster shown in fig. 1 may be integrally installed with an application client, and when the application client runs in each user terminal, data interaction may be performed with the service server 2000 shown in fig. 1. The application client may be understood as an application capable of loading and displaying video data, and for example, the application client may specifically include: an in-vehicle client, a smart home client, an entertainment client (e.g., a game client), a multimedia client (e.g., a video client), a social client, and an information client (e.g., a news client). For convenience of understanding, in the embodiment of the present application, one user terminal may be selected from the plurality of user terminals shown in fig. 1 as a target user terminal, and the target user terminal may include: the intelligent terminal comprises an intelligent terminal with a video data loading function, such as a smart phone, a tablet computer, a notebook computer and an intelligent television. For example, the embodiment of the present application may use the user terminal 3000a shown in fig. 1 as a target user terminal.
For example, when a user Y (i.e., the target user) needs to play a certain video (e.g., a video that user Y is interested in) on the target user terminal, the target user terminal may send a video playing request to the service server 2000 shown in fig. 1 in response to a trigger operation on that video, so that the service server 2000 can find the video data of the video in the video service database based on the video playing request and return the video data, together with one or more video material segments associated with it (e.g., highlight clips of the video), to the target user terminal, where the video data of the video requested by user Y is then played. Optionally, the target user terminal may also display the received video material segments while playing the video data. The video material segments here may be obtained by the service server 2000 screening the video segments of the target video according to the template segments of the selected video template and the template tag sequence corresponding to those template segments; the video segments may be obtained by the service server 2000 performing video analysis on the video data; and the video template may be determined by the service server 2000 based on the user portrait of user Y (i.e., the target user).
It is to be understood that, in the embodiment of the present application, videos (e.g., dramas or short videos, etc.) selected by the user Y (i.e., the target user) in the application client (e.g., the video client K) to fit the interest of the user Y may be collectively referred to as target videos.
The video material segments in the embodiments of the present application may be intelligently generated by the service server 2000 according to the template segments and the template tag sequence of a video template. For example, the service server 2000 may intelligently generate one or more video material segments of a target video (e.g., a TV series S1) selected by user Y on the target user terminal through the video data processing method of the embodiments of the present application. Generating the video material segments means that the service server 2000 performs tag matching and content similarity matching between the tag information (i.e., the segment attribute tags) of the video segments of the TV series S1 and the tag information (i.e., the template attribute tags) of the template segments of a template video (e.g., video M), and then, according to the results of the tag matching and the content similarity matching, screens out, from the video segments of the TV series S1, the video segments whose playing effect is similar to that of each template segment of video M, so that video material segments similar to video M can be intelligently generated from the spliced video data formed by the screened video segments and the template audio data of the template segments.
It should be understood that the network framework is applicable to the field of artificial intelligence (i.e., the AI field), and the service scenes corresponding to the AI field may be video classification scenes, video recommendation scenes, etc., and specific service scenes will not be enumerated here.
The video classification scenario herein mainly refers to that a computer device (e.g., the service server 2000) can store video clips of the same video in a first service database by using the video data processing method. For example, after the computer device generates a video material segment based on a certain video template (e.g., video material segment a1 generated based on video template B1 and video material segment a2 generated based on video template B2), video material segment a1 and video material segment a2 may be added to a corresponding short video recommendation database, which may include at least a first service database and a second service database. Where the first traffic database may be used to store one or more video material segments associated with the same video. For example, if the video material segment a1 and the video material segment a2 both belong to a video segment of the same video (e.g., video W), the video material segment a1 and the video material segment a2 may be added to the first service database corresponding to the video W. Optionally, if the video material segment a1 and the video material segment a2 belong to video segments of different videos, for example, if a target video corresponding to the video material segment a1 is a video W1 requested by the user Y1, the video material segment a1 may be added to a first service database corresponding to the video W1; if the target video corresponding to the video material segment a2 is the video W2 requested by the user Y2, the video material segment a2 may be added to the first service database corresponding to the video W2.
The second service database may be used to store one or more video material segments associated with the same video template. This means that, among the video material segments of different videos, the embodiments of the present application can add the video material segments that use the same video template to the second service database. For example, after the computer device generates a video material segment based on a certain video template (e.g., generates a video material segment a based on a video template B), the computer device may also add the video material segment a to the second service database corresponding to the video template B. For example, if the video template B is of the emoticon compilation class, the video material segment a may be added to the second service database corresponding to that class. For another example, if the video template B is of the story-line compilation class, the video material segment a may be added to the second service database corresponding to that class. For another example, if the video template B is of the character mixed-cut compilation class, the video material segment a may be added to the second service database corresponding to that class.
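A simple way to picture the two service databases described above is as two indexes over the same material segments, one keyed by the source video and one keyed by the video template. The in-memory dictionaries below are an illustration only; a real deployment would use a persistent store.

# Illustrative in-memory version of the two service databases described above.
from collections import defaultdict

first_service_db = defaultdict(list)    # keyed by source video -> material segments of that video
second_service_db = defaultdict(list)   # keyed by video template -> material segments built from it

def register_material_segment(segment_id, source_video_id, template_id):
    first_service_db[source_video_id].append(segment_id)
    second_service_db[template_id].append(segment_id)

# Example from the text: a1 and a2 both come from video W but use different templates.
register_material_segment("a1", source_video_id="W", template_id="B1")
register_material_segment("a2", source_video_id="W", template_id="B2")
# first_service_db["W"]   -> ["a1", "a2"]   (same-video grouping)
# second_service_db["B1"] -> ["a1"]         (same-template grouping)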
Further, in the above video recommendation scenario, after intelligently generating the video material segments of the target video requested by the target user (e.g., the video material segment a1 generated based on the video template B1 and the video material segment a2 generated based on the video template B2) through the above video data processing method, the computer device (e.g., the service server 2000) may also add the video material segments of the same video W (i.e., the segments in the first service database, such as a1 and a2) to a short video recommendation list (e.g., short video recommendation list 1), so as to intelligently push the segments in short video recommendation list 1 to the target user. In this way, after the target user finishes watching the video W (i.e., the target video) in the application client, the segments in short video recommendation list 1 can be played for the target user one by one in the application client. For example, when the target user finishes watching the video W on the target user terminal, the computer device (e.g., the service server 2000) may output the video material segment a1 among the segments in short video recommendation list 1 to the application client, so as to play the video material segment a1 intelligently in the application client. The video material segment a1 has a playing effect similar to that of the template segments mapped from the video template B1.
Optionally, after intelligently generating video material segments through the video data processing method (e.g., the video material segment a1 and the video material segment A3, both generated based on the video template B1), the computer device (e.g., the service server 2000) may also add the segments that use the same video template B1 (i.e., the segments in the second service database, such as a1 and A3) to another short video recommendation list (e.g., short video recommendation list 2), so as to intelligently push the segments in short video recommendation list 2 to the target user. In this way, when the target user watches a segment in short video recommendation list 2 (e.g., the video material segment a1) in the application client, the other segments in short video recommendation list 2 can then be played for the target user in the application client. For example, when the target user finishes watching the video material segment a1 on the target user terminal, the computer device (e.g., the service server 2000) may output another segment in short video recommendation list 2 (e.g., the video material segment A3) to the application client for intelligent playing. It can be understood that, since the video material segment A3 and the video material segment a1 are generated from the same video template, playing them in the application client presents the target user with a playing effect similar to that of the template segments mapped from the video template B1.
For easy understanding, please refer to fig. 2, and fig. 2 is a schematic diagram of a scenario for performing data interaction according to an embodiment of the present application. For convenience of understanding, in the embodiment of the present application, the user terminal 3000a shown in fig. 1 is taken as the user terminal X, so as to illustrate a specific process of recommending a video material segment for a target user when the service scene is a video recommendation scene.
As shown in fig. 2, the video recommendation interface 200a may include a plurality of recommended video data, where the plurality of recommended video data may specifically include the video data 20a, the video data 20b, the video data 20c, and the video data 20d shown in fig. 2. It is understood that the video data 20a, the video data 20b, the video data 20c, and the video data 20d displayed in the video recommendation interface 200a may be collectively referred to as recommended video data in the embodiments of the present application.
As shown in fig. 2, when a target user needs to play certain recommended video data (e.g., video data 20b), the video data 20b selected by the target user from the video recommendation interface 200a may be collectively referred to as a target video in the application display interface. At this time, the user terminal may send a video play request to the server shown in fig. 2 in response to a play operation for the target video in the application display interface. At this time, the server may respond to the video playing request to output the video playing interface corresponding to the target video at the application client, for example, the video playing interface corresponding to the video data 20b may be output at the application client, and the video playing interface corresponding to the video data 20b may be the video playing interface 200b shown in fig. 2. The application display interface may include a video playing interface 200b for playing a target video, and may further include a short video recommendation list for displaying video material segments, where the short video recommendation list may include video material segments associated with the target video.
The server may obtain the video identifier of the target video from the video playing request when receiving the video playing request sent by the target user through the user terminal, and query the video data of the target video in the video service database according to the video identifier. After querying the video data of the target video, the server may perform the above-mentioned video analysis on the video sequence of the video data to obtain video segments of the video data, where the video segments may specifically include the video segment 100a, the video segment 100b, …, and the video segment 100k shown in fig. 2, and each of the video segments may correspond to one segment attribute tag.
Further, the server may obtain, based on the user portrait of the target user, a video template that fits the viewing interest of the target user, and may then obtain the template segments mapped from the video template and the template tag sequence corresponding to the template segments, so that the video segments matching each template segment (i.e., the video segments that satisfy the segment matching condition) can be screened from the video segments according to the template tag sequence, and the video material segments can then be obtained from the screened segments. In this way, the embodiments of the present application can obtain, from the video segments, segments that have the same tag-sequence characteristics as the template segments, and assemble the video material segments according to that same tag sequence (i.e., the template tag sequence), for example to obtain one or more short videos of the target video, so that the user terminal can output the video material segments together with the video data in the application client. It can be understood that one video template may correspond to one or more video material segments; the number of video material segments with the same tag-sequence characteristics screened from the video segments of the target video is not limited here.
For ease of understanding, the embodiments of the present application take the case in which one video template corresponds to one video material segment as an example. When the server determines that there are multiple (for example, N) video templates matching the viewing interest of the target user, the N video material segments generated from these N video templates may likewise be obtained through the embodiments of the present application. For a specific implementation of intelligently generating the other video material segments through the N video templates, reference may be made to the description of the specific process of intelligently generating a video material segment, and details are not repeated here.
For a specific implementation manner of recommending a video material segment in a target user terminal, reference may be made to the following embodiments corresponding to fig. 3 to 12 b.
Further, please refer to fig. 3, wherein fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application. As shown in fig. 3, the method may be executed by an application client, or may be executed by a server, or may be executed by both the application client and the server, where the application client may be an application client running in the user terminal X in the embodiment corresponding to fig. 2, and the server may be a server in the embodiment corresponding to fig. 2. For the convenience of understanding, the embodiment is described as an example in which the server performs the method to illustrate a specific process of generating the video material segment corresponding to the target video based on the video template in the server. Wherein, the method at least comprises the following steps S101-S104:
step S101, acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain a video segment of the video data;
specifically, the server may obtain video data of a target video requested by a target user and a network identification model associated with the video data. Further, the server may perform a mirror segmentation process on the video sequence corresponding to the video data through the video segmentation component, so as to obtain a mirror segment associated with the video sequence. Further, the server may input the split mirror segments into the network identification model, and the network identification model performs attribute analysis on the split mirror segments to obtain segment attribute tags corresponding to the split mirror segments. Further, the server may determine the split-mirror segment with the segment attribute tag as a video segment of the video data. Wherein one video clip may correspond to one clip attribute tag.
It should be understood that the server may receive the video playing request sent by the application client before obtaining the video data of the target video requested by the target user. The video playing request is generated by the application client in response to the playing operation executed by the target user aiming at the target video. Further, the server may extract the video identifier of the target video from the video playing request, search the service video data corresponding to the target video in the video service database based on the video identifier, and use the searched service video data as the video data of the target video in the application client.
For ease of understanding, please refer to fig. 4, and fig. 4 is a schematic view illustrating a scene of querying video data according to an embodiment of the present application. As shown in fig. 4, the application display interface 400a may be the application display interface 200a in the embodiment corresponding to fig. 2. When the target user performs a trigger operation (i.e., a play operation) with respect to the video data 40b in the application display interface 400a of the application client, the application client may regard the video data 40b as the target video and send a video playing request carrying the video identifier of the video data 40b to the server. The server may receive the video playing request sent by the application client, obtain the video identifier of the video data 40b carried in the video playing request, search, based on the video identifier, the service video data corresponding to the video identifier in the video service database corresponding to the application client, and use the searched service video data as the video data corresponding to the video data 40b.
It can be understood that the target video may be a long video such as a variety program, a movie, a television show, or a short video captured from the long video, and the like, which is not limited in the present application.
It should be understood that the specific process of the server performing the mirror segmentation processing on the video sequence corresponding to the video data through the video segmentation component to obtain the mirror segments associated with the video sequence may be described as follows: when obtaining a video segmentation component for performing the mirroring process on the video sequence of the video data, the server may obtain, by the video segmentation component, a first video frame serving as a cluster centroid in the video sequence, and create mirroring cluster information of a mirroring cluster to which the first video frame belongs (it may be understood that the mirroring cluster information here may be an identifier configured for the corresponding mirroring cluster). Further, the server may determine video frames other than the first video frame in the video sequence as second video frames, and may sequentially acquire the second video frames based on a polling mechanism to determine the image similarity between each second video frame and the first video frame. Further, if the image similarity between the first video frame and a second video frame is greater than or equal to the clustering threshold, the server may divide the second video frame whose image similarity is greater than or equal to the clustering threshold into the mirroring cluster to which the first video frame belongs. Further, if the image similarity between the first video frame and a second video frame is smaller than the clustering threshold, the server may update the first video frame with the second video frame whose image similarity is smaller than the clustering threshold, create mirroring cluster information of another mirroring cluster to which the updated first video frame belongs, and then sequentially perform image similarity matching between the updated first video frame and the unmatched second video frames, until all video frames in the video sequence have completed image similarity matching, at which point the mirroring cluster information of the mirroring cluster to which each video frame in the video sequence belongs may be obtained (i.e., each video frame in the video sequence has been partitioned into a mirroring cluster). Further, the server may determine the mirrored segments associated with the video sequence based on the mirroring cluster information of the mirroring clusters to which the video frames in the video sequence belong.
It can be understood that the image similarity matching refers to calculating the similarity of the content between two images, and may obtain the image similarity for determining the similarity of the image content, where the greater the image similarity is, the more similar the two images are, and the smaller the image similarity is, the more dissimilar the two images are. The similarity of the contents between the two images can be measured by using different methods, for example, the cosine similarity can represent the images as a vector, and the similarity of the two images is represented by calculating the cosine distance between the vectors; the histogram can describe the global distribution of colors in an image, and is an entry-level image similarity calculation method; the structural similarity measurement is a full-reference image quality evaluation index, and measures the image similarity from three aspects of brightness, contrast and structure. It should be understood that the present application does not limit the method specifically used in image similarity matching.
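For ease of understanding, the following is a minimal Python sketch, for illustration only, of one of the image similarity measures mentioned above (a color histogram comparison); OpenCV is assumed to be available, and the function name and bin sizes are illustrative assumptions rather than part of the embodiment.

```python
# Illustrative sketch of histogram-based image similarity (one of the measures
# described above); assumes OpenCV (cv2) and BGR frames as input.
import cv2

def histogram_similarity(frame_a, frame_b, bins=(8, 8, 8)):
    """Return a correlation-based similarity (1.0 for identical histograms)."""
    hists = []
    for frame in (frame_a, frame_b):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                            [0, 180, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
```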
For easy understanding, please refer to fig. 5, and fig. 5 is a schematic view of a scenario for performing the split mirror processing according to an embodiment of the present application. The video sequence shown in fig. 5 may include a plurality of video frames, and specifically may include the n video frames shown in fig. 5, where n may be a positive integer greater than 1, and the n video frames may specifically include: video frame 10a, video frame 10b, video frame 10c, video frame 10d, …, video frame 10n. It should be understood that the image similarity between video frames in the video sequence can be calculated through a clustering algorithm, so that the video frames in the video sequence can be divided into different clusters (i.e., mirror clusters) based on the calculated image similarity between the video frames. For example, the k clusters (i.e., k mirrored clusters) shown in fig. 5 can be obtained by the clustering algorithm, and the k clusters may specifically include the cluster 20a, the cluster 20b, …, and the cluster 20k shown in fig. 5. It can be understood that each of the k clusters shown in fig. 5 may include at least one video frame.
Specifically, in the video sequence shown in fig. 5, the first video frame in the video sequence (i.e., video frame 10a) may be used as a clustering centroid (i.e., clustering centroid 1) and referred to as the first video frame, video frames other than video frame 10a in the video sequence may be determined as second video frames, and the second video frames (i.e., video frame 10b, video frame 10c, …, and video frame 10n) may be sequentially acquired based on a polling mechanism, so as to sequentially calculate the image similarity between the first video frame and each second video frame. The embodiment of the present application may create a mirror cluster (i.e., mirror cluster 1) to which the clustering centroid 1 belongs, and further perform image similarity matching on the video frame 10b and the video frame 10a; when the image similarity (e.g., similarity 1) between the video frame 10b and the video frame 10a is greater than or equal to the clustering threshold, the video frame 10b corresponding to the similarity 1 is divided into the mirror cluster (i.e., mirror cluster 1) to which the video frame 10a belongs. Similarly, the video frame 10c may be divided into the mirror cluster (i.e., mirror cluster 1) to which the video frame 10a belongs.
Further, since the video frame 10d is the next video frame of the video frame 10c, the present application may perform image similarity matching between the video frame 10d and the video frame 10a, when the image similarity (for example, similarity 2) between the video frame 10d and the video frame 10a is smaller than the clustering threshold, update the first video frame according to the video frame 10d, that is, the video frame 10d corresponding to the similarity 2 is used as a new clustering centroid (that is, clustering centroid 2), and create a mirror cluster (that is, mirror cluster 2) to which the clustering centroid 2 belongs, and further, sequentially acquire the second video frame (that is, video frames 10e, …, and video frame 10n) based on the polling mechanism, so as to sequentially calculate the image similarity between the updated first video frame and the second video frame. The video frame 10e may be divided into the mirror cluster (i.e., the mirror cluster 2) to which the video frame 10d belongs.
Wherein, it can be understood that, after acquiring cluster centroid 1 and cluster centroid 2, the same method can be used to acquire cluster centroid 3, cluster centroid 4, …, and cluster centroid k. Similarly, after obtaining the mirror cluster 1 (i.e., cluster 20a) and the mirror cluster 2 (i.e., cluster 20b), the same method can be used to obtain the mirror cluster 3 (i.e., cluster 20c), the mirror cluster 4 (i.e., cluster 20d), …, and the mirror cluster k (i.e., cluster 20k). At this time, the video frames 10a, 10b, …, and 10n in the video sequence have all completed image similarity matching.
Therefore, by performing clustering processing (i.e. split-mirror processing) on the video frames in the video sequence shown in fig. 5, a plurality of clustering clusters (i.e. split-mirror clusters) associated with the video sequence can be obtained, so that the video frames in each clustering cluster can form a split-mirror segment, and then k split-mirror segments shown in fig. 5 can be obtained. For example, the video frame 10a, the video frame 10b, and the video frame 10c in the cluster 20a may form a mirror segment corresponding to the mirror cluster 1 (i.e., mirror segment 1), the video frame 10d and the video frame 10e in the cluster 20b may form a mirror segment corresponding to the mirror cluster 2 (i.e., mirror segment 2), and …, and the video frame 10(n-2), the video frame 10(n-1), and the video frame 10n in the cluster 20k may form a mirror segment corresponding to the mirror cluster k (i.e., mirror segment k).
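For ease of understanding, the following is a minimal sketch, for illustration only, of the sequential clustering (split-mirror) procedure described above: the first frame seeds a cluster centroid, subsequent frames join the current mirror cluster while their similarity to the centroid stays at or above the clustering threshold, and a drop below the threshold starts a new mirror cluster. The similarity function and the threshold value are assumptions for illustration.

```python
# Illustrative sketch of the split-mirror clustering described above; the
# similarity_fn (e.g., histogram_similarity) and threshold are assumptions.
def split_into_mirror_segments(frames, similarity_fn, cluster_threshold=0.8):
    """Group an ordered list of frames into consecutive split-mirror segments."""
    if not frames:
        return []
    segments = [[frames[0]]]   # mirror cluster 1, seeded by the first video frame
    centroid = frames[0]       # current cluster centroid
    for frame in frames[1:]:   # poll the remaining (second) video frames in order
        if similarity_fn(centroid, frame) >= cluster_threshold:
            segments[-1].append(frame)   # same mirror cluster as the centroid
        else:
            centroid = frame             # update the first video frame (new centroid)
            segments.append([frame])     # create a new mirror cluster
    return segments
```

Each returned list of frames then corresponds to one split-mirror segment (e.g., mirror segment 1, …, mirror segment k in fig. 5).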
It should be appreciated that the video segmentation component that divides the video sequence corresponding to the target video into a plurality of split-mirror segments may be the PySceneDetect open-source code library, which is a tool that automatically segments video data into individual segments; the selection of the first video frame (cluster centroid) is not limited to the above manner. It can be understood that the method for dividing the video sequence corresponding to the target video into the plurality of split-mirror segments may also be a beat (drum point) detection method, that is, acquiring audio data of the target video, identifying beat points in the audio data, and determining the positions of the beat points in the video data of the target video according to the positions of the beat points in the audio data, so as to divide the video sequence of the video data. The video sequence may also be divided into the plurality of split-mirror segments in other manners, and the application does not limit the split-mirror method specifically used for video split-mirror.
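For ease of understanding, the following is a hedged sketch of cutting a video into split-mirror segments with the PySceneDetect library mentioned above; the API shown assumes PySceneDetect version 0.6 or later, and the file path is illustrative.

```python
# Illustrative use of PySceneDetect's content-based detector to obtain
# split-mirror segments; assumes PySceneDetect >= 0.6.
from scenedetect import detect, ContentDetector

def detect_mirror_segments(video_path):
    scene_list = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    for i, (start, end) in enumerate(scene_list, 1):
        print(f"split-mirror segment {i}: {start.get_timecode()} -> {end.get_timecode()}")
    return scene_list
```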
It is understood that the network identification model at least comprises: a first network model with a first attribute tag extraction function, a second network model with a second attribute tag extraction function, and a third network model with a third attribute tag extraction function. It should be understood that the server may input the split mirror segments into the first network model, perform far and near view analysis on each of the split mirror segments through the first network model to obtain far and near view tags of the split mirror segments, use the far and near view tags of the split mirror segments as the first attribute tags output by the first network model, and use the split mirror segments with the first attribute tags as first class split mirror segments. Further, the server may input the first class split mirror segments into the second network model, and the second network model performs face detection on each of the first class split mirror segments to obtain a face detection result. Further, if the face detection result indicates that the face of a target role exists in the first class split mirror segments, the server may, among the first class split mirror segments, use the split mirror segments corresponding to the face of the target role as second class split mirror segments, determine, through the second network model, the role tag to which the target role in the second class split mirror segments belongs, and determine the role tag to which the target role belongs as the second attribute tag of the second class split mirror segments. The target role is one or more roles in the target video. Further, the server may determine, among the first class split mirror segments, the split mirror segments other than the second class split mirror segments as third class split mirror segments, input the third class split mirror segments into the third network model, and perform scene detection on each split mirror segment in the third class split mirror segments through the third network model to obtain the third attribute tags of the third class split mirror segments. Further, the server may determine, according to the first attribute tags of the first class split mirror segments, the second attribute tags of the second class split mirror segments, and the third attribute tags of the third class split mirror segments, the segment attribute tag corresponding to each of the split mirror segments.
It can be understood that the first network model may be a far and near view recognition model, the second network model may be a face recognition model, and the third network model may be a scene recognition model; on this basis, the first network model, the second network model, and the third network model may alternatively be other models such as an expression recognition model or an action recognition model. It should be understood that, through the network recognition models trained in advance, the basic analysis capability for the target video can be improved, so that the video segments corresponding to the target video can be obtained quickly.
It can be understood that a face detection module may perform face detection on the split mirror segments to obtain a face detection result, and a face recognition module may further determine the role tag corresponding to a face in the face detection result, where the face detection module and the face recognition module may be collectively referred to as the second network model. Face detection and face recognition can be collectively referred to as image detection. In image detection, a machine learning technique learns annotated sample data (namely, the correspondence between a plurality of annotation boxes and labels in an image) to obtain a mathematical model, whose parameters are obtained during learning and training. During recognition and prediction, the parameters of the mathematical model are loaded, the prediction boxes of real object labels existing in an input sample and the probability that each prediction box belongs to a certain real object label within a specified range are calculated, and the real object label with the maximum probability can then be used as the label corresponding to that prediction box.
A split mirror segment may be directly input into the far and near view recognition model to obtain the far and near view tag (i.e., the first attribute tag) corresponding to the split mirror segment, and may likewise be directly input into the scene recognition model to obtain the scene tag (i.e., the third attribute tag) corresponding to the split mirror segment. Before the split mirror segments are input into the third network model, the target roles in the split mirror segments need to be identified in advance; that is, images of the target roles can be input into the second network model in advance, and the feature vectors of the target roles are extracted through the second network model. In this way, when the role tags of the split mirror segments are determined, all video frames in a split mirror segment can be extracted when the split mirror segment is input into the second network model, face detection is performed on these frames, and the detected face feature vectors are compared with the feature vectors of the target roles; if the similarity result of the feature vector comparison is greater than a threshold, the face is considered to be a target role, and the role tag of that target role is used as the role tag (namely, the second attribute tag) of the split mirror segment where the detected face is located.
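For ease of understanding, the following is a minimal sketch, for illustration only, of the role-tag comparison described above: feature vectors of the target roles are registered in advance, a detected face feature vector is compared with each registered vector by cosine similarity, and a match above a threshold yields the role tag. The embedding source, dictionary layout, and threshold are assumptions.

```python
# Illustrative sketch of matching a detected face feature vector against
# pre-registered target-role feature vectors; threshold value is an assumption.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_role(face_vector, registered_roles, threshold=0.7):
    """registered_roles: dict mapping role tag -> reference feature vector."""
    best_tag, best_score = None, threshold
    for tag, ref_vector in registered_roles.items():
        score = cosine_similarity(face_vector, ref_vector)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag   # None when no target role matches
```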
It should be understood that, with the aid of the network recognition models (i.e., the first network model, the second network model, and the third network model), the present application may directly input the split mirror segments into the first network model (i.e., the above-mentioned far and near view recognition model) without knowing the tag information of the split mirror segments in advance (i.e., without knowing any image noise information in advance), so as to obtain the first attribute tags corresponding to the split mirror segments (i.e., sample noise degree prediction is automatically performed according to the model so as to give a new tag to each split mirror segment), and the split mirror segments with the first attribute tags may thus be collectively referred to as first class split mirror segments. It can be understood that, after the first class split mirror segments are obtained, the new tags can be further automatically fed back to the subsequent models for learning, so that dynamic noise prediction and processing prevent the network recognition models from falling into local optima and ensure that model learning proceeds in a direction with a better recognition effect.
For example, the first class split mirror segments may be further input into the second network model, so that the second network model may perform face detection and face recognition on each of the first class split mirror segments, and all first class split mirror segments containing the face of a target role may be selected from the first class split mirror segments. It is understood that the selected first class split mirror segments containing the face of a target role may be collectively referred to as second class split mirror segments in the embodiments of the present application. In addition, it can be understood that the second network model can also be used to output the role tag to which the target role in each second class split mirror segment belongs; on this basis, the role tags to which the target roles belong in the second class split mirror segments can be collectively referred to as the second attribute tags of the second class split mirror segments. It should be understood that the target role may be one or more roles in the target video, and the number of target roles is not limited here. Further, in the embodiment of the present application, among the first class split mirror segments, the segments other than the second class split mirror segments may be collectively referred to as third class split mirror segments, and the third class split mirror segments may then be input into the third network model (i.e., the scene recognition model) to obtain the third attribute tags corresponding to the third class split mirror segments. Therefore, the tag information of the split mirror segments can be corrected in real time through the trained network recognition models, and the segment attribute tag of each split mirror segment can be accurately obtained according to the first attribute tags, the second attribute tags, and the third attribute tags.
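For ease of understanding, the following is a high-level sketch, under assumed model interfaces (a predict method per model), of how the three network models could be chained to produce a segment attribute tag for each split mirror segment: every segment receives a far and near view tag, segments containing a target role receive a role tag, and the remaining segments receive a scene tag.

```python
# Illustrative chaining of the first, second, and third network models; the
# model objects and their predict() interfaces are assumptions for illustration.
def tag_segments(segments, view_model, face_model, scene_model):
    tagged = []
    for segment in segments:
        tags = {"view": view_model.predict(segment)}      # first attribute tag
        role = face_model.predict(segment)                # role tag or None
        if role is not None:
            tags["role"] = role                           # second attribute tag
        else:
            tags["scene"] = scene_model.predict(segment)  # third attribute tag
        tagged.append((segment, tags))
    return tagged
```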
For easy understanding, please refer to fig. 6, and fig. 6 is a schematic flowchart illustrating a process of extracting a segment attribute tag according to an embodiment of the present application. The video data shown in fig. 6 may be the video data of the target video, and the specific process of obtaining the tag information of the split mirror segments may be described as follows: video segmentation is performed on the video sequence of the video data to obtain k split mirror segments, and the split mirror segments are further input into the network recognition models to obtain the tag information of each split mirror segment under the network recognition models, where the network recognition models may be the scene recognition model (i.e., the third network model), the far and near view recognition model (i.e., the first network model), and the face detection model and face recognition model (i.e., the second network model) shown in fig. 6.
It is to be understood that, as shown in fig. 6, after the k split mirror segments are input into the far and near view recognition model, the obtained far and near view tag (i.e., first attribute tag) corresponding to each split mirror segment may be: {split mirror 1: x1, split mirror 2: x2, …, split mirror k: xk}, where x1 denotes that the far and near view tag corresponding to split mirror 1 is x1, x2 denotes that the far and near view tag corresponding to split mirror 2 is x2, …, and xk denotes that the far and near view tag corresponding to split mirror k is xk. The far and near view tag may include, but is not limited to: long shot, close-up, etc. Here, split mirror 1, split mirror 2, …, and split mirror k may be the mirror segment 1, mirror segment 2, …, and mirror segment k in the embodiment corresponding to fig. 5.
It can be understood that, as shown in fig. 6, after the k split mirror segments are input into the far and near view recognition model, the k split mirror segments may be input into the face detection model and the face recognition model, and after the k split mirror segments are input into the face detection model and the face recognition model, the obtained role tag (i.e., the second attribute tag) corresponding to each split mirror segment may be: {split mirror 1: y1, split mirror 2: y2, split mirror 4: y4, …, split mirror k-1: yk-1}, where y1 indicates that the role tag corresponding to split mirror 1 is y1, y2 indicates that the role tag corresponding to split mirror 2 is y2, y4 indicates that the role tag corresponding to split mirror 4 is y4, …, and yk-1 indicates that the role tag corresponding to split mirror k-1 is yk-1. The role tags may include, but are not limited to: single person, double person, etc.; the role tags may also include, but are not limited to: male one, male two, female one, female two, girl A, boy B, etc. Here, split mirror 3, split mirror 5, …, and split mirror k do not include a role tag.
It can be understood that, as shown in fig. 6, after the k split mirror segments are input into the face detection model and the face recognition model, the split mirror segments for which no detection or recognition result is obtained (i.e., which include no role tag) may be input into the scene recognition model, and after these split mirror segments are input into the scene recognition model, the obtained scene tag (i.e., the third attribute tag) corresponding to each such split mirror segment may be: {split mirror 3: z3, split mirror 5: z5, …, split mirror k: zk}, where z3 denotes that the scene tag corresponding to split mirror 3 is z3, z5 denotes that the scene tag corresponding to split mirror 5 is z5, …, and zk denotes that the scene tag corresponding to split mirror k is zk. The scene tag may include, but is not limited to: natural scenes, indoor scenes, character buildings, bamboo groves, riversides, amusement parks, and the like.
It should be understood that, for any one of the k split mirror segments, the far and near view tag and the role tag, or the far and near view tag and the scene tag, of the segment may be used to collectively describe the segment attribute tag of the segment. For example, for split mirror 1 of the k split mirror segments, the far and near view tag and the role tag of split mirror 1 may be used together to describe the segment attribute tag of split mirror 1 (i.e., segment attribute tag 1); for example, if the far and near view tag corresponding to split mirror 1 is a long shot (i.e., x1 is a long shot) and the role tag corresponding to split mirror 1 is male one (i.e., y1 is male one), then segment attribute tag 1 corresponding to split mirror 1 may be: { long shot, male one }.
Step S102, acquiring a user portrait of a target user, determining a video template associated with the target user based on the user portrait, and acquiring a template fragment mapped by the video template and a template tag sequence corresponding to the template fragment;
specifically, the server may obtain a behavior log table of the target user, and extract behavior data information associated with the target user from the behavior log table. Further, the server may perform user portrait analysis on the behavior data information to obtain a user portrait for characterizing the target user, and determine a video template associated with the target user based on the user portrait. The video template can carry a template tag sequence formed by template attribute tags of template fragments, the template fragments are obtained by performing video analysis on the template video, and the template video is determined by behavior data information. Further, the server may obtain a template fragment mapped by the video template and a template tag sequence corresponding to the template fragment. It can be understood that, in the embodiment of the present application, behavior logs of different users in an application client, which are acquired by a server within a target duration, may be collectively referred to as a behavior log table.
It can be understood that the behavior data information is used to record the behavior interaction data (access, browse, search, click, etc.) generated each time the target user accesses the application client, where the behavior interaction data may specifically include the type of the video accessed by the target user, the time of browsing the video, the number of times of browsing the video, the record of searching the video, the number of times of clicking the video, as well as the videos collected, recommended, liked (favorited), purchased, and tipped with coins by the target user.
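For ease of understanding, the following is a minimal sketch, under an assumed behavior-log schema, of turning such behavior interaction data into a simple user portrait of per-video-type preference scores; the field names and weights are illustrative assumptions.

```python
# Illustrative construction of a user portrait from behavior interaction data;
# record fields ('video_type', 'action') and action weights are assumptions.
from collections import defaultdict

def build_user_portrait(behavior_records):
    weights = {"browse": 1.0, "click": 2.0, "search": 2.0,
               "collect": 3.0, "purchase": 4.0}
    portrait = defaultdict(float)
    for record in behavior_records:
        portrait[record["video_type"]] += weights.get(record["action"], 0.5)
    return dict(portrait)   # video type -> preference score
```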
For ease of understanding, please refer to fig. 7, and fig. 7 is a schematic view of a scene for acquiring a video template according to an embodiment of the present application. The log management system 70 shown in fig. 7 may specifically include a plurality of databases, and the plurality of databases may specifically include the database 70a, the database 70b, …, and the database 70n shown in fig. 7. This means that the log management system 70 can be used to store the behavior logs of different users in the application client. For example, database 70a may be used to store a behavior log of user Y1 (not shown), database 70b may be used to store a behavior log of user Y2 (not shown), …, and database 70n may be used to store a behavior log of user Yn (not shown).
As shown in fig. 7, when the target user is the user Y1, the server may obtain a behavior log table of the target user within the target duration from the database 70a, and may further obtain behavior data information from the behavior log table. It should be understood that after the server obtains the behavior data information of the target user, the server may perform user portrait analysis on the behavior data information within the target duration to obtain a user portrait for characterizing the target user.
The user portrait may include the degree of preference of the target user for a certain video type, and the server may further select a video template of that video type as the video template associated with the target user; similarly, the user portrait may include the degree of preference of the target user for a certain video, and the server may further select the video template corresponding to that video as the video template associated with the target user. It is to be understood that the template data corresponding to the video template here may be data having the same video type as the video data of the target video. For example, when the target video is an animation, a video template associated with the target video may be selected from video templates of the animation class; when the target video is a real-person drama, a video template associated with the target video may be selected from video templates of the real-person drama class. In this way, an optimal video template can be selected for the target video, and the display effect of the video material segments is improved.
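For ease of understanding, the following is a minimal sketch, under an assumed representation of the user portrait and the template library, of selecting a video template associated with the target user: templates whose video type matches the target video are preferred, and ties are broken by the portrait's preference score for the template's video type. All names and fields are illustrative assumptions.

```python
# Illustrative video-template selection based on the user portrait; the
# portrait is assumed to map video types to preference scores.
def select_video_templates(user_portrait, templates, target_video_type, top_n=1):
    def score(template):
        type_bonus = 1.0 if template["video_type"] == target_video_type else 0.0
        preference = user_portrait.get(template["video_type"], 0.0)
        return (type_bonus, preference)
    return sorted(templates, key=score, reverse=True)[:top_n]
```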
It is to be understood that the log management system 70 shown in fig. 7 may establish a behavior log table for a target user accessing the application client within a single behavior recording period (for example, taking one day as the behavior recording period). For example, the log management system 70 may establish a behavior log table for the target user when detecting that the target user accesses the application client for the first time on that day; at this time, an access timestamp (for example, time T1) of the current access to the application client is recorded in the behavior log table, which means that there is no other behavior interaction data in the behavior log table before the current time T1. Further, the log management system 70 may add the behavior log table (e.g., behavior log table 1) established for the target user to the corresponding database (e.g., database 70a shown in fig. 7) for storage when the current behavior recording period reaches the recording period threshold. Similarly, when the access timestamp of the target user is at another time (e.g., time T2), the log management system 70 may add the behavior log table (e.g., behavior log table 2) corresponding to time T2 to the corresponding database (e.g., database 70a shown in fig. 7) for storage.
It should be understood that when a target user accesses a client and generates an interactive behavior with the application client in a recording period, the interactive behavior between the target user and the application client can be recorded in the behavior log table of the recording period. It is understood that the target duration herein may specifically include: one or more recording cycles. Therefore, the behavior log table of the target user acquired by the server in the target duration (i.e. a plurality of recording periods before the application client is accessed this time) may specifically include the behavior log table 1, and the behavior log table 2.
Step S103, screening, from the video segments, the video segments meeting the segment matching conditions based on the template fragments and the template tag sequence, and taking the video segments meeting the segment matching conditions as video material segments of the target video;
specifically, the server may obtain the target template fragment from the N template fragments, determine a queue position of the target template fragment as a target queue position in the template tag sequence, and determine a template attribute tag corresponding to the target queue position as a target template attribute tag. The number of the template segments may be N, where N may be a positive integer greater than 1, and thus, the template tag sequence may include N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template segment. Further, the server may screen a segment attribute tag matched with the target template attribute tag from segment attribute tags corresponding to the video segments, and determine the video segment corresponding to the screened segment attribute tag as a candidate video segment. Further, the server may perform similarity analysis on each candidate video segment in the candidate video segments and the target template segment to obtain a similarity threshold of each candidate video segment and the target template segment, determine a maximum similarity threshold in the similarity thresholds, and determine the candidate video segment corresponding to the maximum similarity threshold as the target candidate video segment matched with the target template segment. Further, the server may determine, based on a target queue position of the target template segment in the template tag sequence, a target tag sequence formed by segment attribute tags corresponding to the target candidate video segments, and determine, according to each target candidate video segment associated with the target tag sequence, a video material segment that satisfies a segment matching condition. And the target label sequence formed by the segment attribute labels of the video material segments is the same as the template label sequence.
It is to be understood that the similarity analysis may represent a scene similarity between the candidate video segment and the target template segment: the candidate video segment is input into the third network model to obtain the candidate feature vector corresponding to the candidate video segment, the target template segment is input into the third network model to obtain the target feature vector corresponding to the target template segment, and a vector distance between the candidate feature vector and the target feature vector is calculated, so as to obtain the similarity between the candidate video segment and the target template segment (i.e., the similarity threshold); considering that the third network model is a scene recognition model, the similarity here may represent the scene similarity. The similarity analysis may also represent the degree of far and near view similarity between the candidate video segment and the target template segment, or the degree of character similarity between the candidate video segment and the target template segment.
For example, the target template segment may be input into the third network model to obtain a target feature vector of the target template segment, and assuming that there are 2 candidate video segments, the 2 candidate video segments may specifically include: candidate video segment 1 and candidate video segment 2, wherein the 2 candidate video segments are input into the third network model, and candidate feature vector 1 of candidate video segment 1 and candidate feature vector 2 of candidate video segment 2 are obtained. After the vector distances between the target feature vectors and the 2 candidate feature vectors are calculated, if the distance between the target feature vector and the candidate feature vector 2 is the minimum, the similarity threshold between the target template segment and the candidate video segment 2 is the maximum similarity threshold, and the candidate video segment 2 corresponding to the candidate feature vector 2 may be used as the target candidate video segment matched with the target template segment. The similarity analysis may also represent a time length relationship between the candidate video segment and the target template segment, and the calculation method of the similarity analysis is not particularly limited in the present application.
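For ease of understanding, the following is a combined sketch, for illustration only, of the screening in step S103: for each template segment, taken in the order of the template tag sequence, candidate video segments with a matching segment attribute tag are gathered, and the candidate whose feature vector lies closest to the template segment's feature vector is chosen as the target candidate video segment. The data shapes ('tag' and 'clip' fields) and the embedding function are assumptions.

```python
# Illustrative matching of template segments to video segments; embed_fn is an
# assumed function returning a NumPy feature vector for a clip.
import numpy as np

def build_material_sequence(template_segments, video_segments, embed_fn):
    chosen = []
    for template in template_segments:               # preserves queue positions
        candidates = [v for v in video_segments if v["tag"] == template["tag"]]
        if not candidates:
            continue   # no video segment satisfies the matching condition here
        t_vec = embed_fn(template["clip"])
        distances = [np.linalg.norm(t_vec - embed_fn(c["clip"])) for c in candidates]
        chosen.append(candidates[int(np.argmin(distances))])
    return chosen   # the target tag sequence mirrors the template tag sequence
```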
For convenience of understanding, please refer to fig. 8a and fig. 8b, where fig. 8a is a schematic view of a scene for performing video analysis on a template video according to an embodiment of the present application, and fig. 8b is a schematic view of a scene for performing video analysis on a target video according to an embodiment of the present application. After the template video is subjected to video analysis, N template segments shown in fig. 8a may be obtained, where N may be a positive integer greater than 1, for example, N is equal to 4, and then 4 template segments may include: template segment 80a, template segment 80b, template segment 80c, and template segment 80 d. The template attribute tag corresponding to the template fragment 80a is { long shot }, the template attribute tag corresponding to the template fragment 80b is { close-up character }, the template attribute tag corresponding to the template fragment 80c is { close-up character }, and the template attribute tag corresponding to the template fragment 80d is { close-up object }. After video analysis is performed on the target video, M video segments shown in fig. 8b can be obtained, where M may be a positive integer greater than 1, for example, M is equal to 8, and then the 8 video segments may include: video segment 800a, video segment 800b, video segment 800c, video segment 800d, video segment 800e, video segment 800f, video segment 800g, and video segment 800 h. The segment attribute tag corresponding to the video segment 800a is { long shot }, the segment attribute tag corresponding to the video segment 800b is { short shot of character }, the segment attribute tag corresponding to the video segment 800c is { long shot }, the segment attribute tag corresponding to the video segment 800d is { short shot of character }, the segment attribute tag corresponding to the video segment 800e is { short shot of character }, the segment attribute tag corresponding to the video segment 800f is { long shot }, the segment attribute tag corresponding to the video segment 800g is { short shot of object }, and the segment attribute tag corresponding to the video segment 800h is { short shot of character }.
It is to be understood that, if the template fragment 80a is obtained from the 4 template fragments in fig. 8a as the target template fragment (e.g., target template fragment 1), the queue position of target template fragment 1 may be position 1 (i.e., the target queue position is position 1), and the template attribute tag of the target template fragment may be { long shot } (i.e., the target template attribute tag is { long shot }). Among the 8 video segments in fig. 8b, the segment attribute tag matching the target template attribute tag is { long shot }, and the video segments corresponding to { long shot } are video segment 800a, video segment 800c, and video segment 800f, so the candidate video segments corresponding to target template segment 1 are video segment 800a, video segment 800c, and video segment 800f. Further, after calculating the similarity thresholds between these 3 candidate video segments and target template segment 1, if the similarity threshold between video segment 800a and target template segment 1 is the maximum similarity threshold, video segment 800a is determined as the target candidate video segment (e.g., target candidate video segment 1) matching target template segment 1.
Similarly, it can be understood that, if the template fragment 80b is obtained from the 4 template fragments in fig. 8a as the target template fragment (e.g., target template fragment 2), the queue position of target template fragment 2 may be position 2 (i.e., the target queue position is position 2), and the template attribute tag of the target template fragment may be { close-up character } (i.e., the target template attribute tag is { close-up character }). If, among the 8 video segments in fig. 8b, the video segment whose segment attribute tag matches the target template attribute tag is screened out as video segment 800h, video segment 800h is determined as the target candidate video segment (e.g., target candidate video segment 2) matching target template segment 2.
Similarly, it is understood that, if the template fragment 80c is obtained from the 4 template fragments in fig. 8a as the target template fragment (e.g., target template fragment 3), the queue position of target template fragment 3 may be position 3 (i.e., the target queue position is position 3), and the template attribute tag of the target template fragment may be { close-up character } (i.e., the target template attribute tag is { close-up character }). Among the 8 video segments in fig. 8b, the video segments whose segment attribute tags match the target template attribute tag { close-up character } are screened out as video segment 800d and video segment 800e, so the candidate video segments corresponding to target template segment 3 are video segment 800d and video segment 800e. Further, after calculating the similarity thresholds between these 2 candidate video segments and target template segment 3, if the similarity threshold between video segment 800e and target template segment 3 is the maximum similarity threshold, video segment 800e is determined as the target candidate video segment (e.g., target candidate video segment 3) matching target template segment 3.
Similarly, it is understood that, if the template fragment 80d is obtained from the 4 template fragments in fig. 8a as the target template fragment (e.g., target template fragment 4), the queue position of target template fragment 4 may be position 4 (i.e., the target queue position is position 4), and the template attribute tag of the target template fragment may be { close-up object } (i.e., the target template attribute tag is { close-up object }). If, among the 8 video segments in fig. 8b, the video segment whose segment attribute tag matches the target template attribute tag { close-up object } is screened out as video segment 800g, video segment 800g is determined as the target candidate video segment (e.g., target candidate video segment 4) matching target template segment 4.
Therefore, if the target candidate video segment 1 corresponding to position 1 is video segment 800a, the target candidate video segment 2 corresponding to position 2 is video segment 800h, the target candidate video segment 3 corresponding to position 3 is video segment 800e, and the target candidate video segment 4 corresponding to position 4 is video segment 800g, the video material segment can be determined from video segment 800a, video segment 800h, video segment 800e, and video segment 800g based on position 1, position 2, position 3, and position 4. The template tag sequence is a sequence formed by the template attribute tags corresponding to the template fragments and can be expressed as { long shot, close-up character, close-up character, close-up object }; the target tag sequence is a sequence formed by the segment attribute tags corresponding to the video segments matched with the template segments, and the target tag sequence can likewise be expressed as { long shot, close-up character, close-up character, close-up object }.
It can be understood that the target template segment 1 may have a similar video playing effect as the target candidate video segment 1, the target template segment 2 may have a similar video playing effect as the target candidate video segment 2, the target template segment 3 may have a similar video playing effect as the target candidate video segment 3, and the target template segment 4 may have a similar video playing effect as the target candidate video segment 4, so that the video material segments may have the same video playing effect as the above template segments.
It should be understood that when determining the video material segments according to each target candidate video segment associated with the target tag sequence, the server may perform video splicing processing on each target candidate video segment associated with the target tag sequence to obtain spliced video data associated with the N template segments. Further, the server can acquire template audio data associated with the N template segments, and perform audio and video merging processing on the template audio data and the spliced video data through the audio and video synthesis component to obtain video material segments meeting the segment matching conditions.
The tool for performing video splicing processing on each target candidate video segment and the tool for performing audio and video merging processing on the template audio data and the spliced video data may be the same tool, namely the audio and video synthesis component; the audio and video synthesis component may be the ffmpeg tool, or may be another third-party software tool with audio and video encapsulation capability, and such components are not enumerated one by one here.
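For ease of understanding, the following is a hedged sketch of driving the ffmpeg tool mentioned above from Python to splice the target candidate video segments and then merge in the template audio data; the file names are illustrative, and the exact codec options would depend on the source material.

```python
# Illustrative splicing (concat demuxer) and audio replacement with ffmpeg;
# paths and options are assumptions for illustration.
import subprocess

def splice_and_dub(segment_paths, template_audio_path, output_path):
    # 1) Concatenate the selected target candidate video segments.
    with open("segments.txt", "w") as f:
        for path in segment_paths:
            f.write(f"file '{path}'\n")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "segments.txt", "-c", "copy", "spliced.mp4"], check=True)
    # 2) Replace the audio track with the template audio data.
    subprocess.run(["ffmpeg", "-y", "-i", "spliced.mp4", "-i", template_audio_path,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
                    output_path], check=True)
```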
And step S104, pushing the video data and the video material segments to an application client corresponding to the target user so that the application client outputs the video data and the video material segments.
The application client can play the video data and the video material segments in the application display interface after receiving the video data and the video material segments. Optionally, when the application client plays the video data, the application client may be further configured to display a thumbnail of each video material segment, where a specific implementation form of the application client outputting the video material segments is not limited.
In this embodiment of the application, when the server obtains video data of a certain video requested by a target user, the server may perform video analysis on the video data to obtain one or more video segments of the video data. It can be understood that the video analysis related to the embodiment of the present application mainly includes: video split-mirror and attribute analysis. The video segmentation mainly means that the video data can be divided into one or more segment segments, so that the server can further perform attribute analysis on the segment content of each segment to obtain the segment attribute tag of each segment, and thus the segment segments with the segment attribute tag are collectively referred to as the aforementioned video segments, and it should be understood that one video segment may correspond to one segment attribute tag. Further, when the user portrait of the target user is obtained, the server can quickly determine the video template associated with the target user according to the user portrait, and then can intelligently screen the video clips meeting the clip matching conditions from the video clips when obtaining the template clips (for example, popular short videos) mapped by the video template and the template tag sequences corresponding to the template clips, so that the screened video clips meeting the clip matching conditions can be used as the video material clips of the target video. It can be understood that, the target tag sequence formed by the segment attribute tags of the video material segment may be the same as the template tag sequence, so as to ensure that the video material segment and the template segment have the same video playing effect. Then, the server can intelligently push the video data and the video material segments to an application client corresponding to the target user, so that the application client can output the video data and the video material segments. Therefore, in the embodiment of the application, one or more video clips carrying the clip attribute tags can be obtained quickly through video analysis (for example, video split mirror, attribute analysis and the like). Therefore, for the video clips, when one or more video templates are intelligently determined according to the user portrait, the video clips can be intelligently screened according to the template tag sequences of the video templates, so that the video clips with the video playing effect similar to that of the video templates can be quickly obtained, and further the video material clips can be quickly synthesized (for example, short videos capable of being pushed to target users can be quickly obtained).
Further, please refer to fig. 9, where fig. 9 is a schematic flowchart of a video data processing method according to an embodiment of the present application. As shown in fig. 9, the method may be executed by an application client and a server, where the application client may be an application client running in the user terminal X in the embodiment corresponding to fig. 2, and the server may be a server in the embodiment corresponding to fig. 2. Wherein the method may comprise the steps of:
step S201, an application client can respond to the play operation executed by a target user aiming at a target video, generate a video play request for requesting to play the target video, and send the video play request to a server;
the video playing request can carry a video identifier of a target video, and the video identifier is used for indicating a server to acquire video data of the target video requested to be played by a target user. The playing operation may include a contact operation such as clicking, long-pressing, sliding, and the like, and may also include a non-contact operation such as voice, gesture, and the like, which is not limited herein.
Step S202, a server acquires video data of a target video requested by a target user, and performs video analysis on the video data to obtain a video segment of the video data;
step S203, the server obtains a user portrait of a target user, determines a video template associated with the target user based on the user portrait, and obtains a template fragment mapped by the video template and a template label sequence corresponding to the template fragment;
step S204, the server screens video clips meeting the clip matching conditions from the video clips based on the template clips and the template label sequence, and takes the video clips meeting the clip matching conditions as video material clips of the target video;
step S205, the server pushes the video data and the video material segments to an application client corresponding to a target user;
for ease of understanding, please refer to fig. 10, fig. 10 is a schematic flowchart illustrating a process of generating a video material segment according to an embodiment of the present application. As shown in fig. 10, when acquiring a highlight short video (i.e., a template video), the server may perform video analysis on the highlight short video to obtain one or more video clips of the highlight short video, and may further use the one or more video clips of the highlight short video as the template clip. It can be understood that the video analysis related to the embodiment of the present application mainly includes: video split-mirror and attribute analysis. The video segmentation mainly refers to dividing the video data of the wonderful short video into one or more segmentation segments, so that the server can further perform attribute analysis (i.e., segmentation information extraction) on the segment content of each segmentation segment to obtain a template attribute tag (i.e., a scene tag, a character tag (i.e., a role tag) and a distance mirror tag shown in fig. 10) of each segmentation segment, so that the segmentation segments with the template attribute tag are collectively referred to as the template segments, and a hot highlight sequence (i.e., a shot sequence record) can be determined based on the template attribute tag. It should be understood that one template fragment may correspond to one template attribute tag. The hot highlight sequence 1 in the highlight sequence library shown in fig. 10 may be a template attribute tag corresponding to the template fragment 1, the hot highlight sequence 2 may be a template attribute tag corresponding to the template fragment 2, and the hot highlight sequence 3 may be a template attribute tag corresponding to the template fragment 3.
It should be understood that the embodiments of the present application may collectively refer to the template segment of the template video (i.e., the highlight short video described above), the template tag sequence of the template segment, and the template audio data (i.e., music) as the video template.
As shown in fig. 10, when a tv play (i.e., a target video) is obtained, the server may perform video mirroring and attribute analysis on the tv play to obtain one or more video segments of the tv play. It should be understood that one video clip may correspond to one clip attribute tag. Thus, the server can obtain one or more hot highlight sequences (namely sequence sampling) from the highlight sequence library, further determine template segments and template label sequences corresponding to the template segments according to the selected hot highlight sequences, screen and sort the video segments of the target video to obtain the screened video segments (namely, the split segment mirror sequence arrangement based on material matching), and further intelligently generate video material segments similar to the television series according to spliced video data formed by the screened video segments and template audio data of the template segments.
Highlight short videos may be extracted from each short video platform to obtain the video templates corresponding to the highlight short videos, so that video templates can be accumulated continuously over multiple days, and one or more video material segments of corresponding styles can be generated for the television series according to the video templates, enriching the styles of the finally generated video material segments. The television series can thus yield video material segments of various styles according to the video templates, which can be used for personalized selection ("different content for different users") in a video recommendation scenario; for each video template, video analysis and video matching can be performed on the highlight short video and the television series through deep learning and image analysis algorithms, achieving the goal of automatic analysis. In addition, for a new television series, its analysis can be completed with only limited migration effort, so the generation cost for the new television series is low and the approach migrates well.
It should be understood that, for the specific process of the server performing video mirroring and attribute analysis on the television series, reference may be made to the description of step S101 above, and details will not be further described here. It should be understood that, for the specific process of the server performing video mirroring and attribute analysis on the highlight short video, reference may be made to the description of performing video mirroring and attribute analysis on the television series, and details will not be further described here.
And step S206, the application client outputs the video data and the video material segments in the application display interface.
Specifically, the application client may receive video data returned by the server based on the video playing request and a video material segment associated with the target video, and may determine a video playing interface for playing the video data in an application display interface of the application client, so that the video data may be played in the video playing interface. Further, the application client may respond to a trigger operation for the application display interface, and play the corresponding video material segment in the application display interface of the application client. The triggering operation may include a contact operation such as a click, a long press, a slide, and the like, and may also include a non-contact operation such as a voice, a gesture, and the like, which is not limited herein. Optionally, it may be understood that, after the application client acquires the video material segments, a thumbnail of each video material segment may also be displayed in the application display interface, or an animation of each video material segment may be dynamically played in the application display interface, where a specific display form of the video material segments is not limited here.
For ease of understanding, please refer to fig. 11, which is a schematic flowchart of front-end and back-end interaction provided by an embodiment of the present application. The application client may run on the front end B shown in fig. 11. A play operation performed by the target user for a target video (e.g., a video the target user is interested in) in the application client of the front end B amounts to inputting the target video to the front end B. The server (i.e., the back end) may then generate, based on the video template, one or more video material segments associated with the target video (i.e., back-end generation), and return the video data of the target video together with the one or more associated video material segments (e.g., a featurette of the video) to the front end B, so that the video data and the video material segments returned by the server are displayed in the application display interface of the front end B. It should be appreciated that the video template here may be determined by the server based on the user portrait of the target user.
It is to be understood that, as shown in fig. 11, the front end A may be another user terminal corresponding to a video editor. After video analysis is performed on the highlight short video input by the front end A, the video editor may select one or more video segments from the video segments obtained by the video analysis as template segments, and may further determine a video template based on the template segments (i.e., mine the highlight video template). In other words, the front end A receives the highlight short video as input and uploads the corresponding video template (i.e., the highlight video template) to the server for storage (i.e., back-end storage).
It should be understood that the front end B and the front end A may also be the same user terminal, that is, the same terminal may serve both as the input side of the highlight short video and as the input side of the target video.
For ease of understanding, please refer to fig. 12a, which is a schematic diagram of a scene of outputting video material segments according to an embodiment of the present application. As shown in fig. 12a, the application display interface 120a may be the application display interface in the embodiment corresponding to fig. 2. The application display interface 120a may include a video playing interface 1 for playing the target video, and may further include a short video recommendation list (e.g., short video recommendation list 1) for displaying or playing video material segments, where the short video recommendation list 1 includes at least the video material segments associated with the target video, and these may be the video material segments associated with the target video in the first service database. After the target user performs a trigger operation (e.g., the sliding operation shown in fig. 12a) on the application display interface 120a, the application client may display or play the video material segments in the short video recommendation list 1 in the highlight recommendation part of the application display interface 120b. Optionally, when the application client plays the target video in the video playing interface 1, it may also traverse and play (or synchronously play) the video material segments in the short video recommendation list 1. As shown in fig. 12a, the short video recommendation list 1 may include N video material segments associated with the target video, where the N video material segments may be the 3 video material segments shown in fig. 12a, namely video material segment A1, video material segment A2, and video material segment A3.
Optionally, after the target user performs a trigger operation (e.g., a click operation) on the service recommendation control in the application display interface 120a, the application client may display or play the video material segments in the short video recommendation list 1, such as video material segment A1, video material segment A2, and video material segment A3, in the highlight recommendation part of the application display interface 120b.
For ease of understanding, please refer to fig. 12b, which is a schematic diagram of a scene of updating video material segments according to an embodiment of the present application. As shown in fig. 12b, when the target user performs a trigger operation (e.g., a click operation) on the video material segment A1 in fig. 12a, the server may return the video data of this video material segment A1 (e.g., video data J) and one or more video material segments associated with this video data J (e.g., video material segment C1, video material segment C2, and video material segment C3) to the application client, so that the video data J is played in the application client. Optionally, when playing the video data J of the video material segment A1, the application client may also display the received video material segments together, so as to obtain the application display interface 120c.
The application display interface 120c may include a video playing interface 2 for playing the video data J, and may further include a short video recommendation list (e.g., short video recommendation list 2) for presenting video material segments, where the short video recommendation list 2 includes at least the video material segments associated with the video data J. After the target user performs a trigger operation (e.g., the click operation shown in fig. 12b) on the service recommendation control in the application display interface 120c, the application client may display or play the video material segments in the short video recommendation list 2 in the highlight recommendation part of the application display interface 120d, where these video material segments may be the video material segments in the second service database that share the same video template as the video material segment A1. As shown in fig. 12b, the short video recommendation list 2 may include M video material segments associated with the video data J, where the M video material segments may be the 3 video material segments shown in fig. 12b, namely video material segment C1, video material segment C2, and video material segment C3.
Optionally, after the target user performs a trigger operation (e.g., a sliding operation) on the application display interface 120c, the application client may display or play the video material segments in the short video recommendation list 2, such as video material segment C1, video material segment C2, and video material segment C3, in the highlight recommendation part of the application display interface 120d.
It should be understood that, after the target user has finished viewing the video material segment A1 in the application client, the video material segments in the short video recommendation list 2 may be played automatically for the target user. For example, when the target user finishes viewing the video material segment A1, the server may further output the video material segment C1 among the video material segments in the short video recommendation list 2 to the application client, so that the video material segment C1 is played automatically in the application client. Optionally, when the video data played in the video playing interface 1 of the application client is switched to the video material segment A1, the application client may also record the current playing progress of the target video (e.g., time T), so that after the video material segment A1 finishes playing, the target video continues to play from time T.
In addition, the application client may dynamically adjust the positions of the video material segments in the short video recommendation list in real time according to the current playing progress of the target video, so as to recommend differently ranked video material segments to the target user. For example, if all the video segments constituting a video material segment occur before the current playing progress, that is, all of them have already been viewed at the current time, that video material segment may be placed at the front of the short video recommendation list, thereby realizing scenario playback. Optionally, the application client may also sort the video material segments in the recommendation list according to the number of times each video material segment has been played on the application clients of other user terminals; a higher total play count indicates a higher-quality segment, which may therefore be preferentially recommended to the target user, that is, placed at the front of the short video recommendation list. A possible ranking rule is sketched below.
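As a hedged illustration of the ranking behavior just described, the snippet below sorts a recommendation list so that fully watched material segments (scenario playback) come first and the remainder is ordered by total play count. The field names shot_end_times and play_count are assumptions made for this sketch; the embodiment does not prescribe a data layout.

```python
def rank_material_segments(segments, current_progress_seconds):
    """Sort a short video recommendation list; segments is a list of dicts."""
    def sort_key(seg):
        # Fully watched segments (all source shots end before the current
        # playing progress) are ranked first; then a higher play count wins.
        fully_watched = all(end <= current_progress_seconds
                            for end in seg["shot_end_times"])  # assumed field
        return (0 if fully_watched else 1, -seg["play_count"])  # assumed field
    return sorted(segments, key=sort_key)
```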
Therefore, in the embodiment of the present application, one or more video segments carrying segment attribute tags can be obtained quickly by performing video analysis (e.g., video split-mirror analysis, attribute analysis, etc.) on the video data. On this basis, when one or more video templates are accurately determined according to the user portrait, the video segments can be automatically screened against the template tag sequence of each video template, so that video segments whose playing effect is similar to that of the video template can be obtained quickly, and the video material segments can be synthesized quickly (for example, short videos that can be displayed to the target user can be obtained quickly).
Further, please refer to fig. 13, which is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus 1 may include: a segment generation module 30, a template acquisition module 40, a material determination module 50 and a data sending module 60; further, the video data processing apparatus 1 may also include: a request receiving module 10 and a data searching module 20;
the segment generation module 30 is configured to obtain the video data of the target video requested by the target user, perform video analysis on the video data, and obtain the video segments of the video data; one video segment corresponds to one segment attribute tag;
the fragment generation module 30 includes: a model obtaining unit 301, a split mirror obtaining unit 302, a label determining unit 303 and a fragment determining unit 304;
a model acquisition unit 301 configured to acquire video data of a target video requested by a target user and a network identification model associated with the video data;
a split-mirror obtaining unit 302, configured to perform split-mirror processing on the video sequence corresponding to the video data through a video segmentation component, so as to obtain the split-mirror segments associated with the video sequence;
wherein, the split-mirror obtaining unit 302 includes: a component acquiring subunit 3021, an image matching subunit 3022, a split-mirror creating subunit 3023, a matching completion subunit 3024 and a split-mirror determining subunit 3025;
the component acquiring subunit 3021 is configured to acquire a video segmentation component for performing split-mirror processing on the video sequence of the video data, acquire, through the video segmentation component, a first video frame serving as a cluster centroid in the video sequence, and create split-mirror cluster information of the split-mirror cluster to which the first video frame belongs;
an image matching subunit 3022, configured to determine the video frames other than the first video frame in the video sequence as second video frames, sequentially acquire the second video frames based on a polling mechanism, and determine the image similarity between each second video frame and the first video frame;
a split-mirror creating subunit 3023, configured to, if the image similarity between the first video frame and a second video frame is greater than or equal to a clustering threshold, divide that second video frame into the split-mirror cluster to which the first video frame belongs;
a matching completion subunit 3024, configured to, if the image similarity between the first video frame and a second video frame is smaller than the clustering threshold, update the first video frame with that second video frame, create split-mirror cluster information of the split-mirror cluster to which the updated first video frame belongs, and sequentially perform image similarity matching between the updated first video frame and the unmatched second video frames, until all video frames in the video sequence have completed image similarity matching, so as to obtain the split-mirror cluster information of the split-mirror clusters to which the video frames in the video sequence belong;
a split-mirror determining subunit 3025, configured to determine the split-mirror segments associated with the video sequence based on the split-mirror cluster information of the split-mirror clusters to which the video frames in the video sequence belong.
For specific implementation manners of the component acquiring subunit 3021, the image matching subunit 3022, the split-mirror creating subunit 3023, the matching completion subunit 3024 and the split-mirror determining subunit 3025, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be repeated here.
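For readers who prefer code, the following is one possible realization of the single-pass shot clustering performed by subunits 3021 to 3025, assuming cosine similarity over raw frames as the image similarity measure; the actual component may use any other similarity measure and is not limited to this sketch.

```python
import numpy as np

def frame_similarity(a, b):
    # Cosine similarity between two frames; a stand-in for the actual measure.
    a = a.astype(np.float32).ravel()
    b = b.astype(np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def split_into_shots(frames, cluster_threshold=0.9):
    """Cluster consecutive frames into split-mirror segments (shots)."""
    shots = []                 # each entry: frame indices of one split-mirror cluster
    centroid = frames[0]       # the first video frame serves as the cluster centroid
    current_cluster = [0]
    for idx in range(1, len(frames)):             # poll the remaining (second) video frames
        if frame_similarity(centroid, frames[idx]) >= cluster_threshold:
            current_cluster.append(idx)           # join the current split-mirror cluster
        else:
            shots.append(current_cluster)         # close the previous cluster
            centroid = frames[idx]                # updated first video frame (new centroid)
            current_cluster = [idx]
    shots.append(current_cluster)
    return shots
```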
The label determining unit 303 is configured to input the split mirror segments into the network identification model, and perform attribute analysis on the split mirror segments by using the network identification model to obtain segment attribute labels corresponding to the split mirror segments;
wherein the network identification model comprises at least: a first network model having a first attribute tag extraction function, a second network model having a second attribute tag extraction function, and a third network model having a third attribute tag extraction function;
the tag determination unit 303 includes: a first analysis sub-unit 3031, a face detection sub-unit 3032, a second analysis sub-unit 3033, a third analysis sub-unit 3034 and a label analysis sub-unit 3035;
a first analysis subunit 3031, configured to input the partial mirror segments into the first network model, perform far and near view analysis on each partial mirror segment through the first network model to obtain the far and near view label of each partial mirror segment, use the far and near view labels of the partial mirror segments as the first attribute labels output by the first network model, and use the partial mirror segments with the first attribute labels as the first class of partial mirror segments;
the face detection subunit 3032 is configured to input the first class of partial mirror segments into the second network model, and perform face detection on each partial mirror segment in the first class of partial mirror segments by using the second network model to obtain a face detection result;
a second analysis subunit 3033, configured to, if the face detection result indicates that a face of the target role exists in the first class of partial mirror segments, take a partial mirror segment corresponding to the face of the target role as a second class of partial mirror segment in the first class of partial mirror segment, determine, through the second network model, a role label to which the target role in the second class of partial mirror segment belongs, and determine, as a second attribute label of the second class of partial mirror segment, the role label to which the target role belongs; the target role is one or more roles in the target video;
a third analyzing subunit 3034, configured to determine, in the first class of partial mirror segments, partial mirror segments other than the second class of partial mirror segments as third class of partial mirror segments, input the third class of partial mirror segments into a third network model, and perform scene detection on each partial mirror segment in the first class of partial mirror segments by using the third network model, to obtain a third attribute tag of the third class of partial mirror segments;
the label analyzing subunit 3035 is configured to determine, according to the first attribute label of the first class of mirror segments, the second attribute label of the second class of mirror segments, and the third attribute label of the third class of mirror segments, a segment attribute label corresponding to each of the mirror segments.
For specific implementation manners of the first analyzing subunit 3031, the face detecting subunit 3032, the second analyzing subunit 3033, the third analyzing subunit 3034 and the tag analyzing subunit 3035, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
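A compact way to picture the three-stage labeling described by subunits 3031 to 3035 is the cascade below. The three model arguments are assumed to be simple callables returning a view label, detected face names, and a scene label respectively; they stand in for the first, second and third network models and are not a concrete library API.

```python
def label_shot(shot_frames, view_model, face_model, scene_model, known_characters):
    """Return the segment attribute tags of one split-mirror segment."""
    tags = {"view": view_model(shot_frames)}        # first attribute label (far/near view)
    detected = face_model(shot_frames)              # face detection result (names)
    characters = [name for name in detected if name in known_characters]
    if characters:
        tags["character"] = characters              # second attribute label (role label)
    else:
        tags["scene"] = scene_model(shot_frames)    # third attribute label (scene label)
    return tags
```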
And a section determining unit 304, configured to determine the split section with the section attribute tag as a video section of the video data.
For specific implementation manners of the model obtaining unit 301, the split mirror obtaining unit 302, the label determining unit 303, and the segment determining unit 304, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
The template acquisition module 40 is used for acquiring a user portrait of a target user, determining a video template associated with the target user based on the user portrait, and acquiring a template fragment mapped by the video template and a template tag sequence corresponding to the template fragment;
wherein, the template obtaining module 40 includes: a behavior extraction unit 401, a behavior analysis unit 402, a template analysis unit 403;
a behavior extracting unit 401, configured to obtain a behavior log table of a target user, and extract behavior data information associated with the target user from the behavior log table;
a behavior analysis unit 402, configured to perform user portrait analysis on the behavior data information to obtain a user portrait for representing the target user, and determine the video template associated with the target user based on the user portrait; the video template carries a template tag sequence formed by the template attribute tags of the template fragments; the template fragments are obtained by performing video analysis on the template video; the template video is determined by the behavior data information;
the template analysis unit 403 is configured to obtain a template segment mapped by the video template and a template tag sequence corresponding to the template segment.
For specific implementation manners of the behavior extracting unit 401, the behavior analyzing unit 402, and the template analyzing unit 403, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, which will not be described herein again.
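As a rough, assumption-laden sketch of how a behavior log could be turned into a user portrait and then into a template choice: the log entry schema (action, style) and the style field on templates are invented for illustration only.

```python
from collections import Counter

def build_user_portrait(behavior_log):
    # Count the style tags of videos the user played, liked or shared.
    styles = Counter(entry["style"] for entry in behavior_log
                     if entry["action"] in ("play", "like", "share"))
    return {"preferred_styles": [style for style, _ in styles.most_common(3)]}

def select_video_templates(user_portrait, template_library, k=3):
    # Prefer templates whose style appears in the portrait's preferred styles.
    preferred = set(user_portrait["preferred_styles"])
    ranked = sorted(template_library,
                    key=lambda tpl: tpl["style"] in preferred, reverse=True)
    return ranked[:k]
```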
The material determining module 50 is configured to screen, based on the template segments and the template tag sequences, video segments that meet segment matching conditions from among the video segments, and use the video segments that meet the segment matching conditions as video material segments of the target video; a target label sequence formed by the segment attribute labels of the video material segments is the same as the template label sequence;
wherein the number of the template fragments is N, and N is a positive integer greater than 1; the template tag sequence comprises N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template fragment;
the material determination module 50 includes: a label determining unit 501, a label screening unit 502, a segment matching unit 503 and a material generating unit 504;
a tag determining unit 501, configured to obtain a target template segment from the N template segments, determine a queue position of the target template segment as a target queue position in a template tag sequence, and determine a template attribute tag corresponding to the target queue position as a target template attribute tag;
a tag screening unit 502, configured to screen a segment attribute tag matched with the target template attribute tag from segment attribute tags corresponding to video segments, and determine a video segment corresponding to the screened segment attribute tag as a candidate video segment;
a segment matching unit 503, configured to perform similarity analysis on each candidate video segment among the candidate video segments and the target template segment to obtain a similarity threshold between each candidate video segment and the target template segment, determine the maximum similarity threshold among the similarity thresholds, and determine the candidate video segment corresponding to the maximum similarity threshold as the target candidate video segment matched with the target template segment;
the material generating unit 504 is configured to determine a target tag sequence formed by segment attribute tags corresponding to target candidate video segments based on target queue positions of the target template segments in the template tag sequence, and determine video material segments meeting a segment matching condition according to each target candidate video segment associated with the target tag sequence.
The material generation unit 504 includes: a video stitching sub-unit 5041, a material composition sub-unit 5042;
a video splicing subunit 5041, configured to perform video splicing processing on each target candidate video segment associated with the target tag sequence to obtain spliced video data associated with N template segments;
and the material synthesis subunit 5042 is configured to obtain template audio data associated with the N template segments, and perform audio and video merging processing on the template audio data and the spliced video data through the audio and video synthesis component to obtain video material segments meeting the segment matching condition.
For a specific implementation manner of the video splicing sub-unit 5041 and the material synthesizing sub-unit 5042, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, which will not be repeated herein.
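One way to realize the splicing and audio/video merging of subunits 5041 and 5042 is to shell out to the ffmpeg command line, as sketched below; this is an illustrative substitute, not the audio/video synthesis component of the embodiment, and it assumes the screened clips share a common codec so that stream copying is possible.

```python
import os
import subprocess
import tempfile

def splice_and_merge_audio(clip_paths, template_audio_path, out_path):
    """Concatenate the screened clips, then mux the template audio over the result."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{p}'\n" for p in clip_paths)
        concat_list = f.name
    try:
        spliced = out_path + ".video_only.mp4"
        # Video splicing: concat demuxer with stream copy (same codec assumed).
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", concat_list, "-c", "copy", spliced], check=True)
        # Audio/video merging: keep the spliced video, take the template audio.
        subprocess.run(["ffmpeg", "-y", "-i", spliced, "-i", template_audio_path,
                        "-map", "0:v", "-map", "1:a", "-c:v", "copy",
                        "-c:a", "aac", "-shortest", out_path], check=True)
    finally:
        os.remove(concat_list)
    return out_path
```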
For specific implementation manners of the tag determining unit 501, the tag screening unit 502, the segment matching unit 503, and the material generating unit 504, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, which will not be described herein again.
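The tag-based screening performed by units 501 to 504 can be pictured as follows: for every queue position of the template tag sequence, keep only the clips whose segment attribute tag matches the target template attribute tag, then take the candidate most similar to that template segment. The clip_similarity argument is a placeholder for whatever similarity analysis is used in practice.

```python
def match_segments(tagged_shots, template_tag_sequence, template_segments,
                   clip_similarity=lambda clip, tmpl: 0.0):  # placeholder measure
    ordered = []
    for position, target_tag in enumerate(template_tag_sequence):
        # Screening: candidate video segments whose tag matches this queue position.
        candidates = [clip for clip, tag in tagged_shots if tag == target_tag]
        if not candidates:
            return None  # no clip satisfies this position of the tag sequence
        # Matching: pick the candidate most similar to the target template segment.
        best = max(candidates,
                   key=lambda clip: clip_similarity(clip, template_segments[position]))
        ordered.append(best)
    return ordered
```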
And the data sending module 60 is configured to push the video data and the video material segments to an application client corresponding to the target user, so that the application client outputs the video data and the video material segments.
Optionally, the request receiving module 10 is configured to receive a video playing request sent by an application client; the video playing request is generated by the application client end responding to the playing operation executed by the target user aiming at the target video;
and the data searching module 20 is configured to extract a video identifier of the target video from the video playing request, search service video data corresponding to the target video in the video service database based on the video identifier, and use the searched service video data as video data of the target video in the application client.
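A minimal server-side counterpart of the request receiving module 10 and the data searching module 20 might look like the following; the request shape and the in-memory stand-in for the video service database are assumptions for this sketch.

```python
VIDEO_SERVICE_DB = {
    "tv_play_001": {"title": "Episode 1", "stream_url": "https://example.com/ep1"},
}  # stand-in for the video service database

def handle_play_request(play_request):
    video_id = play_request["video_id"]           # extracted video identifier
    video_data = VIDEO_SERVICE_DB.get(video_id)   # searched service video data
    return video_data
```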
For specific implementation manners of the segment generation module 30, the template acquisition module 40, the material determination module 50 and the data sending module 60, reference may be made to the description of step S101 to step S104 in the embodiment corresponding to fig. 3, which will not be repeated here. Optionally, for a specific implementation manner of the request receiving module 10 and the data searching module 20, reference may be made to the description of step S201 and step S206 in the embodiment corresponding to fig. 9, which will not be repeated here. In addition, the beneficial effects of the same method are not described in detail.
Further, please refer to fig. 14, fig. 14 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus 2 may include: a data acquisition module 70, a data output module 80;
the data acquisition module 70 is configured to, in response to a play operation performed by the target user for the target video in the application client, acquire the video data of the target video and the video material segments associated with the target video from the server; the video material segments are obtained by the server by screening the video segments of the target video according to the template segments of the video template and the template tag sequences corresponding to the template segments; the video segments are obtained by the server by performing video analysis on the video data; the video template is determined by the server based on the user portrait of the target user; a target tag sequence formed by the segment attribute tags of the video material segments is the same as the template tag sequence;
the data acquiring module 70 includes: a request transmitting unit 701, a data receiving unit 702;
a request sending unit 701, configured to respond to a play operation performed by a target user for a target video in an application client, generate a video play request for requesting to play the target video, and send the video play request to a server; the video playing request carries a video identifier of a target video; the video identification is used for indicating the server to acquire video data of a target video requested to be played by a target user;
a data receiving unit 702, configured to receive the video data returned by the server based on the video playing request, and the video material segments associated with the target video; the video material segments are obtained by the server, when the server determines a video template according to the user portrait of the target user, by performing video analysis and video matching on the video data according to the video template, wherein the user portrait is determined by the user behavior information of the target user in the application client.
For specific implementation of the request sending unit 701 and the data receiving unit 702, reference may be made to the description of step S201 in the embodiment corresponding to fig. 9, which will not be described herein again.
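On the client side, the request sending unit 701 and the data receiving unit 702 could be approximated with an ordinary HTTP round trip, as below; the endpoint path and JSON field names are illustrative assumptions, since the embodiment does not fix a transport protocol.

```python
import requests  # assumed HTTP transport

def request_video(server_url, video_id):
    # Generate a video playing request carrying the video identifier and send it.
    resp = requests.post(f"{server_url}/play", json={"video_id": video_id})
    resp.raise_for_status()
    payload = resp.json()
    # Receive the video data and the associated video material segments.
    return payload["video_data"], payload["material_segments"]
```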
And a data output module 80, configured to output the video data and the video material segments in an application display interface of the application client.
Wherein, the data output module 80 includes: a video playing unit 801 and a material output unit 802;
a video playing unit 801, configured to determine a video playing interface used for playing video data in an application display interface of an application client, and play the video data in the video playing interface;
and the material output unit 802 is configured to respond to a trigger operation for the application display interface, and play the video material segment in the application display interface.
For specific implementation of the video playing unit 801 and the material output unit 802, reference may be made to the description of step S207 in the embodiment corresponding to fig. 9, which will not be repeated here.
For a specific implementation manner of the data obtaining module 70 and the data outputting module 80, reference may be made to the description of step S201 and step S206 in the embodiment corresponding to fig. 9, which will not be described herein again. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 15, the computer device 2000 may include: a processor 2001, a network interface 2004 and a memory 2005. Furthermore, the computer device 2000 may further include: a user interface 2003 and at least one communication bus 2002. The communication bus 2002 is used to implement connection and communication between these components. The user interface 2003 may optionally include a standard wired interface and a wireless interface. Optionally, the network interface 2004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 2005 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory. Optionally, the memory 2005 may also be at least one storage device located remotely from the aforementioned processor 2001. As shown in fig. 15, the memory 2005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 2000 shown in fig. 15, the network interface 2004 may provide a network communication function; and the user interface 2003 is primarily used to provide an interface for user input; and processor 2001 may be used to invoke device control applications stored in memory 2005.
It should be understood that the computer device 2000 described in the embodiments of the present application may be a server or a user terminal, which will not be limited herein. It is understood that the computer device 2000 can be used to execute the description of the video data processing method in the embodiment corresponding to fig. 3 or fig. 9, and the description thereof is not repeated here. In addition, the beneficial effects of the same method are not described in detail.
Further, it should be noted here that an embodiment of the present application further provides a computer-readable storage medium, where the computer program executed by the aforementioned video data processing apparatus 1 or video data processing apparatus 2 is stored, and the computer program includes program instructions. When a processor executes the program instructions, the description of the video data processing method in the embodiment corresponding to fig. 3 or fig. 9 can be executed, and details are therefore not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiment of the computer-readable storage medium referred to in the present application, reference is made to the description of the method embodiments of the present application.
Further, please refer to fig. 16, fig. 16 is a diagram illustrating a video data processing system according to an embodiment of the present application. The video data processing system 3 may include a server 3a and a user terminal 3b, where the server 3a may be the video data processing apparatus 1 in the embodiment corresponding to fig. 13; the user terminal 3b may be the video data processing apparatus 2 in the embodiment corresponding to fig. 14. It is understood that the beneficial effects of the same method are not described in detail.
Further, it should be noted that: embodiments of the present application also provide a computer program product or computer program, which may include computer instructions, which may be stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor can execute the computer instruction, so that the computer device executes the description of the video data processing method in the embodiment corresponding to fig. 3 or fig. 9, which is described above, and therefore, the description of this embodiment will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer program product or the computer program referred to in the present application, reference is made to the description of the embodiments of the method of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application shall therefore still fall within the scope covered by the present application.

Claims (13)

1. A method of processing video data, comprising:
acquiring video data of a target video requested by a target user and a network identification model associated with the video data, and performing mirror segmentation processing on a video sequence corresponding to the video data through a video segmentation component to obtain a mirror segmentation segment associated with the video sequence;
inputting the partial lens segments into the network identification model, performing far and near view analysis on each partial lens segment in the partial lens segments through the network identification model to obtain far and near view labels of the partial lens segments, and taking the partial lens segments with the far and near view labels as first classification lens segments;
performing face detection on each of the first class of partial mirror segments by the network identification model, if a face detection result indicates that a face of a target role exists in the first class of partial mirror segments, taking a partial mirror segment corresponding to the face of the target role as a second class of partial mirror segment in the first class of partial mirror segments, and determining a role label to which the target role in the second class of partial mirror segment belongs by the network identification model; the target roles are one or more roles in the target video;
determining the segments except the second class of segments in the first class of segments as third class of segments, and performing scene detection on each segment in the first class of segments by the network identification model to obtain scene labels of the third class of segments;
determining a segment attribute label corresponding to each of the split-view segments according to the far and near scene labels, the role labels and the scene labels, and determining the split-view segments with the segment attribute labels as video segments of the video data; one video clip corresponds to one clip attribute tag;
acquiring a user portrait of the target user, determining a video template associated with the target user based on the user portrait, and acquiring a template fragment mapped by the video template and a template tag sequence corresponding to the template fragment;
screening video clips meeting the clip matching conditions from the video clips based on the template clips and the template tag sequences, and taking the video clips meeting the clip matching conditions as video material clips of the target video; a target label sequence formed by the segment attribute labels of the video material segments is the same as the template label sequence;
and pushing the video data and the video material segments to an application client corresponding to the target user so that the application client outputs the video data and the video material segments.
2. The method of claim 1, wherein prior to said obtaining video data of a target video requested by a target user, the method further comprises:
receiving a video playing request sent by an application client; the video playing request is generated by the application client in response to the playing operation executed by the target user aiming at the target video;
extracting the video identification of the target video from the video playing request, searching the service video data corresponding to the target video in a video service database based on the video identification, and taking the searched service video data as the video data of the target video in the application client.
3. The method of claim 1, wherein the performing, by a video slicing component, a mirror processing on a video sequence corresponding to the video data to obtain a mirror segment associated with the video sequence comprises:
acquiring a video segmentation component for performing mirror segmentation processing on a video sequence of the video data, acquiring a first video frame serving as a clustering center of mass in the video sequence through the video segmentation component, and creating mirror cluster information of a mirror cluster to which the first video frame belongs;
determining video frames except the first video frame in the video sequence as second video frames, sequentially acquiring the second video frames based on a polling mechanism, and determining the image similarity of the second video frames and the first video frames;
if the image similarity of the first video frame and the second video frame is greater than or equal to a clustering threshold, dividing the second video frame of which the image similarity is greater than or equal to the clustering threshold into a lens cluster to which the first video frame belongs;
if the image similarity of the first video frame and the second video frame is smaller than the clustering threshold, updating the first video frame with the second video frame of which the image similarity is smaller than the clustering threshold, creating split-mirror cluster information of the split-mirror cluster to which the updated first video frame belongs, and sequentially performing image similarity matching between the updated first video frame and the unmatched second video frames, so as to obtain, when all video frames in the video sequence have completed image similarity matching, the split-mirror cluster information of the split-mirror clusters to which the video frames in the video sequence belong;
determining a split-mirror segment associated with the video sequence based on split-mirror cluster information of a split-mirror cluster to which a video frame in the video sequence belongs.
4. The method of claim 1, wherein the obtaining a user representation of the target user, determining a video template associated with the target user based on the user representation, obtaining a template fragment mapped by the video template and a sequence of template tags corresponding to the template fragment, comprises:
acquiring a behavior log table of the target user, and extracting behavior data information associated with the target user from the behavior log table;
performing user portrait analysis on the behavior data information to obtain a user portrait for representing the target user, and determining a video template associated with the target user based on the user portrait; the video template carries a template label sequence formed by template attribute labels of the template fragments; the template fragment is obtained by performing video analysis on a template video; the template video is determined by the behavior data information;
and acquiring the template fragments mapped by the video template and the template label sequence corresponding to the template fragments.
5. The method of claim 1, wherein the number of template fragments is N, wherein N is a positive integer greater than 1; the template tag sequence comprises N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template fragment;
the screening, based on the template segment and the template tag sequence, of the video segments that meet segment matching conditions, and taking the video segments that meet the segment matching conditions as video material segments of the target video includes:
acquiring target template fragments from the N template fragments, determining queue positions of the target template fragments as target queue positions in the template label sequence, and determining template attribute labels corresponding to the target queue positions as target template attribute labels;
screening a segment attribute label matched with the target template attribute label from segment attribute labels corresponding to the video segments, and determining the video segment corresponding to the screened segment attribute label as a candidate video segment;
performing similarity analysis on each candidate video clip in the candidate video clips and the target template clip to obtain a similarity threshold of each candidate video clip and the target template clip, determining a maximum similarity threshold in the similarity thresholds, and determining the candidate video clip corresponding to the maximum similarity threshold as the target candidate video clip matched with the target template clip;
and determining a target label sequence formed by the segment attribute labels corresponding to the target candidate video segments based on the target queue positions of the target template segments in the template label sequence, and determining the video material segments meeting the segment matching conditions according to each target candidate video segment associated with the target label sequence.
6. The method of claim 5, wherein determining, according to each target candidate video segment associated with the target tag sequence, video material segments satisfying segment matching conditions comprises:
performing video splicing processing on each target candidate video segment associated with the target label sequence to obtain spliced video data associated with the N template segments;
and acquiring template audio data associated with the N template fragments, and performing audio and video combination processing on the template audio data and the spliced video data through an audio and video synthesis component to obtain video material fragments meeting the fragment matching conditions.
7. A method of processing video data, comprising:
responding to a playing operation executed by a target user aiming at a target video in an application client, and acquiring video data of the target video and video material segments associated with the target video from a server; the video material segments are obtained by screening the video segments of the target video by the server according to the template segments of the video template and the template label sequences corresponding to the template segments; the server is used for acquiring video data of a target video requested by a target user and a network identification model associated with the video data, and performing mirror segmentation processing on a video sequence corresponding to the video data through a video segmentation component to obtain a mirror segmentation segment associated with the video sequence; the server is further used for inputting the lens segments into the network identification model, performing far and near view analysis on each lens segment in the lens segments through the network identification model to obtain a far and near view label of the lens segment, and taking the lens segment with the far and near view label as a first type lens segment; the server is further configured to perform face detection on each of the first class of partial mirror segments by the network identification model, and if a face detection result indicates that a face of a target role exists in the first class of partial mirror segments, determine, in the first class of partial mirror segments, a partial mirror segment corresponding to the face of the target role as a second class of partial mirror segments, and determine, by the network identification model, a role label to which the target role in the second class of partial mirror segments belongs; the target roles are one or more roles in the target video; the server is further configured to determine, in the first class of partial mirror segments, partial mirror segments other than the second class of partial mirror segments as third class of partial mirror segments, and perform scene detection on each partial mirror segment in the first class of partial mirror segments by using the network identification model to obtain a scene tag of the third class of partial mirror segments; the server is further configured to determine, according to the far and near scene tags, the role tags and the scene tags, a segment attribute tag corresponding to each of the segment-based frames, and determine the segment-based frame with the segment attribute tag as a video segment of the video data; one video clip corresponds to one clip attribute tag; the video template is determined by the server based on a user representation of the target user; a target label sequence formed by the segment attribute labels of the video material segments is the same as the template label sequence;
and outputting the video data and the video material segments in an application display interface of the application client.
8. The method of claim 7, wherein the obtaining video data of the target video and video material segments associated with the target video from a server in response to a play operation performed by a target user for the target video in an application client comprises:
responding to a playing operation executed by a target user aiming at a target video in an application client, generating a video playing request for requesting to play the target video, and sending the video playing request to a server; the video playing request carries a video identifier of the target video; the video identifier is used for indicating the server to acquire video data of a target video requested to be played by the target user;
receiving the video data returned by the server based on the video playing request and the video material segments associated with the target video; the video material segments are obtained by the server, when the server determines a video template according to the user portrait of the target user, by performing video analysis and video matching on the video data according to the video template, wherein the user portrait is determined by the user behavior information of the target user in the application client.
9. The method of claim 7, wherein outputting the video data and the video material segments in an application display interface of the application client comprises:
determining a video playing interface for playing the video data in an application display interface of the application client, and playing the video data in the video playing interface;
and responding to the trigger operation aiming at the application display interface, and playing the video material segments in the application display interface.
10. A video data processing apparatus, comprising:
the segment generation module is used for acquiring video data of a target video requested by a target user and a network identification model associated with the video data, and performing mirror segmentation processing on a video sequence corresponding to the video data through a video segmentation component to obtain a mirror segmentation segment associated with the video sequence;
the segment generation module is further configured to input the segment into the network identification model, perform distance and near view analysis on each of the segments through the network identification model to obtain a distance and near view label of the segment, and use the segment with the distance and near view label as a first class of segment;
the segment generation module is further configured to perform face detection on each of the first class of partial mirror segments by the network identification model, and if a face detection result indicates that a face of a target role exists in the first class of partial mirror segments, determine, in the first class of partial mirror segments, a partial mirror segment corresponding to the face of the target role as a second class of partial mirror segments, and determine, by the network identification model, a role label to which the target role in the second class of partial mirror segments belongs; the target roles are one or more roles in the target video;
the segment generating module is further configured to determine, in the first class of partial mirror segments, partial mirror segments other than the second class of partial mirror segments as third class of partial mirror segments, and perform scene detection on each partial mirror segment in the first class of partial mirror segments by using the network identification model to obtain a scene tag of the third class of partial mirror segments;
the segment generating module is further configured to determine, according to the far and near scene tags, the role tags and the scene tags, a segment attribute tag corresponding to each of the split segments, and determine a split segment with a segment attribute tag as a video segment of the video data; one video clip corresponds to one clip attribute tag;
the template acquisition module is used for acquiring a user portrait of the target user, determining a video template associated with the target user based on the user portrait, and acquiring a template fragment mapped by the video template and a template tag sequence corresponding to the template fragment;
a material determining module, configured to screen, based on the template segment and the template tag sequence, a video segment that meets a segment matching condition from among the video segments, and use the video segment that meets the segment matching condition as a video material segment of the target video; a target label sequence formed by the segment attribute labels of the video material segments is the same as the template label sequence;
and the data sending module is used for pushing the video data and the video material segments to an application client corresponding to the target user so that the application client outputs the video data and the video material segments.
11. A video data processing apparatus, comprising:
the data acquisition module is used for responding to the play operation executed by a target user aiming at a target video in the application client, and acquiring the video data of the target video and a video material segment associated with the target video from a server; the video material segments are obtained by screening the video segments of the target video by the server according to the template segments of the video template and the template label sequences corresponding to the template segments; the server is used for acquiring video data of a target video requested by a target user and a network identification model associated with the video data, and performing mirror segmentation processing on a video sequence corresponding to the video data through a video segmentation component to obtain a mirror segmentation segment associated with the video sequence; the server is further used for inputting the lens segments into the network identification model, performing far and near view analysis on each lens segment in the lens segments through the network identification model to obtain a far and near view label of the lens segment, and taking the lens segment with the far and near view label as a first type lens segment; the server is further configured to perform face detection on each of the first class of partial mirror segments by the network identification model, and if a face detection result indicates that a face of a target role exists in the first class of partial mirror segments, determine, in the first class of partial mirror segments, a partial mirror segment corresponding to the face of the target role as a second class of partial mirror segments, and determine, by the network identification model, a role label to which the target role in the second class of partial mirror segments belongs; the target roles are one or more roles in the target video; the server is further configured to determine, in the first class of partial mirror segments, partial mirror segments other than the second class of partial mirror segments as third class of partial mirror segments, and perform scene detection on each partial mirror segment in the first class of partial mirror segments by using the network identification model to obtain a scene tag of the third class of partial mirror segments; the server is further configured to determine, according to the far and near scene tags, the role tags and the scene tags, a segment attribute tag corresponding to each of the segment-based frames, and determine the segment-based frame with the segment attribute tag as a video segment of the video data; one video clip corresponds to one clip attribute tag; the video template is determined by the server based on a user representation of the target user;
and the data output module is used for outputting the video data and the video material segments in an application display interface of the application client.
12. A computer device, comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a data communication function, the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the method of any one of claims 1 to 9.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded by a processor and to perform the method of any of claims 1-9.
CN202011390109.2A 2020-12-02 2020-12-02 Video data processing method, device, equipment and medium Active CN112565825B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011390109.2A CN112565825B (en) 2020-12-02 2020-12-02 Video data processing method, device, equipment and medium
PCT/CN2021/133035 WO2022116888A1 (en) 2020-12-02 2021-11-25 Method and device for video data processing, equipment, and medium
US17/951,621 US12094209B2 (en) 2020-12-02 2022-09-23 Video data processing method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011390109.2A CN112565825B (en) 2020-12-02 2020-12-02 Video data processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112565825A CN112565825A (en) 2021-03-26
CN112565825B true CN112565825B (en) 2022-05-13

Family

ID=75047852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011390109.2A Active CN112565825B (en) 2020-12-02 2020-12-02 Video data processing method, device, equipment and medium

Country Status (3)

Country Link
US (1) US12094209B2 (en)
CN (1) CN112565825B (en)
WO (1) WO2022116888A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112565825B (en) 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN113259708A (en) * 2021-04-06 2021-08-13 阿里健康科技(中国)有限公司 Method, computer device and medium for introducing commodities based on short video
CN115278296B (en) * 2021-04-29 2024-06-07 汉海信息技术(上海)有限公司 Video generation method and device and electronic equipment
CN113762040B (en) * 2021-04-29 2024-05-10 腾讯科技(深圳)有限公司 Video identification method, device, storage medium and computer equipment
CN115525780A (en) * 2021-06-24 2022-12-27 北京字跳网络技术有限公司 Template recommendation method, device, equipment and storage medium
CN113852858A (en) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 Video processing method and electronic equipment
CN113794930B (en) * 2021-09-10 2023-11-24 中国联合网络通信集团有限公司 Video generation method, device, equipment and storage medium
CN114095668A (en) * 2021-10-08 2022-02-25 深圳市景阳科技股份有限公司 Video playing method, device, equipment and computer storage medium
CN113691836B (en) * 2021-10-26 2022-04-01 阿里巴巴达摩院(杭州)科技有限公司 Video template generation method, video generation method and device and electronic equipment
CN114463673B (en) * 2021-12-31 2023-04-07 深圳市东信时代信息技术有限公司 Material recommendation method, device, equipment and storage medium
CN114496173A (en) * 2021-12-31 2022-05-13 北京航天长峰股份有限公司 Short video operation report generation method and device, computer equipment and storage medium
CN114666657B (en) * 2022-03-18 2024-03-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN114928753A (en) * 2022-04-12 2022-08-19 广州阿凡提电子科技有限公司 Video splitting processing method, system and device
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115086760A (en) * 2022-05-18 2022-09-20 阿里巴巴(中国)有限公司 Live video editing method, device and equipment
CN115278306B (en) * 2022-06-20 2024-05-31 阿里巴巴(中国)有限公司 Video editing method and device
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment
CN115119050B (en) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
US20240045913A1 (en) * 2022-08-03 2024-02-08 Capital One Services, Llc Systems and methods for active web-based content filtering
CN115659027B (en) * 2022-10-28 2023-06-20 广州彩蛋文化传媒有限公司 Recommendation method and system based on short video data tag and cloud platform
CN115880404A (en) * 2022-12-05 2023-03-31 广东量子起源科技有限公司 Meta-universe virtual interaction method based on illusion engine
CN116304179B (en) * 2023-05-19 2023-08-11 北京大学 Data processing system for acquiring target video
CN116866498B (en) * 2023-06-15 2024-04-05 天翼爱音乐文化科技有限公司 Video template generation method and device, electronic equipment and storage medium
CN117156079B (en) * 2023-11-01 2024-01-23 北京美摄网络科技有限公司 Video processing method, device, electronic equipment and readable storage medium
CN117714797B (en) * 2023-12-26 2024-07-12 大脑工场文化产业发展有限公司 Short video making method and system based on artificial intelligence
CN118378180B (en) * 2024-06-27 2024-09-10 普益智慧云科技(成都)有限公司 Financial big data analysis method and system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140101551A1 (en) * 2012-10-05 2014-04-10 Google Inc. Stitching videos into an aggregate video
CN103593464B (en) * 2013-11-25 2017-02-15 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
CN106686404B (en) * 2016-12-16 2021-02-02 中兴通讯股份有限公司 Video analysis platform, matching method, and method and system for accurately delivering advertisements
CN108024145B (en) * 2017-12-07 2020-12-11 北京百度网讯科技有限公司 Video recommendation method and device, computer equipment and storage medium
CN110139159B (en) * 2019-06-21 2021-04-06 上海摩象网络科技有限公司 Video material processing method and device and storage medium
CN110602546A (en) * 2019-09-06 2019-12-20 Oppo广东移动通信有限公司 Video generation method, terminal and computer-readable storage medium
CN110855904B (en) * 2019-11-26 2021-10-01 Oppo广东移动通信有限公司 Video processing method, electronic device and storage medium
CN111105819B (en) * 2019-12-13 2021-08-13 北京达佳互联信息技术有限公司 Clipping template recommendation method and device, electronic equipment and storage medium
CN111866585B (en) * 2020-06-22 2023-03-24 北京美摄网络科技有限公司 Video processing method and device
CN112565825B (en) * 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium

Also Published As

Publication number Publication date
WO2022116888A1 (en) 2022-06-09
US12094209B2 (en) 2024-09-17
US20230012732A1 (en) 2023-01-19
CN112565825A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112565825B (en) Video data processing method, device, equipment and medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
WO2021238631A1 (en) Article information display method, apparatus and device and readable storage medium
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
US20170065888A1 (en) Identifying And Extracting Video Game Highlights
US20220107978A1 (en) Method for recommending video content
CN113766299B (en) Video data playing method, device, equipment and medium
CN103686344A (en) Enhanced video system and method
CN114342353A (en) Method and system for video segmentation
CN103426003A (en) Implementation method and system for enhancing real interaction
CN113515998B (en) Video data processing method, device and readable storage medium
CN110163673B (en) Machine learning-based heat prediction method, device, equipment and storage medium
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
EP1067786A1 (en) Data describing method and data processor
CN111491187A (en) Video recommendation method, device, equipment and storage medium
CN115171014B (en) Video processing method, video processing device, electronic equipment and computer readable storage medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113395584B (en) Video data processing method, device, equipment and medium
CN114139491A (en) Data processing method, device and storage medium
CN111597361B (en) Multimedia data processing method, device, storage medium and equipment
CN112165626B (en) Image processing method, resource acquisition method, related equipment and medium
CN113705209A (en) Subtitle generating method and device, electronic equipment and storage medium
CN114329063B (en) Video clip detection method, device and equipment
KR20230028837A (en) Method of creating makeup video recommendation content by calculating makeup products and poses in beauty videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40041108
Country of ref document: HK

GR01 Patent grant