WO2022116888A1 - Video data processing method, apparatus, device, and medium - Google Patents

Video data processing method, apparatus, device, and medium

Info

Publication number
WO2022116888A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
template
target
segment
attribute
Prior art date
Application number
PCT/CN2021/133035
Other languages
English (en)
French (fr)
Inventor
郭卉
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022116888A1
Priority to US17/951,621 (published as US20230012732A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239 Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H04N21/2393 Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258 Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866 Management of end-user data
    • H04N21/25891 Management of end-user data being end-user preferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44012 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/482 End-user interface for program selection
    • H04N21/4826 End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/485 End-user interface for client configuration

Definitions

  • the present application relates to the field of computer technology, and in particular, to a video data processing method, apparatus, device, and medium.
  • A short video refers to video content played on various video playback platforms that is suitable for viewing while on the move or during short periods of leisure.
  • Embodiments of the present application provide a video data processing method, apparatus, device, and medium.
  • Through video analysis (for example, shot segmentation and attribute analysis), one or more video segments carrying segment attribute tags can be quickly obtained.
  • A video template determined from the user portrait of the target user is used to match the attribute tags of the video segments of the target video and generate the video material segments of the target video. As video templates are added and updated, the already obtained shot and attribute information can be reused, which reduces repeated identification and processing of video frames in the target video, improves the efficiency of short video generation, and saves the computing cost of continuously generating and distributing large numbers of short videos for different users, thereby saving the computing resources of the server.
  • One aspect of the embodiments of the present application provides a video data processing method, including:
  • acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain a plurality of video clips, where the video analysis includes shot segmentation and attribute analysis based on a plurality of preset clip attribute tags, and each video clip in the plurality of video clips corresponds to a clip attribute tag and a shot segment;
  • based on a user portrait of the target user, determining the video template associated with the target user from the video template database, and acquiring at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment;
  • based on the template attribute tags of the at least one template segment and the clip attribute tags corresponding to the plurality of video clips, screening, from the plurality of video clips, at least one video clip that matches the template attribute tags of the at least one template segment, and splicing the matched at least one video clip according to the position, in the template tag sequence, of the template attribute tag of each template segment in the at least one template segment, to obtain the video material clip of the target video;
  • the video data and the video material clips are pushed to the application client corresponding to the target user, so that the application client outputs the video data and the video material clips.
  • the embodiments of the present application provide a video data processing apparatus, including:
  • the segment generation module is configured to obtain the video data of the target video requested by the target user and to perform video analysis on the video data to obtain multiple video segments, where the video analysis includes shot segmentation and attribute analysis based on multiple preset segment attribute tags, and each of the multiple video segments corresponds to a segment attribute tag and a shot segment;
  • the template acquisition module is configured to determine, based on the user portrait of the target user, the video template associated with the target user from the video template database, and to obtain at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment;
  • the material determination module is configured to screen, from the plurality of video segments and based on the template attribute tags of the at least one template segment and the segment attribute tags corresponding to the plurality of video segments, at least one video segment that matches the template attribute tags of the at least one template segment, and to splice the matched at least one video segment according to the position, in the template tag sequence, of the template attribute tag of each template segment in the at least one template segment, as the video material segment of the target video (a sketch of this matching and splicing step is given after this list of modules);
  • the data sending module is used for pushing the video data and the video material clips to the application client corresponding to the target user, so that the application client outputs the video data and the video material clips.
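  • For illustration only, the matching and splicing performed by the material determination module can be sketched as follows in Python; the Segment structure, the exact-match rule, and the first-unused selection are simplifying assumptions rather than the concrete implementation of this application:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    start: float        # start time in seconds
    end: float          # end time in seconds
    attribute_tag: str  # segment attribute tag, e.g. "long shot, male lead"

def build_material_segment(video_segments: List[Segment],
                           template_tag_sequence: List[str]) -> Optional[List[Segment]]:
    """Pick, for each template attribute tag in template order, one unused video
    segment whose segment attribute tag matches it; the picked segments, kept in
    template order, form the video material segment to be spliced."""
    used = set()
    material: List[Segment] = []
    for template_tag in template_tag_sequence:
        idx = next((i for i, seg in enumerate(video_segments)
                    if i not in used and seg.attribute_tag == template_tag), None)
        if idx is None:
            return None  # the template cannot be filled from the available segments
        used.add(idx)
        material.append(video_segments[idx])
    return material  # splice these segments in order to obtain the material segment
```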
  • One aspect of the embodiments of the present application provides a video data processing method, including:
  • in response to a playback operation performed by the target user on the target video in the application client, the video data of the target video and the video material clips associated with the target video are obtained from the server; the video material clips are obtained by the server by: performing video analysis on the video data to obtain multiple video clips, where the video analysis includes shot segmentation and attribute analysis based on a plurality of preset clip attribute tags, and each video clip in the multiple video clips corresponds to a clip attribute tag and a shot segment (that is to say, each video clip is a shot segment corresponding to a clip attribute tag); determining, based on the user portrait of the target user, the video template associated with the target user from the video template database, and obtaining at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment; screening, from the multiple video clips and based on the template attribute tags of the at least one template segment and the clip attribute tags corresponding to the multiple video clips, at least one video clip that matches the template attribute tags of the at least one template segment; and splicing the matched at least one video clip according to the position, in the template tag sequence, of the template attribute tag of each template segment in the at least one template segment;
  • the video data and the video material clips are output in the application display interface of the application client.
  • the embodiments of the present application provide a video data processing apparatus, including:
  • the data acquisition module is configured to respond to the playback operation performed by the target user on the target video in the application client, and to obtain from the server the video data of the target video and the video material clips associated with the target video; the video material clips are obtained by the server by: performing video analysis on the video data to obtain multiple video clips, where the video analysis includes shot segmentation and attribute analysis based on multiple preset clip attribute tags, and each video clip in the multiple video clips corresponds to a clip attribute tag and a shot segment; determining, based on the user portrait of the target user, the video template associated with the target user from the video template database, and obtaining at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment; screening, from the multiple video clips and based on the template attribute tags of the at least one template segment and the clip attribute tags corresponding to the multiple video clips, at least one video clip that matches the template attribute tags of the at least one template segment; and splicing the matched at least one video clip according to the position, in the template tag sequence, of the template attribute tag of each template segment in the at least one template segment;
  • the data output module is used for outputting video data and video material clips in the application display interface of the application client.
  • An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the method provided by the embodiments of the present application.
  • An aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, the processor executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method provided by the embodiments of the present application.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a scenario for data interaction provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a video data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a scene for querying video data provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a scenario of shot segmentation provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of extracting a segment attribute tag provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a scene for acquiring a video template provided by an embodiment of the present application.
  • FIG. 8A is a schematic diagram of a scene for performing video analysis on a template video according to an embodiment of the present application.
  • FIG. 8B is a schematic diagram of a scene for performing video analysis on a target video provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a video data processing method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of generating video material clips provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a front-end and back-end interaction provided by an embodiment of the present application.
  • FIG. 12A is a schematic diagram of a scene of outputting video material clips according to an embodiment of the present application.
  • FIG. 12B is a schematic diagram of a scene for updating a video material segment provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a video data processing apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a video data processing apparatus provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 16 is a video data processing system provided by an embodiment of the present application.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer vision is a science that studies how to make machines "see". More specifically, it refers to machine vision in which cameras and computers are used instead of human eyes to identify, track, and measure targets, with further graphics processing so that the processed result becomes an image more suitable for human observation or for transmission to an instrument for detection.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies. It also includes common biometric identification technologies such as face recognition and fingerprint recognition.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a service server 2000 and a cluster of user terminals.
  • the user terminal cluster may specifically include one or more user terminals, and the number of user terminals in the user terminal cluster will not be limited here.
  • the multiple user terminals may specifically include a user terminal 3000a, a user terminal 3000b, a user terminal 3000c, . . . , and a user terminal 3000n.
  • The user terminal 3000a, the user terminal 3000b, the user terminal 3000c, ..., and the user terminal 3000n can be directly or indirectly connected to the service server 2000 through wired or wireless communication, so that each user terminal can perform data interaction with the service server 2000 through the network.
  • The service server 2000 shown in FIG. 1 may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms.
  • each user terminal in the user terminal cluster as shown in FIG. 1 can be integrated with an application client.
  • When the application client runs in each user terminal, it can perform data exchange with the service server 2000 shown in FIG. 1 above.
  • the application client can be understood as an application that can load and display video data.
  • The application client here may specifically include: vehicle clients, smart home clients, entertainment clients (eg, game clients), multimedia clients (eg, video clients), social clients, and information clients (eg, news clients), etc.
  • One user terminal may be selected from the multiple user terminals shown in FIG. 1 as the target user terminal.
  • The target user terminal may be a smart phone, a tablet computer, a notebook computer, a smart TV, or another smart terminal with a video data loading function.
  • the user terminal 3000a shown in FIG. 1 may be used as the target user terminal.
  • The target user terminal can respond to the trigger operation performed by user Y on the video by sending a video playback request to the service server 2000 shown in FIG. 1.
  • The service server 2000 can find the video data of the video in the video service database based on the video playback request, and can then return the video data, together with one or more video material segments associated with the video data (for example, short videos of this video data), to the target user terminal.
  • the target user terminal may also display the received video material clips when playing the video data.
  • The video material clips here can be obtained by the service server 2000 after screening the video clips of the target video according to the template clips of the selected video template and the template tag sequence corresponding to the template clips; it should be understood that the video clips here may be obtained after the service server 2000 performs video analysis on the video data; it should also be understood that the video templates here may be determined by the service server 2000 based on the user portrait of user Y (that is, the target user).
  • In this embodiment of the present application, the videos (eg, TV dramas or short videos) that are selected by the above-mentioned user Y (that is, the target user) in the application client (eg, the video client K) and that fit the user's own interests are collectively referred to as the target video.
  • the video material clips in this embodiment of the present application may be intelligently generated by the above-mentioned service server 2000 according to the above-mentioned template clips and template tag sequences of the above-mentioned video template.
  • the service server 2000 can intelligently generate one or more video clips of the target video (eg, TV drama S1 ) selected by the user Y in the target user terminal by using the video data processing method involved in the embodiment of the present application.
  • The generation process of the video material segment means that the service server 2000 can match the tag information of the video segments of the TV drama S1 (that is, the segment attribute tags) against the tag information of the template segments of the template video (for example, the video M) (that is, the template attribute tags), and then splice the matched video segments.
  • the above network framework is applicable to the field of artificial intelligence (ie, the field of AI), and the business scenarios corresponding to the AI field may be video classification scenarios, video recommendation scenarios, etc., and specific business scenarios will not be listed one by one here.
  • The video classification scenario here mainly means that a computer device (for example, the above-mentioned service server 2000) can use the above-mentioned video data processing method to store the video clips belonging to the same video in the first service database.
  • A computer device (for example, the above-mentioned service server 2000) may further add the video material segment A1 and the video material segment A2 to a corresponding short video recommendation database, where the short video recommendation database may at least include a first service database and a second service database.
  • The first service database here may be used to store one or more video material clips associated with the same video. For example, if the video material segment A1 and the video material segment A2 are both video segments of the same video (eg, video W), the video material segment A1 and the video material segment A2 may be added to the first service database corresponding to the video W.
  • If the video material segment A1 and the video material segment A2 belong to video segments of different videos, for example, if the target video corresponding to the video material segment A1 is the video W1 requested by the user Y1, the video material segment A1 can be added to the first service database corresponding to this video W1; if the target video corresponding to the video material segment A2 is the video W2 requested by the user Y2, the video material segment A2 can be added to the first service database corresponding to this video W2.
  • the second service database here may be used to store one or more video material clips associated with the same video template.
  • That is, among the video material clips of different videos, the clips generated using the same video template can be added to the same second service database (a grouping sketch is given after the examples below).
  • the computer device may further add the video material segment A to the second service database corresponding to the video template B.
  • For example, the video material clip A may be added to the second service database corresponding to the expression collection category, to the second service database corresponding to the storyline highlight category, or to the second service database corresponding to the character mixed-cut collection category, depending on the category of the video template B.
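  • For illustration, the two groupings described above can be sketched as follows; the dictionary-based databases and the field names (clip_id, video_id, template_id) are illustrative assumptions rather than the concrete storage used by the service server:

```python
from collections import defaultdict

# Each generated material clip is assumed to record which video and which
# video template (or template category) it was generated from.
material_clips = [
    {"clip_id": "A1", "video_id": "W", "template_id": "B1"},
    {"clip_id": "A2", "video_id": "W", "template_id": "B2"},
    {"clip_id": "A3", "video_id": "V", "template_id": "B1"},
]

first_service_db = defaultdict(list)   # keyed by source video
second_service_db = defaultdict(list)  # keyed by video template / category

for clip in material_clips:
    first_service_db[clip["video_id"]].append(clip["clip_id"])
    second_service_db[clip["template_id"]].append(clip["clip_id"])

# first_service_db["W"]   -> ["A1", "A2"]  (clips of the same video)
# second_service_db["B1"] -> ["A1", "A3"]  (clips using the same template)
```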
  • When the computer device (for example, the above-mentioned service server 2000) intelligently generates, through the above-mentioned video data processing method, the video material segments of the target video requested by the target user (for example, the video material segments generated based on the video template B1), the computer device can add these video segments of the same video W (that is, the video segments in the above-mentioned first service database, for example, the video material segment A1 and the video material segment A2) to a short video recommendation list (for example, the short video recommendation list 1).
  • the application client can also intelligently traverse and play these video clips in the above-mentioned short video recommendation list 1 for the above-mentioned target user.
  • The computer device (for example, the above-mentioned service server 2000) may also output the video material clip A1 among the multiple video material clips in the short video recommendation list 1 to the application client, so as to realize the intelligent playback of the video material segment A1 in the application client.
  • the video material segment A1 and the template segment mapped to the above-mentioned video template B1 have similar video playback effects.
  • Similarly, when the computer device (for example, the above-mentioned service server 2000) intelligently generates video material segments through the above-mentioned video data processing method (for example, the video material segments A1 and A3 generated based on the video template B1), the computer device can add these video clips that use the same video template B1 (that is, the video clips in the above-mentioned second service database, for example, the video material clip A1 and the video material clip A3) to another short video recommendation list (for example, the short video recommendation list 2).
  • In this way, the application client can also intelligently traverse and play these video clips in the short video recommendation list 2 for the target user in the application client.
  • The computer device (for example, the above-mentioned service server 2000) can also output other video clips in the short video recommendation list 2 (for example, the video material segment A3) to the application client, so as to realize the intelligent playback of the video material segment A3 in the application client.
  • FIG. 2 is a schematic diagram of a data interaction scenario provided by an embodiment of the present application.
  • The server shown in FIG. 2 may be the service server 2000 in the embodiment corresponding to FIG. 1, and the user terminal X shown in FIG. 2 may be any user terminal in the user terminal cluster in the embodiment corresponding to FIG. 1.
  • the embodiment of the present application takes the user terminal 3000a shown in FIG. 1 as the user terminal X as an example to illustrate the specific process of recommending video material clips for target users when the service scenario is a video recommendation scenario.
  • the video recommendation interface 200a may include multiple recommended video data, and the multiple recommended video data here may specifically include the video data 20a, 20b, 20c, and 20d shown in FIG. 2 . It can be understood that, in this embodiment of the present application, the video data 20a, the video data 20b, the video data 20c, and the video data 20d displayed in the video recommendation interface 200a may be collectively referred to as recommended video data.
  • The video data 20b selected by the target user from the video recommendation interface 200a may be referred to as the target video in the application display interface.
  • the user terminal may send a video playback request to the server shown in FIG. 2 in response to the playback operation for the target video in the application display interface.
  • The server may respond to the video playback request so that the application client outputs the video playback interface corresponding to the target video; for example, the application client may output the video playback interface corresponding to the video data 20b, and the video playback interface corresponding to the video data 20b may be the video playback interface 200b shown in FIG. 2.
  • The application display interface may include the video playback interface 200b for playing the target video, and may also include a short video recommendation list for displaying video clips, and the short video recommendation list may include the video material clips associated with the target video.
  • the server when receiving the video playback request sent by the target user through the user terminal, can obtain the video ID of the target video from the video playback request, and query the video service database for the video of the target video according to the video ID. data. After querying the video data of the target video, the server may perform the above-mentioned video analysis on the video sequence of the video data to obtain the video segment of the video data.
  • The video segments here may specifically include the video segment 100a, the video segment 100b, ..., and the video segment 100k shown in FIG. 2, where each video segment may correspond to a segment attribute tag.
  • The server can obtain a video template that fits the viewing interests of the target user, and can then obtain the template segments mapped by the video template and the template tag sequence corresponding to the template segments, so as to screen, for the target user, the video segments matching each template segment (that is, the video segments that meet the segment matching conditions) from the above-mentioned video segments, and then obtain the video material segment based on these screened video segments that meet the segment matching conditions.
  • In other words, video segments that have the same tag sequence characteristics as the template segments are obtained as far as possible from these video segments, and the above-mentioned video material segment is then obtained by filling them in according to the same tag sequence (that is, the above-mentioned template tag sequence). In this way, one or more short videos of the above target video can be obtained.
  • the user terminal can output the video material segment and the above video data to the application client.
  • a video template can correspond to one or more video material clips.
  • The number of video clips with the same tag sequence feature selected from the video clips of the target video will not be limited here.
  • The embodiment of the present application takes one video template corresponding to one video material segment as an example. When the server determines that there are multiple (for example, N) video templates that fit the viewing interests of the target user, this embodiment of the present application can also intelligently generate N video material clips through the N video templates. It should be understood that, for the specific implementation of intelligently generating the other video material segments through the N video templates, reference may be made to the description of the specific process of intelligently generating the above-mentioned video material segment, which will not be repeated here.
  • FIG. 3 is a schematic flowchart of a video data processing method provided by an embodiment of the present application.
  • the method may be executed by the application client, may also be executed by the server, or may be executed jointly by the application client and the server.
  • the application client may be the application client running in the user terminal X in the embodiment corresponding to FIG. 2 above
  • the server may be the server in the embodiment corresponding to FIG. 2 above.
  • this embodiment is described by taking the method executed by the server as an example, to illustrate the specific process of generating the video material segment corresponding to the target video in the server based on the video template.
  • the method may include at least the following steps S101-S105:
  • Step S101: acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain a plurality of video clips, where the video analysis includes shot segmentation and attribute analysis based on a plurality of preset clip attribute tags, and each video clip in the plurality of video clips corresponds to a clip attribute tag and a shot segment.
  • The server may acquire the video data of the target video requested by the target user and a network recognition model associated with the video data. Further, the server may perform shot segmentation on the video sequence corresponding to the video data through the video segmentation component to obtain multiple shot segments associated with the video sequence. Further, the server may input the multiple shot segments into the network recognition model, and the network recognition model performs attribute analysis on the multiple shot segments based on a plurality of preset segment attribute tags to obtain the segment attribute tags corresponding to the multiple shot segments. Further, the server may determine the multiple shot segments carrying segment attribute tags as the multiple video segments of the video data, where one video segment corresponds to one segment attribute tag.
  • the server may receive a video playback request sent by the application client.
  • the video playback request is generated by the application client in response to a playback operation performed by the target user on the target video.
  • The server can extract the video identifier of the target video from the video playback request, search for the service video data corresponding to the target video in the video service database based on the video identifier, and use the found service video data as the video data of the target video in the application client.
  • FIG. 4 is a schematic diagram of a scene for querying video data provided by an embodiment of the present application.
  • the application display interface 400a here may be the application display interface 200a in the embodiment corresponding to FIG. 2 above.
  • The application client may take the video data 40b as the target video and send a video playback request carrying the video identifier of the video data 40b to the server. The server can then receive the video playback request sent by the application client, obtain the video identifier of the video data 40b carried in the video playback request, search, based on the video identifier, the video service database corresponding to the application client for the service video data corresponding to the video identifier, and use the found service video data as the video data corresponding to the video data 40b.
  • the target video here can be a long video such as variety shows, movies, TV dramas, etc., or a short video intercepted from a long video, which is not limited in this application.
  • The specific process in which the server performs shot segmentation on the video sequence corresponding to the video data through the video segmentation component to obtain multiple shot segments associated with the video sequence can be described as follows:
  • The video segmentation component determines, in the video sequence, the first video frame to be used as a cluster centroid, and creates the shot cluster information of the shot cluster to which the first video frame belongs (it can be understood that the shot cluster information here may be an identifier configured for the corresponding shot cluster).
  • The server may determine the video frames other than the first video frame in the video sequence as second video frames, and may sequentially acquire each second video frame based on a polling mechanism to determine the image similarity between each second video frame and the first video frame.
  • The server may classify a second video frame whose image similarity is greater than or equal to the clustering threshold into the shot cluster to which the first video frame belongs.
  • The server may update the first video frame with a second video frame whose image similarity is less than the clustering threshold (for example, taking that second video frame as the updated first video frame), create the shot cluster information of another shot cluster to which the updated first video frame belongs, and then sequentially perform image similarity matching between the updated first video frame and the second video frames that have not yet been matched, until all video frames in the video sequence have completed image similarity matching, thereby obtaining the shot cluster information of the shot cluster to which each video frame in the video sequence belongs (that is, the shot cluster into which each video frame in the video sequence is divided).
  • The server may determine the shot segments associated with the video sequence based on the shot cluster information of the shot cluster to which each video frame in the video sequence belongs, that is, combine the video frames in the video sequence into multiple shot segments (sketched below).
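  • A minimal sketch of this sequential clustering, assuming a frame_similarity function (any of the similarity measures discussed below) and an illustrative clustering threshold:

```python
def split_into_shots(frames, frame_similarity, clustering_threshold=0.85):
    """Group an ordered video sequence into shot segments.

    The first frame becomes the current cluster centroid; each following frame
    is compared against the centroid.  A similar frame joins the current shot
    cluster, a dissimilar frame starts a new shot cluster and becomes its centroid.
    """
    if not frames:
        return []
    shots = [[frames[0]]]      # shot cluster 1, centroid = frames[0]
    centroid = frames[0]
    for frame in frames[1:]:
        if frame_similarity(centroid, frame) >= clustering_threshold:
            shots[-1].append(frame)   # same shot cluster
        else:
            centroid = frame          # this frame becomes the updated first video frame
            shots.append([frame])     # new shot cluster
    return shots                      # each inner list is one shot segment
```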
  • Image similarity matching refers to calculating the similarity of content between two images, so as to obtain an image similarity value used to judge how similar the images are: a larger image similarity indicates that the two images are more similar, and a smaller image similarity indicates that the two images are less similar. The similarity of content between two images can be measured using different methods.
  • In the case of cosine similarity, a picture can be represented as a vector, and the similarity between two pictures can be represented by calculating the cosine distance between their vectors;
  • A histogram can describe the global distribution of colors in an image, and histogram-based image similarity is another image similarity calculation method;
  • structural similarity is a full-reference image quality evaluation index, which measures image similarity from three aspects: brightness, contrast, and structure. It should be understood that the present application does not limit the method specifically used in image similarity matching.
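  • For illustration, two of these measures can be sketched with OpenCV and NumPy as follows; the histogram size, the resize resolution, and the use of these particular functions are assumptions for the sketch rather than the measure actually adopted by this application:

```python
import cv2
import numpy as np

def histogram_similarity(img_a, img_b, bins=32):
    """Compare the global color distribution of two BGR images."""
    hists = []
    for img in (img_a, img_b):
        hist = cv2.calcHist([img], [0, 1, 2], None,
                            [bins, bins, bins], [0, 256, 0, 256, 0, 256])
        hists.append(cv2.normalize(hist, hist).flatten())
    # correlation in [-1, 1]; closer to 1 means more similar
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

def cosine_similarity(img_a, img_b, size=(64, 64)):
    """Represent each picture as a vector and compare with cosine similarity."""
    vecs = [cv2.resize(img, size).astype(np.float32).ravel() for img in (img_a, img_b)]
    return float(np.dot(vecs[0], vecs[1]) /
                 (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]) + 1e-8))
```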
  • FIG. 5 is a schematic diagram of a scene of shot segmentation provided by an embodiment of the present application.
  • The video sequence shown in FIG. 5 may include multiple video frames, specifically the n video frames shown in FIG. 5, where n may be a positive integer greater than 1, and the n video frames may specifically include: video frame 10a, video frame 10b, video frame 10c, video frame 10d, ..., video frame 10n.
  • The image similarity between the video frames in the video sequence can be calculated by the clustering algorithm, so that the video frames in the video sequence can be divided into different clusters (that is, shot clusters) based on the calculated image similarity between the video frames.
  • In this way, k clusters (that is, k shot clusters) can be obtained, and the k clusters can specifically include the cluster 20a, the cluster 20b, ..., and the cluster 20k shown in FIG. 5.
  • each of the k clusters shown in FIG. 5 may include at least one video frame.
  • The present application may take the first video frame in the video sequence (that is, the video frame 10a) as the cluster centroid (that is, the cluster centroid 1) and refer to it as the first video frame, and the video frames other than the video frame 10a in the video sequence can be determined as second video frames. The second video frames (that is, the video frame 10b, the video frame 10c, the video frame 10d, ..., the video frame 10n) can then be acquired in turn, so as to sequentially calculate the image similarity between the first video frame and each second video frame.
  • The present application can create the shot cluster (that is, shot cluster 1) to which the cluster centroid 1 belongs, and then perform image similarity matching between the video frame 10b and the video frame 10a. If the image similarity (for example, similarity 1) is greater than or equal to the clustering threshold, the video frame 10b corresponding to the similarity 1 is divided into the shot cluster to which the video frame 10a belongs.
  • Similarly, the present application may divide the video frame 10c into the shot cluster (that is, shot cluster 1) to which the video frame 10a belongs.
  • Next, the present application can perform image similarity matching between the video frame 10d and the video frame 10a. If the image similarity (for example, similarity 2) is less than the clustering threshold, the first video frame is updated according to the video frame 10d, so that the video frame 10d corresponding to the similarity 2 is used as the updated first video frame and the new cluster centroid (that is, the cluster centroid 2), and the shot cluster (that is, shot cluster 2) to which the cluster centroid 2 belongs can be created. Further, based on the above polling mechanism, the second video frames that have not yet been matched (that is, the video frame 10e, ..., the video frame 10n) can be obtained in turn for image similarity matching against the updated first video frame.
  • Similarly, the present application may divide the video frame 10e into the shot cluster (that is, shot cluster 2) to which the video frame 10d belongs.
  • By analogy, the present application can use the same method to obtain the cluster centroid 3, the cluster centroid 4, ..., and the cluster centroid k, and can likewise obtain shot cluster 3 (that is, the cluster 20c), shot cluster 4 (that is, the cluster 20d), ..., and shot cluster k (that is, the cluster 20k).
  • the video frame 10a, the video frame 10b, . . . , and the video frame 10n in the video sequence have all completed the image similarity matching.
  • After the clustering processing (that is, the shot segmentation) is completed, multiple clusters (that is, shot clusters) associated with the video sequence can be obtained, and the video frames in each cluster can then be formed into a shot segment, yielding the k shot segments shown in FIG. 5.
  • For example, the video frame 10a, the video frame 10b, and the video frame 10c in the cluster 20a can be formed into the shot segment corresponding to shot cluster 1 (that is, shot segment 1); the video frame 10d and the video frame 10e in the cluster 20b can be formed into the shot segment corresponding to shot cluster 2 (that is, shot segment 2); ...; and the video frame 10(n-2), the video frame 10(n-1), and the video frame 10n in the cluster 20k can be formed into the shot segment corresponding to shot cluster k (that is, shot segment k).
  • The video segmentation component that divides the video sequence corresponding to the target video into multiple shot segments can be the PySceneDetect open source library, which is a tool for automatically dividing video data into individual segments (a usage sketch is given after this discussion).
  • The selection of the video frame used as the cluster centroid is not limited to the above-mentioned manner.
  • The method for dividing the video sequence corresponding to the target video into multiple shot segments can also be based on drum beat recognition, for example, acquiring the audio data of the target video, identifying the drum beats in the audio data, and determining, according to the positions of the drum beats in the audio data, the corresponding positions in the video data of the target video, so as to divide the video sequence of the video data at these positions.
  • The video sequence may also be divided into a plurality of shot segments in other manners, and the present application does not limit the specific method used for shot segmentation.
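  • As a usage sketch (not part of this application), shot boundaries can be obtained with the PySceneDetect library mentioned above, assuming a recent release that provides the detect convenience function; the file name and threshold below are placeholders:

```python
# pip install scenedetect[opencv]
from scenedetect import detect, ContentDetector

# Detect cuts based on changes in frame content; each entry is a
# (start_timecode, end_timecode) pair describing one shot segment.
scene_list = detect("target_video.mp4", ContentDetector(threshold=27.0))

for i, (start, end) in enumerate(scene_list, 1):
    print(f"shot segment {i}: {start.get_timecode()} - {end.get_timecode()}")
```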
  • The above-mentioned network identification model includes at least: a first network model with a first attribute label extraction function, a second network model with a second attribute label extraction function, and a third network model with a third attribute label extraction function.
  • The server may input the plurality of shot segments into the first network model, perform far and near view analysis on each of the plurality of shot segments through the first network model to obtain the far and near view labels of the multiple shot segments, take the far and near view labels of the multiple shot segments as the first attribute labels output by the first network model, and use the shot segments carrying the first attribute labels as the first type of shot segments.
  • The server may input the first type of shot segments into the second network model, and the second network model may perform face detection on each shot segment of the first type of shot segments to obtain a face detection result. Further, if the face detection result indicates that the face of the target character exists in a first type of shot segment, the server may take the shot segments of the first type that correspond to the face of the target character as the second type of shot segments, determine, through the second network model, the character label to which the target character in the second type of shot segments belongs, and determine the character label to which the target character belongs as the second attribute label of the second type of shot segments.
  • the target character is one or more characters in the target video.
  • The server may determine the shot segments in the first type of shot segments other than the second type of shot segments as the third type of shot segments, and input the third type of shot segments into the third network model, and the third network model performs scene detection on each segment of the third type of shot segments to obtain the third attribute labels of the third type of shot segments.
  • The server may determine, according to the first attribute labels of the first type of shot segments, the second attribute labels of the second type of shot segments, and the third attribute labels of the third type of shot segments, the segment attribute tag corresponding to each of the multiple shot segments.
  • The first network model may be a far and near view recognition model.
  • the second network model may be a face recognition model
  • the third network model may be a scene recognition model.
  • the above-mentioned first network model, second network model and third network model may also be an expression recognition model, an action recognition model, etc., and the present application does not limit the specific types of network recognition models.
  • any deep learning model or machine learning model can be adopted as the network identification model, and the specific model used by the network identification model is not limited in this application. It should be understood that by using the pre-trained network identification model, the basic analysis capability of the target video can be improved, and then the video segment corresponding to the target video can be quickly obtained.
  • The face detection model can be used to perform face detection on the shot segment to obtain the face detection result, and then the character label corresponding to the face in the face detection result can be determined through the face recognition model.
  • the face detection model and the face recognition model here may be collectively referred to as the second network model.
  • face detection and face recognition can be collectively referred to as image detection.
  • Image detection means that machine learning technology learns annotated sample data (for example, the correspondence between multiple annotation boxes and labels in an image) to obtain a mathematical model. During training, the parameters of the mathematical model are obtained; during identification and prediction, the parameters of the mathematical model are loaded, the prediction boxes of the entity labels present in the input sample and the probability that each prediction box belongs to an entity label within the specified range are calculated, and the entity label with the greatest probability is then used as the label corresponding to the prediction box.
  • The far and near view labels (that is, the first attribute labels) can be obtained through the first network model, and the scene tags (that is, the third attribute tags) can be obtained through the third network model. When a shot segment is input into the second network model, all video frames in the shot segment can be extracted and face detection can be performed on these video frames; the feature vector of each detected face is then compared with the feature vector of the above-mentioned target character. If the similarity result obtained from the feature vector comparison is greater than the threshold, the face is considered to be the target character, and the character label of the target character is used as the character label (that is, the second attribute label) of the shot segment where the detected face is located (a sketch of this comparison is given below).
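  • A minimal sketch of this feature vector comparison; the helper callables detect_faces and embed_face, the reference embeddings, and the similarity threshold are illustrative assumptions, since the application does not prescribe a specific face model:

```python
import numpy as np

def assign_character_label(shot_frames, target_embeddings, detect_faces, embed_face,
                           threshold=0.7):
    """Return the character label of the first target character whose face
    embedding matches a face detected in the shot segment, else None."""
    for frame in shot_frames:
        for face in detect_faces(frame):                   # face crops / bounding boxes
            vec = embed_face(face)
            vec = vec / (np.linalg.norm(vec) + 1e-8)
            for label, ref in target_embeddings.items():   # e.g. {"male lead": ref_vec}
                ref = ref / (np.linalg.norm(ref) + 1e-8)
                if float(np.dot(vec, ref)) > threshold:    # cosine similarity check
                    return label                            # second attribute label
    return None
```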
  • In other words, the present application can, without knowing the label information of the shot segments (that is, without knowing any image noise information in advance), directly input the shot segments into the first network model (that is, the above-mentioned far and near view recognition model) to obtain the first attribute label corresponding to each shot segment, that is, automatically predict the sample noise level according to the model and give each shot segment a new label. The shot segments carrying the first attribute label may be collectively referred to as the first type of shot segments.
  • The new labels can further be automatically fed back to the subsequent models for learning, so that dynamic noise prediction and processing prevent the above-mentioned network identification model from falling into a local optimum and ensure that model learning proceeds in the direction of a better recognition effect.
  • The obtained first type of shot segments can also be reused when there is a new video template, without repeating the identification and processing of video frames, which in turn can save computing resources.
  • Further, the embodiment of the present application may input the first type of shot segments into the second network model, so that the second network model can perform face detection and face recognition on each of the first type of shot segments, and can then select, from these first type of shot segments, all the first type of shot segments containing the face of the target character. The selected first type of shot segments containing the face of the target character may be collectively referred to as the second type of shot segments. The second network model can also be used to output the character label to which the target character in each second type of shot segment belongs. The character labels to which the target characters belong may be collectively referred to as the second attribute labels of the second type of shot segments.
  • the target character here may be one or more characters in the target video, and the number of target characters will not be limited here.
  • The shot segments in the first type of shot segments other than the second type of shot segments may be collectively referred to as the third type of shot segments, and the third type of shot segments can then be input into the third network model (that is, the above-mentioned scene recognition model).
  • In this way, the label information to which the shot segments belong can be corrected in real time through the trained network recognition models, and the segment attribute tag of each shot segment can be further obtained.
  • FIG. 6 is a schematic flowchart of extracting a segment attribute tag provided by an embodiment of the present application.
  • The video data shown in FIG. 6 can be the video data of the above-mentioned target video, and the specific process of obtaining the label information of the shot segments can be described as follows: performing shot segmentation on the video sequence of the video data yields k shot segments; each shot segment can then be input into the network recognition models to obtain the label information of each shot segment under each network recognition model.
  • the network recognition model here can be the scene recognition model shown in FIG. 6 (ie the third network model), the long-distance and near-field recognition model (ie the first network model), the face detection model and the face recognition model (ie the second network model) ).
• the obtained far-and-near-view labels (i.e., the first attribute labels) corresponding to each storyboard segment may be: {storyboard 1: x1, storyboard 2: x2, ..., storyboard k: xk}, where x1 indicates that the far-and-near-view label corresponding to storyboard 1 is x1, x2 indicates that the far-and-near-view label corresponding to storyboard 2 is x2, ..., and xk indicates that the far-and-near-view label corresponding to storyboard k is xk.
• the above-mentioned far-and-near-view tags may include, but are not limited to: a long shot (far view), a near view of a character, a close-up of a character, a full view of an object, a close-up of an object, and the like.
  • mirror 1, mirror 2, ..., mirror k here may be mirror segment 1, mirror segment 2, ..., mirror segment k in the above-mentioned embodiment corresponding to FIG. 5 .
  • the k lens segments can be input into the face detection model and the face recognition model.
• the obtained character label (i.e., the second attribute label) corresponding to each storyboard segment can be: {storyboard 1: y1, storyboard 2: y2, storyboard 4: y4, ..., storyboard k-1: yk-1}, where y1 indicates that the character label corresponding to storyboard 1 is y1, y2 indicates that the character label corresponding to storyboard 2 is y2, y4 indicates that the character label corresponding to storyboard 4 is y4, ..., and yk-1 indicates that the character label corresponding to storyboard k-1 is yk-1.
  • the above-mentioned role labels may include but are not limited to: single person, double person, etc.; the above-mentioned role labels may also include but are not limited to: male first, male second, female first, female second, little girl A, little boy B, etc. Among them, storyboard 3, storyboard 5, ..., storyboard k do not include character tags.
• the storyboard segments for which no detection or recognition results were obtained can then be input into the scene recognition model.
• the obtained scene label (i.e., the third attribute label) corresponding to each of these storyboard segments can be: {storyboard 3: z3, storyboard 5: z5, ..., storyboard k: zk}, where z3 indicates that the scene label corresponding to storyboard 3 is z3, z5 indicates that the scene label corresponding to storyboard 5 is z5, ..., and zk indicates that the scene label corresponding to storyboard k is zk.
  • the above scene tags may include, but are not limited to, natural scenes, indoor scenes, character buildings, bamboo forests, riversides, amusement parks, and the like.
• the segment attribute tag of a storyboard segment can be jointly described by the far-and-near-view tag and role tag of the storyboard segment, or by its far-and-near-view tag and scene tag.
• for example, the far-and-near-view tag and role tag of storyboard 1 can be used to jointly describe the segment attribute tag of storyboard 1 (i.e., segment attribute tag 1);
• if the far-and-near-view label corresponding to storyboard 1 is a long shot (that is, x1 is the far view) and the character label corresponding to storyboard 1 is the male lead (that is, y1 is the male lead), then the segment attribute tag 1 corresponding to storyboard 1 can be: {far view, male lead}.
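• The label-extraction flow of FIG. 6 can be summarized by the sketch below; the three model calls are placeholders (the actual network structures are not specified here), and the segment identifiers and tag values are purely illustrative.

```python
from typing import Callable, Dict, List, Optional, Tuple

def extract_clip_attribute_tags(
    segments: List[str],
    shot_model: Callable[[str], str],            # far/near-view label (first attribute label)
    face_model: Callable[[str], Optional[str]],  # role label, or None if no target face (second attribute label)
    scene_model: Callable[[str], str],           # scene label (third attribute label)
) -> Dict[str, Tuple[str, str]]:
    """Describe each storyboard segment jointly by (shot label, role label) when a
    target character's face is recognized, otherwise by (shot label, scene label)."""
    tags = {}
    for seg in segments:
        shot_label = shot_model(seg)             # e.g. "far view", "character close-up"
        role_label = face_model(seg)             # e.g. "male lead", or None
        tags[seg] = (shot_label, role_label) if role_label else (shot_label, scene_model(seg))
    return tags

# Toy stand-ins for the three recognition models, for illustration only.
demo = extract_clip_attribute_tags(
    segments=["storyboard_1", "storyboard_3"],
    shot_model=lambda s: "far view",
    face_model=lambda s: "male lead" if s == "storyboard_1" else None,
    scene_model=lambda s: "bamboo forest",
)
print(demo)  # {'storyboard_1': ('far view', 'male lead'), 'storyboard_3': ('far view', 'bamboo forest')}
```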
• Step S102: based on the user portrait of the target user, determine the video template associated with the target user from the video template database, and obtain at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment.
  • the server may acquire the behavior log table of the target user, and extract behavior data information associated with the target user from the behavior log table. Further, the server may perform user portrait analysis on the behavior data information to obtain a user portrait for representing the target user, and based on the user portrait of the target user, determine a video template associated with the target user from the video template database.
  • the video template may carry a template tag sequence formed by template attribute tags of template segments, the template segments are obtained after performing video analysis on the template videos, and the template videos are determined by behavior data information.
  • the video template database may be on the server, or on other devices independent of the server. Further, the server may acquire at least one template segment included in the video template and a template tag sequence composed of template attribute tags of the at least one template segment. It can be understood that, in this embodiment of the present application, the behavior logs of different users in the application client obtained by the server within the target duration may be collectively referred to as a behavior log table.
  • the behavior data information is used to record behavior interaction data (visit, browse, search, click, etc.) generated each time the target user accesses the application client.
• the behavioral interaction data here may specifically include the types of videos the target user visits, the time spent viewing videos, the number of times videos are viewed, records of searching for videos, the number of times videos are clicked, as well as the target user's favorited videos, recommended videos, liked videos, purchased videos, coin-tipped videos, etc.
  • FIG. 7 is a schematic diagram of a scene for acquiring a video template provided by an embodiment of the present application.
  • the log management system 70 shown in FIG. 7 may specifically include multiple databases, and the multiple databases may specifically include databases 70a, 70b, . . . , and databases 70n shown in FIG. 7 .
  • This means that the log management system 70 can be used to store the behavior logs of different users in the application client.
  • database 70a may be used to store the behavior log of user Y1 (not shown in the figure)
  • database 70b may be used to store the behavior log of user Y2 (not shown in the figure)
• ..., and the database 70n may be used to store the behavior log of user Yn (not shown in the figure).
• the server can obtain the behavior log table of the target user within the target duration from the database 70a, and can further obtain the behavior data information in the behavior log table. It should be understood that after acquiring the behavior data information of the target user, the server may perform user portrait analysis on the behavior data information within the target duration, so as to obtain a user portrait used to characterize the target user.
  • the user portrait here may include the target user's liking for a certain video type, and the server may further select a video template of this video type as the video template associated with the target user.
  • the user portrait here may include the target user's liking for a certain video, and the server may then select a video template corresponding to this video as a video template associated with the target user.
  • the template data corresponding to the video template here may be data of the same video type as the video data of the target video.
• for example, if the target video is an animation, a video template associated with the target video may be selected from the animation-category video templates;
• if the target video is a live-action drama, a video template associated with the target video may be selected from the live-action-drama-category video templates. In this way, a suitable video template can be selected for the target video, and the display effect of the video material clips can be improved.
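• A minimal sketch of this template-selection idea follows; the field names (`video_type`, `watch_seconds`, `template_id`) and the fallback rule are assumptions made for illustration, not part of the disclosed scheme.

```python
from collections import Counter

def pick_video_templates(behavior_log, template_database, target_video_type):
    """behavior_log: list of dicts such as {"video_type": "animation", "watch_seconds": 1200}.
    template_database: list of dicts such as {"template_id": "t1", "video_type": "animation"}.
    Prefer templates of the user's most-watched video type; fall back to the
    target video's own type so the clip style stays consistent with the source."""
    watch_time = Counter()
    for record in behavior_log:
        watch_time[record["video_type"]] += record.get("watch_seconds", 0)
    preferred_type = watch_time.most_common(1)[0][0] if watch_time else target_video_type

    candidates = [t for t in template_database if t["video_type"] == preferred_type]
    if not candidates:  # no template of the preferred type: match the target video's type instead
        candidates = [t for t in template_database if t["video_type"] == target_video_type]
    return candidates

templates = [{"template_id": "t1", "video_type": "animation"},
             {"template_id": "t2", "video_type": "live-action drama"}]
log = [{"video_type": "animation", "watch_seconds": 900},
       {"video_type": "live-action drama", "watch_seconds": 300}]
print(pick_video_templates(log, templates, target_video_type="animation"))  # -> the animation template
```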
• the log management system 70 shown in FIG. 7 can establish a behavior log table for the target user accessing the application client within a single behavior recording period (for example, with days as the unit of the behavior recording period).
  • the log management system 70 may create a behavior log table for the target user when it is detected that the target user accesses the application client for the first time on that day.
• the access timestamp (for example, time T1) at which the application client is currently accessed is recorded in the behavior log table. This means that there is no other behavior interaction data before the current time T1 in this behavior log table.
• the log management system 70 may add the behavior log table (for example, behavior log table 1) established for the target user to the corresponding database (for example, the database 70a shown in FIG. 7) for storage.
• when the access timestamp of the target user is at another time (for example, time T2), the log management system 70 may add the behavior log table corresponding to time T2 (for example, behavior log table 2) to the corresponding database (for example, the database 70a shown in FIG. 7) for storage.
• the log management system 70 may record, in the behavior log table of the corresponding recording period, the interactions between the target user and the application client.
• the target duration here may specifically include one or more recording periods. Therefore, the behavior log table of the target user obtained by the server within the above-mentioned target duration (i.e., multiple recording periods before the current access to the application client) may specifically include the above-mentioned behavior log table 1 and behavior log table 2.
  • Step S103 based on the template attribute tag of the at least one template segment and the segment attribute tags corresponding to the multiple video segments, screen at least one video segment matching the template attribute tag of the at least one template segment from the multiple video segments.
  • Step S104 according to the position of the template attribute label of each template segment in the at least one template segment in the template label sequence, splicing at least one matched video segment as a video material segment of the target video.
  • the server may, based on at least one template segment and the template tag sequence, filter video segments that satisfy the segment matching condition from among multiple video segments, and use the video segments that satisfy the segment matching condition as the video material segments of the target video .
• the server may take each of the N template segments in turn as the target template segment, determine the queue position of the target template segment (the position or order of the target template segment in the queue formed by the N template segments) in the template tag sequence as the target queue position, and determine the template attribute tag corresponding to the target queue position as the target template attribute tag.
  • the number of template fragments may be N, where N may be a positive integer greater than 1. Therefore, a template tag sequence may contain N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template fragment.
• the server may filter, among the segment attribute tags corresponding to the multiple video segments, the segment attribute tags that match the target template attribute tag, and determine the one or more video segments corresponding to the filtered segment attribute tags as candidate video segments.
• the server may perform similarity analysis between each of the candidate video segments and the target template segment to obtain the similarity between each candidate video segment and the target template segment.
• the maximum similarity is then determined among these similarities, and the candidate video segment corresponding to the maximum similarity is determined as the target candidate video segment matching the target template segment.
• the server may, based on the target queue position of the target template segment in the template tag sequence, determine the target tag sequence formed by the segment attribute tags corresponding to the target candidate video segments, and perform splicing processing on all target candidate video segments associated with the target tag sequence to obtain the video material segment; that is, the video material segment that satisfies the segment matching condition is determined according to all target candidate video segments associated with the target tag sequence.
• the target tag sequence formed by the segment attribute tags of the video material segment is the same as the template tag sequence.
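• The screening logic of steps S103–S104 can be sketched as below; the dictionary-based segment representation and the `similarity` callback are assumptions for illustration (the similarity computation itself is described next).

```python
def match_segments_to_template(template_segments, video_segments, similarity):
    """template_segments: list of dicts {"id": ..., "tag": ...}, ordered as in the template tag sequence.
    video_segments:    list of dicts {"id": ..., "tag": ...} carrying their segment attribute tags.
    similarity:        callable(video_segment, template_segment) -> float.
    Returns the target candidate video segments in template-tag-sequence order."""
    ordered_matches = []
    for target_template in template_segments:   # the index plays the role of the target queue position
        candidates = [v for v in video_segments if v["tag"] == target_template["tag"]]
        if not candidates:
            continue  # no video segment carries a matching attribute tag for this position
        best = max(candidates, key=lambda v: similarity(v, target_template))  # maximum similarity wins
        ordered_matches.append(best)
    return ordered_matches
```
• Splicing the returned segments in order then yields a video material segment whose target tag sequence matches the template tag sequence.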
  • the similarity analysis can represent the scene similarity between the candidate video segment and the target template segment.
• by inputting the candidate video segment into the third network model, the candidate feature vector corresponding to the candidate video segment can be obtained; likewise, by inputting the target template segment into the third network model, the target feature vector corresponding to the target template segment can be obtained.
• the similarity between the candidate video segment and the target template segment (i.e., the above-mentioned similarity value) can then be computed from these two feature vectors; since the third network model is a scene recognition model, the similarity here may represent the scene similarity.
  • the similarity analysis can also represent the similarity between the candidate video segment and the target template segment in the near and far view, and the similarity analysis can also represent the character similarity between the candidate video segment and the target template segment.
• for example, the target template segment can be input into the third network model to obtain the target feature vector of the target template segment;
• suppose there are two candidate video segments, candidate video segment 1 and candidate video segment 2; inputting these two candidate video segments into the third network model yields candidate feature vector 1 of candidate video segment 1 and candidate feature vector 2 of candidate video segment 2.
• if the similarity between candidate feature vector 2 and the target feature vector is the larger of the two, the candidate video segment 2 corresponding to candidate feature vector 2 can be used as the target candidate video segment matching the target template segment.
  • the similarity analysis may also represent the duration relationship between the candidate video segment and the target template segment, and the present application does not specifically limit the calculation method of the similarity analysis.
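• One possible way to realize this similarity analysis is sketched below: a cosine measure over the feature vectors from the scene recognition model, optionally blended with a duration-ratio term; the 0.2 weight and the blending itself are illustrative assumptions rather than requirements of this application.

```python
import numpy as np

def scene_similarity(candidate_vec: np.ndarray, target_vec: np.ndarray) -> float:
    # Cosine similarity between the feature vectors produced by the scene recognition model.
    return float(np.dot(candidate_vec, target_vec) /
                 (np.linalg.norm(candidate_vec) * np.linalg.norm(target_vec) + 1e-12))

def combined_similarity(candidate_vec, target_vec,
                        candidate_duration, target_duration,
                        duration_weight=0.2):
    """Blend scene similarity with a duration-ratio term so that candidates whose
    length is close to that of the template segment are preferred."""
    duration_score = min(candidate_duration, target_duration) / max(candidate_duration, target_duration)
    return (1 - duration_weight) * scene_similarity(candidate_vec, target_vec) \
           + duration_weight * duration_score
```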
  • FIG. 8A is a schematic diagram of a scene of performing video analysis on a template video provided by an embodiment of the present application
• FIG. 8B is a schematic diagram of a scene of performing video analysis on a target video provided by an embodiment of the present application.
• the N template segments shown in FIG. 8A can be obtained by performing video analysis on the template video, where N can be a positive integer greater than 1; taking N=4 as an example, the 4 template segments may include: template segment 80a, template segment 80b, template segment 80c, and template segment 80d.
• the template attribute label corresponding to template segment 80a is {far view},
• the template attribute label corresponding to template segment 80b is {character close-up},
• the template attribute label corresponding to template segment 80c is {person close-up},
• and the template attribute label corresponding to template segment 80d is {object close-up}.
• the M video segments shown in FIG. 8B can be obtained, where M may be a positive integer greater than 1; taking M=8 as an example, the 8 video segments may include: video segment 800a, video segment 800b, video segment 800c, video segment 800d, video segment 800e, video segment 800f, video segment 800g, and video segment 800h.
• the segment attribute label corresponding to video segment 800a is {far view},
• the segment attribute label corresponding to video segment 800b is {person close-up},
• the segment attribute label corresponding to video segment 800c is {far view},
• the segment attribute label corresponding to video segment 800d is {person close-up},
• the segment attribute label corresponding to video segment 800e is {person close-up},
• the segment attribute label corresponding to video segment 800f is {far view},
• the segment attribute label corresponding to video segment 800g is {object close-up},
• and the segment attribute label corresponding to video segment 800h is {character close-up}.
• when a template segment is taken as the target template segment (for example, target template segment 1), the queue position of target template segment 1 may be position 1 (i.e., the target queue position is position 1), and the template attribute label of the target template segment may be {far view} (that is, the target template attribute label is {far view}).
• among the 8 video segments shown in FIG. 8B, the segment attribute label that matches the target template attribute label is {far view}; the video segments corresponding to {far view} are video segment 800a, video segment 800c, and video segment 800f, so the candidate video segments corresponding to target template segment 1 are video segment 800a, video segment 800c, and video segment 800f.
• assuming that, among these candidates, video segment 800a has the maximum similarity with target template segment 1, video segment 800a is determined as the target candidate video segment (for example, target candidate video segment 1) that matches target template segment 1.
• similarly, when the next template segment is taken as the target template segment (for example, target template segment 2), the queue position of target template segment 2 may be position 2 (that is, the target queue position is position 2), and the template attribute label of the target template segment may be {character close-up} (that is, the target template attribute label is {character close-up}).
• among the 8 video segments shown in FIG. 8B, the segment attribute label that matches the target template attribute label is {character close-up}, and the video segment corresponding to {character close-up} is video segment 800h,
• so video segment 800h is determined as the target candidate video segment (for example, target candidate video segment 2) that matches target template segment 2.
• likewise, when target template segment 3 is taken as the target template segment, its queue position may be position 3 (that is, the target queue position is position 3), and the template attribute label of the target template segment may be {person close-up} (that is, the target template attribute label is {person close-up}).
• the segment attribute label matching the target template attribute label is {person close-up}; the video segments corresponding to {person close-up} are video segment 800d and video segment 800e, so the candidate video segments corresponding to target template segment 3 are video segment 800d and video segment 800e.
• assuming that video segment 800e has the maximum similarity with target template segment 3 among these candidates, video segment 800e is determined as the target candidate video segment (for example, target candidate video segment 3) that matches target template segment 3.
• finally, when target template segment 4 is taken as the target template segment, its queue position may be position 4 (that is, the target queue position is position 4), and the template attribute label of the target template segment may be {object close-up} (that is, the target template attribute label is {object close-up}).
• the segment attribute label matching the target template attribute label is {object close-up}, and the video segment corresponding to {object close-up} is video segment 800g,
• so video segment 800g is determined as the target candidate video segment (for example, target candidate video segment 4) that matches target template segment 4.
  • target candidate video clip 1 corresponding to position 1 is video clip 800a
  • target candidate video clip 2 corresponding to position 2 is video clip 800h
  • target candidate video clip 3 corresponding to position 3 is video clip 800e
• and the target candidate video clip 4 corresponding to position 4 is video clip 800g.
  • the video material clip can be determined from the video clip 800a, the video clip 800h, the video clip 800e and the video clip 800g based on the position 1, the position 2, the position 3 and the position 4.
• the template tag sequence is a sequence composed of the template attribute tags corresponding to the template segments, and the template tag sequence here can be expressed as {far view, character close-up, person close-up, object close-up}; the target tag sequence is the sequence composed of the segment attribute tags of the video segments that match the template segments.
  • the target template segment 1 may have a similar video playback effect with the target candidate video segment 1
  • the target template segment 2 may have a similar video playback effect with the target candidate video segment 2
• the target template segment 3 may have a similar video playback effect to the target candidate video segment 3,
• and the target template segment 4 may have a similar video playback effect to the target candidate video segment 4. Therefore, the video material segment may have a video playback effect similar to that of the template segments.
  • the server may perform video splicing processing on all target candidate video segments associated with the target tag sequence to obtain spliced video data associated with the N template segments. Further, the server may acquire template audio data associated with the N template segments, and perform audio and video merging processing on the template audio data and the spliced video data through the audio-video synthesis component to obtain video material segments that satisfy the segment matching conditions.
  • the tool for performing video splicing processing on each target candidate video segment and performing audio and video merging processing on template audio data and spliced video data may be the same tool, and this tool may be the above-mentioned audio and video synthesis component.
  • the audio and video synthesis component here can be the ffmpeg tool or other third-party software tools with video decapsulation capability.
  • the video decapsulation components will not be exemplified one by one here.
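• As a concrete illustration of the splicing and audio-video merging described above, the sketch below drives the ffmpeg tool from Python; the file names are hypothetical, and the exact ffmpeg invocation is one workable choice rather than the one mandated by this application.

```python
import os
import subprocess
import tempfile

def splice_and_merge(clip_paths, template_audio_path, output_path):
    """Concatenate the target candidate video clips in target-tag-sequence order,
    then mux the template audio data over the spliced video."""
    # 1) Write the file list expected by ffmpeg's concat demuxer.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{os.path.abspath(path)}'\n")
        concat_list = f.name

    spliced = output_path + ".spliced.mp4"
    # 2) Video splicing: concatenate the clips without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", concat_list, "-c", "copy", spliced], check=True)
    # 3) Audio/video merging: keep the spliced video stream, take audio from the
    #    template audio data, and stop at the shorter of the two streams.
    subprocess.run(["ffmpeg", "-y", "-i", spliced, "-i", template_audio_path,
                    "-map", "0:v:0", "-map", "1:a:0", "-c:v", "copy", "-shortest",
                    output_path], check=True)
    os.remove(concat_list)
    os.remove(spliced)

# splice_and_merge(["clip_800a.mp4", "clip_800h.mp4", "clip_800e.mp4", "clip_800g.mp4"],
#                  "template_music.mp3", "video_material_clip.mp4")
```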
  • Step S105 Push the video data and the video material clips to the application client corresponding to the target user, so that the application client outputs the video data and the video material clips.
  • the application client can play the video data and the video material clips in the application display interface.
• when the application client is playing the video data, the application client can also be used to display a thumbnail of each video material clip.
  • the specific implementation form for outputting video material clips by the application client will not be limited here.
  • the server when acquiring the video data of a certain video requested by the target user, the server may perform video analysis on the video data to obtain one or more video segments of the video data.
  • the video analysis involved in the embodiments of the present application mainly includes: video mirroring and attribute analysis.
• video mirroring mainly means that the video data can be divided into one or more storyboard segments, so that the server can further perform attribute analysis on the segment content of each storyboard segment to obtain the segment attribute tag of each storyboard segment,
• so that the storyboard segments carrying segment attribute tags are collectively referred to as the aforementioned video segments; it should be understood that one video segment may correspond to one segment attribute tag.
• the server can quickly determine the video template (for example, a popular short video template) associated with the target user according to the user portrait, and then obtain the template segments mapped by the video template and their template tag sequence;
• based on the template segments and the template tag sequence, the video segments that satisfy the segment matching condition are intelligently screened from among the video segments, so that the screened video segments that satisfy the segment matching condition can be used as the video material segments of the target video.
  • the target tag sequence formed by the segment attribute tags of the video material segment here may be the same as the template tag sequence, so as to ensure that the video material segment and the template segment have the same video playback effect.
  • the server can intelligently push the above-mentioned video data and video material clips to the application client corresponding to the target user, so that the application client can output the video data and video material clips.
  • one or more video clips carrying clip attribute tags can be quickly obtained through video analysis (eg, video mirroring and attribute analysis, etc.).
  • these video clips can be screened according to the template tag sequence of these video templates, so as to quickly obtain the video clips that have the same characteristics as the video template.
• video segments with video playback effects similar to those of the video template can be quickly synthesized to obtain video material clips (for example, short videos that can be pushed to target users can be quickly obtained);
• as video templates are added and updated, the storyboard and attribute information of these video segments can be reused, which reduces the identification and processing of video frames in the target video, improves the generation efficiency of short videos, saves the computing cost of continuously generating and distributing a large number of short videos for different users, and saves the computing resources of the server.
  • FIG. 9 is a schematic flowchart of a video data processing method provided by an embodiment of the present application.
  • the method may be jointly executed by an application client and a server, the application client may be the application client running in the user terminal X in the embodiment corresponding to FIG. 2 above, and the server may be the above-mentioned FIG. 2 The server in the corresponding embodiment.
  • the method may include the following steps:
  • Step S201 the application client can respond to the playback operation performed by the target user for the target video, generate a video playback request for requesting playback of the target video, and send the video playback request to the server;
  • the video playback request may carry a video identifier of the target video, where the video identifier is used to instruct the server to obtain the video data of the target video requested by the target user to be played.
  • the playback operation may include contact operations such as clicking, long pressing, and sliding, and may also include non-contact operations such as voice and gesture, which are not limited in this application.
  • Step S202 the server obtains the video data of the target video requested by the target user, and performs video analysis on the video data to obtain a plurality of video clips, wherein the video analysis includes mirroring processing and attribute analysis based on a plurality of preset clip attribute tags , each video clip in the plurality of video clips corresponds to a clip attribute label and a mirroring clip;
• Step S203 the server determines the video template associated with the target user from the video template database based on the user portrait of the target user, and obtains at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence is composed of the template attribute tags of the at least one template segment;
  • Step S204 the server screens at least one video clip that matches the template attribute label of the at least one template clip from the multiple video clips based on the template attribute label of the at least one template clip and the clip attribute labels corresponding to the multiple video clips;
  • Step S205 the server splices the matching at least one video clip according to the position of the template attribute tag of each template fragment in the at least one template fragment in the template tag sequence, as the video material fragment of the target video;
  • Step S206 the server pushes the video data and the video material clips to the application client corresponding to the target user;
  • FIG. 10 is a schematic flowchart of a generation of video material clips provided by an embodiment of the present application.
• when acquiring the wonderful short video (i.e., the template video), the server can perform video analysis on it to obtain one or more video segments of the wonderful short video, and the one or more video segments of the wonderful short video can then be used as template segments.
  • the video analysis involved in the embodiments of the present application mainly includes: video mirroring and attribute analysis.
  • the video mirroring mainly refers to that the video data of the wonderful short video can be divided into one or more mirroring segments.
• the server can further perform attribute analysis on the segment content of each storyboard segment (that is, extract the storyboard information) to obtain the template attribute tag of each storyboard segment (that is, the scene tag, role tag and shot tag shown in FIG. 10), so that the storyboard segments carrying template attribute tags are collectively referred to as the aforementioned template segments, and popular highlight sequences (i.e., shot sequence records) can then be determined based on the template attribute tags.
  • one template fragment may correspond to one template attribute tag.
• the popular highlight sequence 1 in the highlight sequence library shown in FIG. 10 may be the template attribute tag corresponding to template segment 1,
• the popular highlight sequence 2 may be the template attribute tag corresponding to template segment 2,
• and the popular highlight sequence 3 may be the template attribute tag corresponding to template segment 3.
• in this way, the template segments of a template video (i.e., the above-mentioned wonderful short video), the template tag sequence of the template segments, and the template audio data (i.e., the music) can be obtained.
  • the server when the server obtains the TV series (ie, the target video), the server may perform video mirroring and attribute analysis on the TV series to obtain one or more video segments of the TV series. It should be understood that one video clip may correspond to one clip attribute tag.
• the server can obtain one or more popular highlight sequences (i.e., sequence samples) from the highlight sequence library, determine the template segments and the template tag sequence corresponding to the template segments according to the selected popular highlight sequence, and then perform screening and sorting on the video segments of the target video to obtain the screened video segments (i.e., the storyboard sequence based on material matching);
• from the spliced video data composed of these screened video segments and the template audio data of the template segments, the matched video material clips are intelligently generated.
  • the accumulation of video templates for consecutive days can be realized.
  • One or more video clips with corresponding styles are generated from the TV series according to the video template, which can enrich the styles of the finally generated video clips.
  • a TV series can generate various styles of video clips based on multiple video templates, which can be recommended by thousands of people in the video recommendation scene.
  • Video analysis and video matching of short videos and TV series can achieve the goal of automatic analysis.
  • only limited migration capability is required to complete the analysis of the TV series, so that the difficulty of generating video material clips of the new TV series will be low, and the method of generating video material clips will be highly transferable.
  • Step S207 the application client outputs video data and video material clips in the application display interface.
  • the application client can receive the video data of the target video returned by the server based on the video playback request, and the video material clips associated with the target video, and can determine the video for playing the video data in the application display interface of the application client Playing interface, and then can play video data in the video playing interface. Further, the application client may respond to the triggering operation on the application display interface, and play the corresponding video material clip in the application display interface of the application client.
  • the triggering operation may include contact operations such as clicking, long pressing, and sliding, and may also include non-contact operations such as voice and gesture, which are not limited in this application.
• the application client can also display the thumbnail of each video material clip in the application display interface, or dynamically play the animation of each video material clip in the application display interface;
• the specific display form of these video material clips will not be limited here.
  • FIG. 11 is a schematic flowchart of a front-end and back-end interaction provided by an embodiment of the present application.
  • the above application client can run on the front end B shown in FIG. 11 .
  • the play operation performed by the target user on the target video (for example, the video of interest to the target user) in the application client of the front end B is to input the target video for the front end B.
• the server (i.e., the back end) can return the video data of the target video and one or more video material clips associated with the target video (for example, video footage of the video, etc.) to the front end B, that is, the video data and video material clips returned by the server are displayed in the application display interface of the front end B.
  • the video template here may be determined by the server based on the user portrait of the target user.
  • the front end A may be another user terminal corresponding to the video editor.
• the video editor can select one or more video clips from the video clips obtained by video analysis as template clips, and can then determine the video template based on these template clips (i.e., mining wonderful video templates).
  • the front-end A may receive the input of the wonderful short video, and then upload the video template (ie, the wonderful video template) corresponding to the wonderful short video to the server for saving (ie, the back-end storage).
  • front-end B and the front-end A may also be the same user terminal, that is, the front-end B (or the front-end A) may be the input side of the wonderful short video or the input side of the target video.
  • FIG. 12A is a schematic diagram of a scene of outputting a video material clip provided by an embodiment of the present application.
  • the application display interface 120a here may be the application display interface in the embodiment corresponding to FIG. 2 above.
  • the application display interface 120a may include a video playing interface 1 for playing the target video, and may also include a short video recommendation list (eg, short video recommendation list 1) for displaying or playing video clips.
  • the short video recommendation list 1 may at least include video material segments associated with the target video.
  • the video material segment here may be the video material segment associated with the target video in the above-mentioned first service database.
• the application client can display or play the video material clips in the above-mentioned short video recommendation list 1 in the highlight recommendation part of the application display interface 120b.
• when the application client plays the target video in the video playback interface 1, the application client can also traverse and play (or simultaneously play) the video material clips in the short video recommendation list 1.
  • the video recommendation list 1 may specifically include N video material segments associated with the target video.
  • the N video material segments here may specifically be the three video material segments shown in FIG. 12A .
  • the three video material segments may specifically include: video material segment A1, video material segment A2, and video material segment A3.
• the application client can display or play, in the highlight recommendation part of the application display interface 120b, the video material clips in the above-mentioned short video recommendation list 1, for example, the video material clip A1, the video material clip A2, and the video material clip A3 shown in the application display interface 120b.
  • FIG. 12B is a schematic diagram of a scene for updating a video material segment provided by an embodiment of the present application.
• the server may return the video data (e.g., video data J) of this video material clip A1, together with one or more video material clips associated with this video data J (e.g., video material clip C1, video material clip C2, and video material clip C3), to the application client, so that this video data J can be played in the application client.
  • the application client can also display the received video material segments when playing the video data J of the video material segment A1 to obtain the application display interface 120c.
  • the application display interface 120c here may include a video playing interface 2 for playing video data J, and may also include a short video recommendation list (eg, short video recommendation list 2) for displaying video material clips.
  • the short video recommendation list 2 may at least include video material segments associated with the video data J.
• the application client can display or play, in the highlight recommendation part of the application display interface 120d, the video material clips in the above-mentioned short video recommendation list 2.
• the video material clips here may be video material clips that share the same video template as the video material clip A1 in the above-mentioned second service database.
• as shown in FIG. 12B, the short video recommendation list 2 may specifically include M video material clips associated with the video data J.
  • the M video material clips here may specifically be the three video material clips shown in FIG. 12B .
  • the three video material segments may specifically include: video material segment C1, video material segment C2, and video material segment C3.
• the application client can display or play, in the highlight recommendation part of the application display interface 120d, the video material clips in the above-mentioned short video recommendation list 2, for example, the video material clip C1, the video material clip C2, and the video material clip C3 in the application display interface 120d.
  • the application client can also intelligently traverse and play these video material clips in the above-mentioned short video recommendation list 2 for the above-mentioned target user.
• the server may also output the video material clip C1, among the multiple video material clips in the short video recommendation list 2, to the application client, so that the application client implements intelligent playback of the video material clip C1.
• the application client may also record the current playback progress (for example, time T) of the target video when the video data played in the video playback interface 1 of the application client is updated to the video material clip A1, so that after the video material clip A1 finishes playing, the target video continues to be played from time T of the target video.
• the application client can dynamically adjust, in real time and according to the current playback progress of the target video, the positions of the video material clips in the short video recommendation list, so as to recommend video material clips in different orders for the target user. For example, if all the video segments that make up a video material clip fall before the current playback progress, that is, all the video segments that make up the video material clip have already been watched at the current moment, the video material clip can be ranked at the front of the short video recommendation list, thereby realizing a review of the watched plot.
• the application client may further sort the video material clips in the short video recommendation list according to the number of times each video material clip is currently played on the application clients of other user terminals;
• for example, a video material clip with a higher play count can be preferentially recommended to the target user, that is, arranged at the front of the short video recommendation list.
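• The two ordering rules just described (watched-plot clips first, then higher play counts) could be combined as in the sketch below; the field names `latest_source_timestamp` and `play_count` are hypothetical and used only for illustration.

```python
def rank_recommendations(material_clips, current_progress_seconds):
    """material_clips: list of dicts such as
       {"id": "A1", "latest_source_timestamp": 600, "play_count": 1200},
    where `latest_source_timestamp` is the end time (in the target video) of the last
    source segment the material clip was cut from. Clips whose plot has already been
    watched come first; within each group, clips with more plays come first."""
    def key(clip):
        already_watched = clip["latest_source_timestamp"] <= current_progress_seconds
        return (0 if already_watched else 1, -clip["play_count"])
    return sorted(material_clips, key=key)

clips = [{"id": "A1", "latest_source_timestamp": 600,  "play_count": 1200},
         {"id": "A2", "latest_source_timestamp": 2400, "play_count": 5000},
         {"id": "A3", "latest_source_timestamp": 300,  "play_count": 800}]
print([c["id"] for c in rank_recommendations(clips, current_progress_seconds=900)])
# -> ['A1', 'A3', 'A2']  (watched-plot clips first, higher play count within each group)
```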
  • one or more video clips carrying clip attribute tags can be quickly obtained.
• when one or more video templates are accurately determined according to the user portrait, these video segments can be intelligently screened according to the template tag sequences of these video templates, so as to quickly obtain video segments whose video playback effects are similar to those of the video templates; video material clips can thus be quickly synthesized (for example, short videos that can be displayed to the target user can be quickly obtained), and as video templates are added and updated, these video segments can be reused, which can reduce the identification and processing of video frames in the target video, improve the generation efficiency of short videos, save the computing cost of continuously generating and distributing a large number of short videos for different users, and save the computing resources of the server.
  • FIG. 13 is a schematic structural diagram of a video data processing apparatus provided by an embodiment of the present application.
  • the video data processing apparatus 1 may include: a segment generating module 30 , a template obtaining module 40 , a material determining module 50 , and a data sending module 60 ; further, the video data processing apparatus 1 may further include: a request receiving module 10 , and a data searching module 20 .
• the segment generation module 30 is used for acquiring the video data of the target video requested by the target user, and performing video analysis on the video data to obtain multiple video segments, wherein the video analysis includes mirroring processing and attribute analysis based on multiple preset segment attribute tags, and each video segment in the multiple video segments corresponds to one segment attribute tag and one storyboard segment.
  • the segment generation module 30 includes: a model acquisition unit 301 , a mirror acquisition unit 302 , a label determination unit 303 , and a segment determination unit 304 .
  • a model obtaining unit 301 configured to obtain the video data of the target video requested by the target user and the network identification model associated with the video data;
  • a mirroring acquisition unit 302 configured to perform mirroring processing on a video sequence corresponding to the video data through a video segmentation component to obtain a plurality of mirroring segments associated with the video sequence;
  • the mirror acquisition unit 302 includes: a component acquisition subunit 3021, an image matching subunit 3022, a mirror creation subunit 3023, a matching completion subunit 3024, and a mirror determination subunit 3025;
  • the component acquisition subunit 3021 is used to determine the first video frame used as the cluster centroid in the video sequence by the video segmentation component, and create the mirror cluster information of the mirror cluster to which the first video frame belongs;
  • the image matching subunit 3022 is configured to determine a video frame other than the first video frame as a second video frame in the video sequence, obtain each second video frame in the second video frame in turn based on a polling mechanism, and determine the image similarity between each second video frame and the first video frame;
• the mirror creation subunit 3023 is used to, if the image similarity between the first video frame and a second video frame is greater than or equal to the clustering threshold, divide the second video frame whose image similarity is greater than or equal to the clustering threshold into the mirroring cluster to which the first video frame belongs;
• the matching completion subunit 3024 is used to, if the image similarity between the first video frame and a second video frame is less than the clustering threshold, update the first video frame with that second video frame, create the mirroring cluster information of the mirroring cluster to which the updated first video frame belongs, and perform image similarity matching between the updated first video frame and the as-yet-unmatched second video frames in turn, until all video frames in the video sequence have completed image similarity matching, so as to obtain the mirroring cluster information of the mirroring clusters to which the video frames in the video sequence belong (a sketch of this clustering procedure follows the subunit descriptions below);
  • the mirroring determination subunit 3025 is configured to form a plurality of mirroring segments from the video frames in the video sequence based on the mirroring cluster information of the mirroring cluster to which the video frames in the video sequence belong.
• for the specific implementation of the component acquisition subunit 3021, the image matching subunit 3022, the mirror creation subunit 3023, the matching completion subunit 3024, and the mirror determination subunit 3025, reference may be made to the description of step S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
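• For illustration, the clustering procedure performed by these subunits can be sketched as a single greedy pass over the frames; the histogram-intersection similarity and the 0.85 threshold are stand-ins for whatever image similarity measure and clustering threshold the video segmentation component actually uses.

```python
import numpy as np

def segment_by_clustering(frames, image_similarity, clustering_threshold=0.85):
    """Greedy shot segmentation: the current cluster centroid is the `first video frame`;
    each following (`second`) video frame either joins that mirroring cluster
    (similarity >= threshold) or becomes the centroid of a new cluster.
    Returns the mirroring cluster information as lists of frame indices."""
    if not frames:
        return []
    clusters = [[0]]
    centroid = frames[0]                        # first video frame used as the cluster centroid
    for idx in range(1, len(frames)):           # remaining frames polled in turn
        if image_similarity(centroid, frames[idx]) >= clustering_threshold:
            clusters[-1].append(idx)            # same storyboard segment
        else:
            clusters.append([idx])              # new mirroring cluster
            centroid = frames[idx]              # update the first video frame
    return clusters

def histogram_similarity(a, b):
    # Simple illustrative image similarity: intersection of normalized grayscale histograms.
    ha, _ = np.histogram(a, bins=32, range=(0, 255))
    hb, _ = np.histogram(b, bins=32, range=(0, 255))
    ha = ha / max(ha.sum(), 1)
    hb = hb / max(hb.sum(), 1)
    return float(np.minimum(ha, hb).sum())      # 1.0 means identical histograms

dark, bright = np.full((48, 64), 30), np.full((48, 64), 200)
print(segment_by_clustering([dark, dark, bright, bright], histogram_similarity))  # -> [[0, 1], [2, 3]]
```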
  • the label determination unit 303 is configured to input a plurality of mirror segments into the network identification model, and the network recognition model performs attribute analysis on the plurality of mirror segments based on the plurality of preset segment attribute labels to obtain a plurality of mirror segments The fragment attribute tag corresponding to the fragment.
  • the network identification model includes at least: a first network model with a first attribute label extraction function, a second network model with a second attribute label extraction function, and a third network model with a third attribute label extraction function.
  • the label determination unit 303 includes: a first analysis subunit 3031, a face detection subunit 3032, a second analysis subunit 3033, a third analysis subunit 3034, and a label analysis subunit 3035;
• the first analysis subunit 3031 is used to input the multiple storyboard segments into the first network model, and perform far-and-near-view analysis on each of the multiple storyboard segments through the first network model to obtain the far-and-near-view labels of the multiple storyboard segments, use the far-and-near-view labels as the first attribute labels output by the first network model, and use the storyboard segments carrying the first attribute labels as the first-type storyboard segments;
  • the face detection subunit 3032 is used for inputting the first type of mirrored fragments into the second network model, and the second network model performs face detection on each mirrored fragment in the first type of mirrored fragments to obtain a face detection result;
• the second analysis subunit 3033 is configured to, if the face detection result indicates that the face of the target character exists in a first-type storyboard segment, take the storyboard segment in which the face of the target character exists as a second-type storyboard segment, determine, through the second network model, the role label to which the target character in the second-type storyboard segment belongs, and determine the role label to which the target character belongs as the second attribute label of the second-type storyboard segment;
  • the target character is one or more characters in the target video;
• the third analysis subunit 3034 is configured to determine the storyboard segments in the first-type storyboard segments other than the second-type storyboard segments as the third-type storyboard segments, input the third-type storyboard segments into the third network model, and perform scene detection on each of the third-type storyboard segments through the third network model to obtain the third attribute labels of the third-type storyboard segments;
  • the label analysis subunit 3035 is configured to determine a plurality of labels according to the first attribute label of the first type of mirror clips, the second attribute label of the second type of mirror clips, and the third attribute label of the third type of mirror clips The clip attribute tag corresponding to each clip in the clip.
• for the specific implementation of the first analysis subunit 3031, the face detection subunit 3032, the second analysis subunit 3033, the third analysis subunit 3034, and the label analysis subunit 3035, reference may be made to the description of step S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the segment determining unit 304 is configured to determine the mirroring segment with the segment attribute tag as the video segment of the video data.
• for the specific implementation of the model obtaining unit 301, the mirror obtaining unit 302, the label determining unit 303 and the segment determining unit 304, reference may be made to the description of step S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
• the template obtaining module 40 is used to determine the video template associated with the target user from the video template database based on the user portrait of the target user, and obtain at least one template segment and a template tag sequence predetermined in the video template, where the template tag sequence consists of the template attribute tags of the at least one template segment.
  • the template acquisition module 40 includes: a behavior extraction unit 401, a behavior analysis unit 402, and a template analysis unit 403;
  • the behavior extraction unit 401 is used to obtain the behavior log table of the target user, and extract the behavior data information associated with the target user from the behavior log table;
  • the behavior analysis unit 402 is used for performing user portrait analysis on the behavior data information, obtaining a user portrait for characterizing the target user, and determining the video template associated with the target user from the video template database based on the user portrait of the target user;
  • the template analysis unit 403 is configured to obtain the at least one template segment and the template tag sequence predetermined in the video template.
• for the specific implementation of the behavior extraction unit 401, the behavior analysis unit 402 and the template analysis unit 403, reference may be made to the description of step S102 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
• the material determination module 50 is configured to, based on the template attribute tags of the at least one template segment and the segment attribute tags corresponding to the plurality of video segments, screen from the plurality of video segments at least one video segment that matches the template attribute tags of the at least one template segment, and splice the matched at least one video segment, according to the position of the template attribute tag of each template segment in the at least one template segment in the template tag sequence, as the video material segment of the target video.
  • the number of template fragments is N, where N is a positive integer greater than 1; the template tag sequence includes N sequence positions, one sequence position corresponds to one template attribute tag, and one template attribute tag corresponds to one template fragment.
  • the material determination module 50 includes: a tag determination unit 501, a tag screening unit 502, a segment matching unit 503, and a material generation unit 504;
  • the label determination unit 501 is configured to use N template fragments as target template fragments, determine the queue position of the target template fragment in the template label sequence as the target queue position, and determine the template attribute label corresponding to the target queue position as the target template attribute label ;
  • the tag screening unit 502 is configured to filter the segment attribute tags that match the target template attribute tags among the segment attribute tags corresponding to the multiple video segments, and determine one or more video segments corresponding to the filtered segment attribute tags as candidate video segment;
• the segment matching unit 503 is used to perform similarity analysis between each candidate video segment and the target template segment to obtain the similarity between each candidate video segment and the target template segment, determine the maximum similarity among these similarities, and determine the candidate video segment corresponding to the maximum similarity as the target candidate video segment that matches the target template segment;
• the material generation unit 504 is configured to determine, based on the target queue position of the target template segment in the template tag sequence, the target tag sequence formed by the segment attribute tags corresponding to the target candidate video segments, and perform splicing processing on all target candidate video segments associated with the target tag sequence to obtain the video material segments.
  • the material generating unit 504 includes: a video splicing sub-unit 5041 and a material synthesizing sub-unit 5042;
  • the video splicing subunit 5041 is used to perform video splicing processing on all target candidate video clips associated with the target tag sequence to obtain splicing video data associated with N template clips;
  • the material synthesis subunit 5042 is configured to obtain template audio data associated with the N template segments, and perform audio and video merging processing on the template audio data and the spliced video data through the audio and video synthesis component to obtain video material segments.
  • the data sending module 60 is configured to push the video data and the video material clips to the application client corresponding to the target user, so that the application client outputs the video data and the video material clips.
  • the request receiving module 10 is configured to receive a video playback request sent by the application client; the video playback request is generated by the application client in response to the playback operation performed by the target user on the target video;
  • the data search module 20 is used for extracting the video identification of the target video from the video playback request, searching for the service video data corresponding to the target video in the video service database based on the video identification, and using the found service video data as the target in the application client Video data for the video.
• the specific implementation of the segment generation module 30, the template acquisition module 40, the material determination module 50, and the data sending module 60 can be referred to the description of steps S101 to S105 in the embodiment corresponding to FIG. 3, and will not be repeated here.
• for the specific implementation of the request receiving module 10 and the data searching module 20, reference may be made to the description of step S201 and step S207 in the embodiment corresponding to FIG. 9, which will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • FIG. 14 is a schematic structural diagram of a video data processing apparatus provided by an embodiment of the present application.
  • the video data processing apparatus 2 may include: a data acquisition module 70 and a data output module 80;
  • the data acquisition module 70 is configured to respond to the playback operation performed by the target user on the target video in the application client, and acquire, from the server, the video data of the target video and the video material clips associated with the target video; the video material clips are obtained by the server by: performing video analysis on the video data to obtain a plurality of video segments, wherein the video analysis includes storyboard segmentation processing and attribute analysis based on a plurality of preset segment attribute tags, and each of the plurality of video segments corresponds to one segment attribute tag and one storyboard segment; determining, based on the user portrait of the target user, the video template associated with the target user from the video template database, and obtaining at least one predetermined template segment and a template tag sequence in the video template, the template tag sequence being composed of the template attribute tags of the at least one template segment; screening, from the plurality of video segments, at least one video segment matching the template attribute tags of the at least one template segment based on the template attribute tags of the at least one template segment and the segment attribute tags corresponding to the plurality of video segments; and splicing the matched at least one video segment according to the positions, in the template tag sequence, of the template attribute tags of the respective template segments in the at least one template segment.
  • the data acquisition module 70 includes: a request sending unit 701 and a data receiving unit 702;
  • the request sending unit 701 is used to respond to the playback operation performed by the target user on the target video in the application client, generate a video playback request for requesting playback of the target video, and send the video playback request to the server; the video playback request carries the video identification of the target video; the video identification is used to instruct the server to obtain the video data of the target video requested by the target user;
  • the data receiving unit 702 is configured to receive the video data returned by the server based on the video playback request, and the video material clips associated with the target video;
  • the video material clips are obtained by the server, when the video template is determined according to the user portrait of the target user, by performing video analysis and video matching on the video data according to the video template; the user portrait is determined by the user behavior information of the target user in the application client.
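  • as a non-authoritative sketch of units 701 and 702, a client might issue the playback request and collect the response as follows; the endpoint URL and the JSON field names are assumptions and are not defined by the disclosure:

```python
import json
from urllib import request as urlrequest

SERVER_URL = "https://example.com/video/play"   # placeholder endpoint, not from the disclosure

def request_playback(video_id: str) -> dict:
    """Unit 701 sends a video playback request carrying the video identification;
    unit 702 receives the video data and the associated video material clips."""
    body = json.dumps({"video_id": video_id}).encode("utf-8")
    req = urlrequest.Request(SERVER_URL, data=body,
                             headers={"Content-Type": "application/json"})
    with urlrequest.urlopen(req) as resp:       # blocking call, kept simple on purpose
        payload = json.loads(resp.read().decode("utf-8"))
    return {"video_data": payload["video_data"],
            "material_clips": payload.get("material_clips", [])}
```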
  • for the specific implementation of the request sending unit 701 and the data receiving unit 702, reference may be made to the description of step S201 in the embodiment corresponding to FIG. 9 above, which will not be repeated here.
  • the data output module 80 is configured to output video data and video material clips in the application display interface of the application client.
  • the data output module includes: a video playback unit 801 and a material output unit 802;
  • a video playback unit 801 configured to determine a video playback interface for playing video data in an application display interface of an application client, and play video data in the video playback interface;
  • the material output unit 802 is configured to play video material clips in the application display interface in response to a trigger operation on the application display interface.
  • for the specific implementation of the video playback unit 801 and the material output unit 802, reference may be made to the description of step S207 in the embodiment corresponding to FIG. 9 above, which will not be repeated here.
  • for the specific implementation of the data acquisition module 70 and the data output module 80, reference may be made to the description of step S201 and step S207 in the embodiment corresponding to FIG. 9, which will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 2000 may include: a processor 2001 , a network interface 2004 and a memory 2005 , in addition, the above-mentioned computer device 2000 may further include: a user interface 2003 and at least one communication bus 2002 .
  • the communication bus 2002 is used to realize the connection and communication between these components.
  • the user interface 2003 may also include standard wired and wireless interfaces.
  • the network interface 2004 may include a standard wired interface, a wireless interface (eg, a WI-FI interface).
  • the memory 2005 may be high-speed RAM memory or non-volatile memory, such as at least one disk memory.
  • the memory 2005 may also be at least one storage device located remotely from the aforementioned processor 2001 .
  • the memory 2005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • in the computer device 2000 shown in FIG. 15, the network interface 2004 can provide the network communication function; the user interface 2003 is mainly used to provide an input interface for the user; and the processor 2001 can be used to call the device control application program stored in the memory 2005.
  • the computer device 2000 described in the embodiments of this application may be a server or a user terminal, which will not be limited here. It can be understood that the computer device 2000 can be used to execute the description of the video data processing method in the foregoing embodiment corresponding to FIG. 3 or FIG. 9 , and details are not repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • the embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the video data processing apparatus 1 or the video data processing apparatus 2 mentioned above, and the computer program includes program instructions. When the processor executes the program instructions, it can execute the description of the video data processing method in the embodiment corresponding to FIG. 3 or FIG. 9, which will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • FIG. 16 is a video data processing system further provided by an embodiment of the present application.
  • the video data processing system 3 may include a server 3a and a user terminal 3b; the server 3a may be the video data processing apparatus 1 in the embodiment corresponding to FIG. 13, and the user terminal 3b may be the video data processing apparatus 2 in the embodiment corresponding to FIG. 14. It can be understood that the description of the beneficial effects of using the same method will not be repeated.
  • the embodiments of the present application further provide a computer program product or computer program
  • the computer program product or computer program may include computer instructions, and the computer instructions may be stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor can execute the computer instructions, so that the computer device executes the description of the video data processing method in the embodiment corresponding to FIG. 3 or FIG. 9, which will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.

Abstract

Embodiments of this application provide a video data processing method, apparatus, device, and medium. The method relates to the field of artificial intelligence and includes: acquiring video data of a target video requested by a target user, and performing video analysis on the video data to obtain a plurality of video segments; determining, based on a user portrait of the target user, a video template associated with the target user from a video template database, and obtaining at least one predetermined template segment in the video template and a template tag sequence; screening, from the plurality of video segments, at least one video segment matching the template attribute tags of the at least one template segment based on the template attribute tags of the at least one template segment and the segment attribute tags corresponding to the plurality of video segments; splicing the matched at least one video segment according to the positions of the template attribute tags of the respective template segments in the template tag sequence, to serve as a video material segment of the target video; and pushing the video data and the video material segment to an application client corresponding to the target user, so that the application client outputs the video data and the video material segment.

Description

一种视频数据处理方法、装置、设备以及介质
本申请要求2020年12月02日提交的申请号为202011390109.2、发明名称为“一种视频数据处理方法、装置、设备以及介质”的中国专利申请的优先权。
技术领域
本申请涉及计算机技术领域,尤其涉及一种视频数据处理方法、装置、设备以及介质。
背景技术
随着多媒体技术的发展,视频已成为人们日常生活中获取信息与享受娱乐的主要载体。因为各类视频播放平台的普及,衍生出了各式各样的短视频(即精彩视频集锦)。可以理解的是,这里的短视频是指在各类视频播放平台上播放的、适合在移动状态和短时休闲状态下观看的视频内容。
但是目前,在生成短视频的过程中,往往需要人工剪辑素材、人工合成视频、以及人工配乐、音频视频合成等。
发明内容
本申请实施例提供一种视频数据处理方法、装置、设备以及介质,通过对视频数据进行视频分析(例如,视频分镜和属性分析等),可以快速得到携带片段属性标签的一个或者多个视频片段,使用基于目标用户的用户画像确定的视频模板,进行目标视频的视频片段的属性标签匹配,生成目标视频的视频素材片段,可以随着视频模板的增加和更新,重复利用这些视频片段的分镜和属性信息,减少对目标视频中视频帧的识别和处理,提高短视频的生成效率,节省针对不同用户不断生成和分发大量短视频的计算成本,节省服务器的计算资源。
本申请实施例一方面提供了一种视频数据处理方法,包括:
获取目标用户请求的目标视频的视频数据,对视频数据进行视频分析得到多个视频片段,其中,视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;
基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由至少一个模板片段的模板属性标签构成;
基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;
按照所述至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为目标视频的视频素材片段;
将视频数据以及视频素材片段推送至目标用户对应的应用客户端,以使应用客户端输出视频数据以及视频素材片段。
本申请实施例一方面提供了一种视频数据处理装置,包括:
片段生成模块,用于获取目标用户请求的目标视频的视频数据,对视频数据进行视频分析得到多个视频片段,其中,视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;
模板获取模块,用于基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模 板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,模板标签序列由至少一个模板片段的模板属性标签构成;
素材确定模块,用于基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段,按照所述至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,,作为目标视频的视频素材片段;
数据发送模块,用于将视频数据以及视频素材片段推送至目标用户对应的应用客户端,以使应用客户端输出视频数据以及视频素材片段。
本申请实施例一方面提供了一种视频数据处理方法,包括:
响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取目标视频的视频数据,以及与目标视频相关联的视频素材片段;视频素材片段是由服务器对视频数据进行视频分析得到多个视频片段,其中,视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段(也就是说,每个视频片段即为对应于一个片段属性标签的一个分镜片段);基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,模板标签序列由至少一个模板片段的模板属性标签构成;基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;按照至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接得到的;
在应用客户端的应用显示界面中输出视频数据以及视频素材片段。
本申请实施例一方面提供了一种视频数据处理装置,包括:
数据获取模块,用户响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取目标视频的视频数据,以及与目标视频相关联的视频素材片段;视频素材片段是由服务器对视频数据进行视频分析得到多个视频片段,其中,视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,模板标签序列由至少一个模板片段的模板属性标签构成;基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;按照至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接得到的;
数据输出模块,用于在应用客户端的应用显示界面中输出视频数据以及视频素材片段。
本申请实施例一方面提供了一种计算机设备,包括存储器和处理器,存储器存储有计算机程序,计算机程序被处理器执行时,使得处理器执行本申请实施例提供的方法。
本申请实施例一方面提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,计算机程序包括程序指令,程序指令当被处理器执行时,执行如本申请实施例提供的方法。
本申请实施例一方面提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例提供的方法。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要 使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种网络架构的结构示意图;
图2是本申请实施例提供的一种进行数据交互的场景示意图;
图3是本申请实施例提供的一种视频数据处理方法的流程示意图;
图4是本申请实施例提供的一种查询视频数据的场景示意图;
图5是本申请实施例提供的一种进行分镜处理的场景示意图;
图6是本申请实施例提供的一种提取片段属性标签的流程示意图;
图7是本申请实施例提供的一种获取视频模板的场景示意图;
图8A是本申请实施例提供的一种对模板视频进行视频分析的场景示意图;
图8B是本申请实施例提供的一种对目标视频进行视频分析的场景示意图;
图9是本申请实施例提供的一种视频数据处理方法的流程示意图;
图10是本申请实施例提供的一种生成视频素材片段的流程示意图;
图11是本申请实施例提供的一种前后端交互的流程示意图;
图12A是本申请实施例提供的一种输出视频素材片段的场景示意图;
图12B是本申请实施例提供的一种更新视频素材片段的场景示意图;
图13是本申请实施例提供的一种视频数据处理装置的结构示意图;
图14是本申请实施例提供的一种视频数据处理装置的结构示意图;
图15是本申请实施例提供的一种计算机设备的结构示意图;
图16是本申请实施例提供的一种视频数据处理系统。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
应当理解,人工智能(Artificial Intelligence,简称AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等方向。
其中,计算机视觉(Computer Vision,简称CV)是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等 生物特征识别技术。
具体的,请参见图1,图1是本申请实施例提供的一种网络架构的结构示意图。如图1所示,该网络架构可以包括业务服务器2000和用户终端集群。其中,用户终端集群具体可以包括一个或者多个用户终端,这里将不对用户终端集群中的用户终端的数量进行限制。如图1所示,多个用户终端具体可以包括用户终端3000a、用户终端3000b、用户终端3000c、…、用户终端3000n。其中,用户终端3000a、用户终端3000b、用户终端3000c、…、用户终端3000n可以分别与业务服务器2000通过有线或无线通信方式进行直接或间接地网络连接,以便于每个用户终端可以通过该网络连接与业务服务器2000之间进行数据交互。
其中,如图1所示的业务服务器2000可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。
应当理解,如图1所示的用户终端集群中的每个用户终端均可以集成安装有应用客户端,当该应用客户端运行于各用户终端中时,可以分别与上述图1所示的业务服务器2000之间进行数据交互。其中,该应用客户端可以理解为一种能够加载并显示视频数据的应用,例如,这里的应用客户端具体可以包括:车载客户端、智能家居客户端、娱乐客户端(例如,游戏客户端)、多媒体客户端(例如,视频客户端)、社交客户端以及资讯类客户端(例如,新闻客户端)等。其中,为便于理解,本申请实施例可以在图1所示的多个用户终端中选择一个用户终端作为目标用户终端,该目标用户终端可以包括:智能手机、平板电脑、笔记本电脑、智能电视等具有视频数据加载功能的智能终端。例如,本申请实施例可以将图1所示的用户终端3000a作为目标用户终端。
为便于理解,比如,当用户Y(即目标用户)在上述目标用户终端中需要播放某个视频(比如,该用户Y自己感兴趣的视频)时,该目标用户终端可以响应用户Y针对该视频的触发操作,向图1所示的业务服务器2000发送视频播放请求。这样,该业务服务器2000可以基于该视频播放请求在视频业务数据库中查找到这个视频的视频数据,进而可以将这个视频数据和与这个视频数据相关联的一个或者多个视频素材片段(例如,这个视频的视频花絮等)返回给目标用户终端,以在该目标用户终端中对该用户Y所请求播放的这个视频的视频数据进行播放处理。在一实施方式中,与此同时,该目标用户终端还可以在播放这个视频数据时,一并显示接收到的这些视频素材片段。可以理解的是,这里的视频素材片段可以是由该业务服务器2000按照选取的视频模板的模板片段以及模板片段对应的模板标签序列,对目标视频的视频片段进行筛选后所得到的;此外,可以理解的是,这里的视频片段可以是由该业务服务器2000对视频数据进行视频分析后所得到的;应当理解,这里的视频模板可以是由该业务服务器2000基于该用户Y(即目标用户)的用户画像所确定的。
其中,可以理解的是,本申请实施例可以将上述用户Y(即目标用户)在应用客户端(例如,视频客户端K)中所选择的贴合自己兴趣的视频(比如,电视剧或者短视频等)统称为目标视频。
其中,本申请实施例中的视频素材片段可以是由上述业务服务器2000根据上述视频模板的模板片段和模板标签序列所智能生成的。比如,业务服务器2000可以通过本申请实施例所涉及的视频数据处理方法,智能生成上述用户Y在上述目标用户终端中所选择的目标视频(例如,电视剧S1)的一个或多个视频素材片段。其中,视频素材片段的生成过程是指业务服务器2000可以将电视剧S1的视频片段的标签信息(即片段属性标签)与模板视频(例如,视频M)的模板片段的标签信息(即模板属性标签)进行标签匹配以及内容相似度匹配的过程,进而可以根据标签匹配以及内容相似度匹配的结果,从电视剧S1的视频片段中筛选与视频M的模板片段中的每个模板片段具有相似视频播放效果的视频片段,从而可以根据筛选出的这些视频片段所构成的拼接视频数据以及模板片段的模板音频数据,智能生成与视频M相似的视频 素材片段。
应当理解,上述网络框架适用于人工智能领域(即AI领域),该AI领域所对应的业务场景可以为视频分类场景、视频推荐场景等,这里将不对具体的业务场景进行一一列举。
其中,这里的视频分类场景主要是指计算机设备(例如,上述业务服务器2000)在通过上述视频数据处理方法,可以将同一视频下的视频片段存储于第一业务数据库。比如,计算机设备在基于某个视频模板生成视频素材片段(例如,基于视频模板B1生成的视频素材片段A1和基于视频模板B2生成的视频素材片段A2)之后,还可以将视频素材片段A1和视频素材片段A2添加至相应的短视频推荐数据库,这里的短视频推荐数据库至少可以包含第一业务数据库和第二业务数据库。其中,这里的第一业务数据库可以用于存储与同一视频相关联的一个或者多个视频素材片段。比如,若视频素材片段A1和视频素材片段A2均属于同一视频(例如,视频W)的视频片段,则可以将视频素材片段A1和视频素材片段A2添加至这个视频W所对应的第一业务数据库。在一实施方式中,若视频素材片段A1和视频素材片段A2分别属于不同视频的视频片段,比如,若视频素材片段A1对应的目标视频为用户Y1所请求的视频W1,则可以将视频素材片段A1添加至这个视频W1所对应的第一业务数据库;若视频素材片段A2对应的目标视频为用户Y2所请求的视频W2,则可以将视频素材片段A2添加至这个视频W2所对应的第一业务数据库。
其中,这里的第二业务数据库可以用于存储与同一视频模板相关联的一个或者多个视频素材片段。这意味着本申请实施例可以在不同视频的视频素材片段中,将使用同一视频模板的视频片段添加至第二业务数据库。例如,计算机设备在基于某个视频模板生成视频素材片段(例如,基于视频模板B生成视频素材片段A)之后,还可以将视频素材片段A添加至视频模板B所对应的第二业务数据库。为便于理解,比如,若该视频模板B为表情集锦类,则可以将视频素材片段A添加至这个表情集锦类所对应的第二业务数据库。又比如,若该视频模板B为故事情节集锦类,则可以将该视频素材片段A添加至这个故事情节集锦类所对应的第二业务数据库。再比如,若该视频模板B为人物混剪集锦类,则可以将该视频素材片段A添加至这个人物混剪集锦类所对应的第二业务数据库。
此外,应当理解,在上述视频推荐场景下,计算机设备(例如,上述业务服务器2000)在通过上述视频数据处理方法智能生成目标用户所请求的目标视频的视频素材片段(例如,基于视频模板B1生成的视频素材片段A1和基于视频模板B2生成的视频素材片段A2)之后,还可以将上述同一视频W的这些视频片段(即上述第一业务数据库中的视频片段,例如,视频素材片段A1和视频素材片段A2)添加至短视频推荐列表(例如,短视频推荐列表1),以将该短视频推荐列表1中的这些视频片段智能推送给上述目标用户。这样,当目标用户在上述应用客户端中观看完上述视频W(即目标视频)之后,还可以智能在应用客户端中为上述目标用户遍历播放上述短视频推荐列表1中的这些视频片段。比如,当目标用户在上述目标用户终端中观看完上述视频W时,计算机设备(例如,上述业务服务器2000)还可以将短视频推荐列表1中的多个视频素材片段中的视频素材片段A1输出至应用客户端,以在该应用客户端中实现对该视频素材片段A1的智能播放。可以理解的是,该视频素材片段A1与上述视频模板B1所映射的模板片段具有相似的视频播放效果。
在一实施方式中,计算机设备(例如,上述业务服务器2000)在通过上述视频数据处理方法智能生成视频素材片段(例如,基于视频模板B1生成的视频素材片段A1和视频素材片段A3)之后,还可以将使用同一视频模板B1的这些视频片段(即上述第二业务数据库中的视频片段,例如,视频素材片段A1和视频素材片段A3)添加至另一短视频推荐列表(例如,短视频推荐列表2),以将该短视频推荐列表2中的这些视频片段智能推送给上述目标用户。这样,当目标用户在上述应用客户端中观看短视频推荐列表2中的视频片段(例如,视频素材片段A1之后,还可以智能在应用客户端中为上述目标用户遍历播放上述短视频推荐列表2中的这些视频片段。比如,当目标用户在上述目标用户终端中观看完上述视频素材片段 A1时,计算机设备(例如,上述业务服务器2000)还可以将短视频推荐列表2中的其他视频素材片段(例如,视频素材片段A3)输出至应用客户端,以在该应用客户端中实现对该视频素材片段A3的智能播放。可以理解的是,由于该视频素材片段A3与上述视频模板B1使用的是同一视频模板,所以,当在应用客户端中播放该视频素材片段A3与上述视频模板B1时,将为目标用户呈现出与上述模板片段B1所映射的模板片段相似的视频播放效果。
为便于理解,进一步的,请参见图2,图2是本申请实施例提供的一种进行数据交互的场景示意图。其中,如图2所示的服务器可以为上述图1所对应实施例中的业务服务器2000,如图2所示的用户终端X可以为上述图1所对应实施例的用户终端集群中的任意一个用户终端。为便于理解,本申请实施例以上述图1所示的用户终端3000a作为该用户终端X为例,以阐述在该业务场景为视频推荐场景下,为目标用户推荐视频素材片段的具体过程。
如图2所示,视频推荐界面200a中可以包含多个推荐视频数据,这里的多个推荐视频数据具体可以包括图2所示的视频数据20a、视频数据20b、视频数据20c和视频数据20d。可以理解的是,本申请实施例可以将展示在视频推荐界面200a中的视频数据20a、视频数据20b、视频数据20c和视频数据20d统称为推荐视频数据。
其中,如图2所示,当目标用户需要播放某个推荐视频数据(例如,视频数据20b)时,可以将该目标用户从该视频推荐界面200a中所选取的视频数据20b统称为应用显示界面中的目标视频。此时,用户终端可以响应针对应用显示界面中的该目标视频的播放操作,向图2所示的服务器发送视频播放请求。此时,服务器可以响应该视频播放请求,以在应用客户端输出该目标视频对应的视频播放界面,例如,可以在应用客户端中输出视频数据20b对应的视频播放界面,该视频数据20b对应的视频播放界面可以为图2所示的视频播放界面200b。其中,应用显示界面中可以包含用于播放目标视频的视频播放界面200b,还可以包括用于展示视频素材片段的短视频推荐列表,该短视频推荐列表中可以包含与该目标视频相关联的视频素材片段。
其中,可以理解的是,服务器在接收目标用户通过用户终端发送的视频播放请求时,可以从该视频播放请求中获取目标视频的视频标识,并根据视频标识在视频业务数据库中查询目标视频的视频数据。在查询到目标视频的视频数据后,服务器可以对该视频数据的视频序列进行上述视频分析,以得到该视频数据的视频片段,这里的视频片段具体可以包括图2所示的视频片段100a、视频片段100b、…、视频片段100k,这里的每个视频片段都可以对应一个片段属性标签。
进一步的,服务器可以基于该目标用户的用户画像,获取贴合该目标用户观影兴趣的视频模板,进而可以获取该视频模板所映射的模板片段以及该模板片段所对应的模板标签序列,以便于能够根据该模板标签序列,从上述视频片段中筛选与每个模板片段相匹配的视频片段(即满足片段匹配条件的视频片段),进而可以基于这些筛选出的满足片段匹配条件的视频片段,得到视频素材片段。由此可见,本申请实施例可以尽可能地从这些视频片段中获取与模板片段有相同标签序列特征的视频片段,进而可以按照相同的标签序列(即上述模板标签序列)填充得到上述视频素材片段(比如,可以得到上述目标视频的一个或者多个短视频),以使用户终端可以将该视频素材片段和上述视频数据输出至应用客户端。其中,可以理解的是,一个视频模板,可以对应一个或者多个视频素材片段,比如,这里将不对从目标视频的视频片段中所筛选出的具有相同标签序列特征的视频素材片段的数量进行限定。
为便于理解,本申请实施例以一个视频模板对应一个视频素材片段为例。那么,当服务器确定出贴合该目标用户的观影兴趣的视频模板的数量有多个(例如,N个)时,本申请实施例还可以用于将智能生成N个视频素材片段的N个视频模板统称为视频模板。应当理解,通过N个视频模板智能生成其他视频素材片段的具体实现方式,可以一并参见对智能生成上述生成视频素材片段的具体过程的描述,这里将不再 继续进行赘述。
其中,在目标用户终端中推荐视频素材片段的具体实现方式,可以参见下述图3-图12B所对应的实施例。
进一步的,请参见图3,图3是本申请实施例提供的一种视频数据处理方法的流程示意图。如图3所示,该方法可以由应用客户端执行,也可以由服务器执行,还可以由应用客户端和服务器共同执行。该应用客户端可以为上述图2所对应实施例中的用户终端X中运行的应用客户端,该服务器可以为上述图2所对应实施例中的服务器。为便于理解,本实施例以该方法由服务器执行为例进行说明,以阐述在服务器中基于视频模板生成目标视频对应的视频素材片段的具体过程。其中,该方法至少可以包括以下步骤S101-步骤S105:
步骤S101,获取目标用户请求的目标视频的视频数据,对视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段。
具体的,服务器可以获取目标用户请求的目标视频的视频数据以及与视频数据相关联的网络识别模型。进一步的,服务器可以通过视频切分组件将视频数据对应的视频序列进行分镜处理,得到与视频序列相关联的多个分镜片段。进一步的,服务器可以将多个分镜片段输入至网络识别模型,由网络识别模型基于多个预设的片段属性标签,对多个分镜片段进行属性分析,得到多个分镜片段对应的片段属性标签。进一步的,服务器可以将具备片段属性标签的多个分镜片段确定为视频数据的多个视频片段。其中,一个视频片段可以对应一个片段属性标签。
应当理解,在获取目标用户请求的目标视频的视频数据之前,服务器可以接收应用客户端发送的视频播放请求。其中,视频播放请求是由应用客户端响应目标用户针对目标视频执行的播放操作所生成的。进一步的,服务器可以从视频播放请求中提取目标视频的视频标识,基于视频标识在视频业务数据库中查找目标视频对应的业务视频数据,将查找到的业务视频数据作为应用客户端中的目标视频的视频数据。
为便于理解,请参见图4,图4是本申请实施例提供的一种查询视频数据的场景示意图。如图4所示,这里的应用显示界面400a可以为上述图2所对应实施例中的应用显示界面200a。在目标用户针对应用客户端的应用显示界面400a中的视频数据40b执行触发操作(即播放操作)时,应用客户端可以将该视频数据40b作为目标视频,并向服务器发送携带该视频数据40b的视频标识的视频播放请求,进而服务器可以接收应用客户端发送的视频播放请求,获取该视频播放请求所携带的关于视频数据40b的视频标识,并基于该视频标识在应用客户端对应的视频业务数据库中查找该视频标识对应的业务视频数据,并将查找的业务视频数据作为视频数据40b对应的视频数据。
其中,可以理解的是,这里的目标视频可以为综艺节目、电影、电视剧等长视频,还可以为从长视频中截取的短视频等,本申请对此不做限制。
应当理解,服务器通过视频切分组件将视频数据对应的视频序列进行分镜处理,得到与视频序列相关联的多个分镜片段的具体过程可以描述为:该服务器可以在获取用于对视频数据的视频序列进行分镜处理的视频切分组件时,通过视频切分组件在视频序列中确定用于作为聚类质心的第一视频帧,并创建该第一视频帧所属的分镜簇的分镜簇信息(可以理解的是,这里的分镜簇信息可以为配置的相应分镜簇的标识)。进一步的,服务器可以在视频序列中将除第一视频帧之外的视频帧确定为第二视频帧,并可以基于轮询机制依次获取每个第二视频帧,以确定每个第二视频帧与第一视频帧的图像相似度。进一步的,若第一视频帧与某个第二视频帧的图像相似度大于或者等于聚类阈值,则服务器可以将图像相似度大于或者等于聚类阈值的该第二视频帧划分到第一视频帧所属的分镜簇。进一步的,若第一视频帧与某个第二视频帧的图像相似度小于聚类阈值,则服务器可以用图像相似度小于聚类阈值的该第二视频帧更新第一视频帧(例如, 将该第二视频帧作为更新后的第一视频帧),并创建更新后的第一视频帧所属的另一分镜簇的分镜簇信息,进而可以将更新后的第一视频帧依次与未匹配的第二视频帧进行图像相似度匹配,直到视频序列中的视频帧均完成图像相似度匹配时,可以得到视频序列中的视频帧所属的分镜簇的分镜簇信息(即可以划分得到视频序列中的每个视频帧所属的分镜簇)。进一步的,服务器可以基于视频序列中的视频帧所属的分镜簇的分镜簇信息,确定与视频序列相关联的分镜片段,即,将视频序列中的视频帧组成多个分镜片段。
可以理解的是,图像相似度匹配指的是对两幅图像之间内容的相似程度进行计算,可以得到用于判断图像内容的相似程度的图像相似度。若图像相似度越大,则表明这两幅图像越相似,若图像相似度越小,则表明这两幅图像越不相似。其中,两幅图像之间内容的相似程度可以使用不同的方法来衡量。比如,在使用余弦相似度的情况下,可以把图片表示成一个向量,通过计算向量之间的余弦距离来表征两张图片的相似度;直方图可以描述一幅图像中颜色的全局分布,直方图相似度是另一种图像相似度计算方法;结构相似度是一种全参考的图像质量评价指标,分别从亮度、对比度、结构三个方面度量图像相似性。应当理解,本申请对图像相似度匹配时具体使用的方法不做限制。
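In code form, the greedy, centroid-based storyboard segmentation described above can be sketched as follows; histogram correlation computed with OpenCV is used here purely as one of the possible image-similarity measures mentioned in the preceding paragraph, and the threshold and histogram settings are illustrative assumptions rather than values taken from this application:

```python
import cv2

def segment_shots(video_path: str, sim_threshold: float = 0.7):
    """Greedy storyboard segmentation: the first frame of the current shot is the
    cluster centroid; a frame whose similarity to the centroid falls below the
    threshold becomes the centroid of a new shot (storyboard cluster)."""
    cap = cv2.VideoCapture(video_path)
    shots, current, centroid_hist = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if centroid_hist is None:
            centroid_hist, current = hist, [frame]
            continue
        if cv2.compareHist(centroid_hist, hist, cv2.HISTCMP_CORREL) >= sim_threshold:
            current.append(frame)                        # same storyboard cluster
        else:
            shots.append(current)                        # close the finished shot
            centroid_hist, current = hist, [frame]       # this frame is the new centroid
    if current:
        shots.append(current)
    cap.release()
    return shots
```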
为便于理解,请参见图5,图5是本申请实施例提供的一种进行分镜处理的场景示意图。如图5所示的视频序列可以包括多个视频帧,具体可以包括图2所示的n个视频帧,n可以为大于1的正整数,这n个视频帧具体可以包括:视频帧10a、视频帧10b、视频帧10c、视频帧10d、…、视频帧10n。应当理解,通过聚类算法可以计算该视频序列中的视频帧之间的图像相似度,从而可以基于计算得到的视频帧之间的图像相似度,将该视频序列中的视频帧划分到不同的聚类簇(即分镜簇)。比如,通过聚类算法可以得到图5所示的k个聚类簇(即k个分镜簇),且这k个聚类簇具体可以包含图5所示的聚类簇20a、聚类簇20b、…、聚类簇20k。其中,可以理解的是,图5所示的k个聚类簇中每个聚类簇中均至少可以包括一个视频帧。
具体的,本申请可以在图5所示的视频序列中,将该视频序列中的首个视频帧(即视频帧10a)称之为能够用于作为聚类质心(即聚类质心1)的第一视频帧,并可以将该视频序列中除视频帧10a之外的视频帧确定为第二视频帧,并可以基于轮询机制依次获取第二视频帧(即视频帧10b、视频帧10c、…、视频帧10n),以依次计算第一视频帧与第二视频帧的图像相似度。其中,本申请可以创建该聚类质心1所属的分镜簇(即分镜簇1),进而可以将视频帧10b与视频帧10a进行图像相似度匹配。在视频帧10b与视频帧10a的图像相似度(例如,相似度1)大于或等于聚类阈值时,将相似度1所对应的视频帧10b划分到视频帧10a所属的分镜簇(即分镜簇1)中。同理,本申请可以将视频帧10c划分到视频帧10a所属的分镜簇(即分镜簇1)中。
进一步的,由于视频帧10d为视频帧帧10c的下一视频帧,因此,本申请可以将视频帧10d与视频帧10a进行图像相似度匹配。在视频帧10d与视频帧10a的图像相似度(例如,相似度2)小于聚类阈值时,根据视频帧10d更新第一视频帧,以将相似度2所对应的视频帧10d作为更新后的第一视频帧以及新的聚类质心(即聚类质心2),并可以创建该聚类质心2所属的分镜簇(即分镜簇2),进而,可以基于上述轮询机制依次获取未匹配的第二视频帧(即视频帧10e、…、视频帧10n),以依次计算更新后的第一视频帧与未匹配的第二视频帧的图像相似度。其中,本申请可以将视频帧10e划分到视频帧10d所属的分镜簇(即分镜簇2)中。
其中,可以理解的是,本申请在获取聚类质心1和聚类质心2之后,可以使用同样的方法获取聚类质心3、聚类质心4、…、聚类质心k。同理,本申请在获取分镜簇1(即聚类簇20a)和分镜簇2(即聚类簇20b)之后,可以使用同样的方法获取分镜簇3(即聚类簇20c)、分镜簇4(即聚类簇20d)、…、分镜簇k(即聚类簇20k)。此时,视频序列中的视频帧10a、视频帧10b、…、视频帧10n已经全部完成图像相似度匹配。
由此可见,通过对图5所示的视频序列中的视频帧进行聚类处理(即分镜处理),可以得到与该视频序列相关联的多个聚类簇(即分镜簇),从而可以将每个聚类簇中的视频帧构成一个分镜片段,进而可以得到图5所示的k个分镜片段。比如,可以将聚类簇20a中的视频帧10a、视频帧10b和视频帧10c构成分镜簇1对应的分镜片段(即分镜片段1),可以将聚类簇20b中的视频帧10d和视频帧10e构成分镜簇2对应的分镜片段(即分镜片段2),…,可以将聚类簇20k中的视频帧10(n-2)、视频帧10(n-1)和视频帧10n构成分镜簇k对应的分镜片段(即分镜片段k)。
应当理解,将目标视频对应的视频序列划分为多个分镜片段的视频切分组件可以为pyscenedetect开源代码库,该pyscenedetect开源代码库是一个自动将视频数据分割为单个片段的工具,其中,第一视频帧(聚类质心)的选择可以不限于上述方式。可以理解的是,将目标视频对应的视频序列划分为多个分镜片段的方法还可以为鼓点识别的方式,例如,获取目标视频的音频数据,识别该音频数据中的鼓点,根据鼓点在音频数据中的位置,确定鼓点在目标视频的视频数据中的位置,以对视频数据的视频序列进行划分。其中,将视频序列划分为多个分镜片段的方法还可以为其他的方式,本申请对视频分镜具体使用的分镜方法不做限制。
可以理解的是,上述网络识别模型至少包括:具有第一属性标签提取功能的第一网络模型、具有第二属性标签提取功能的第二网络模型和具有第三属性标签提取功能的第三网络模型。应当理解,服务器可以将多个分镜片段输入第一网络模型,通过第一网络模型对多个分镜片段中的每个分镜片段进行远近景分析,得到多个分镜片段的远近景标签,将多个分镜片段的远近景标签作为第一网络模型输出的第一属性标签,将具有第一属性标签的分镜片段作为第一类分镜片段。进一步的,服务器可以将第一类分镜片段输入第二网络模型,由第二网络模型对第一类分镜片段中的每个分镜片段进行人脸检测,得到人脸检测结果。进一步的,若人脸检测结果指示第一类分镜片段中存在目标角色的人脸,则服务器可以在第一类分镜片段中将存在目标角色的人脸所对应的分镜片段作为第二类分镜片段,通过第二网络模型确定第二类分镜片段中的目标角色所属的角色标签,将目标角色所属的角色标签确定为第二类分镜片段的第二属性标签。其中,目标角色为目标视频中的一个或者多个角色。进一步的,服务器可以在第一类分镜片段中将除第二类分镜片段之外的分镜片段,确定为第三类分镜片段,将第三类分镜片段输入第三网络模型,由第三网络模型对第一类分镜片段中的每个分镜片段进行场景检测,得到第三类分镜片段的第三属性标签。进一步的,服务器可以根据第一类分镜片段的第一属性标签、第二类分镜片段的第二属性标签、以及第三类分镜片段的第三属性标签,确定多个分镜片段中的每个分镜片段对应的片段属性标签。
可以理解的是,第一网络模型可以为远近景识别模型,第二网络模型可以为人脸识别模型,第三网络模型可以为场景识别模型。基于此,上述第一网络模型、第二网络模型和第三网络模型还可以为表情识别模型、动作识别模型等,本申请对网络识别模型的具体类型不做限制。同理,网络识别模型可以采用任意深度学习模型或机器学习模型,本申请对网络识别模型使用的具体模型不做限制。应当理解,通过预先训练的网络识别模型,可以提高目标视频的基础分析能力,进而可以快速得到目标视频对应的视频片段。
可以理解的是,可以通过人脸检测模型对分镜片段进行人脸检测,得到人脸检测结果,进而可以通过人脸识别模型确定人脸检测结果中的人脸所对应的角色标签。这里的人脸检测模型和人脸识别模型可以统称为第二网络模型。其中,人脸检测与人脸识别可以统称为图像检测。图像检测表示机器学习技术可以对标注样本数据(例如,图像中多个标注框与标签对的对应关系)进行学习后获得数学模型,在学习训练的过程中可以获得该数学模型的参数,识别预测时加载该数学模型的参数,并计算输入样本存在的实物标签的预测框以及该预测框属于指定范围内某个实物标签的概率,进而可以将具有最大概率的实物标签作为该预测框对应的标签。
其中,可以理解的是,将分镜片段直接输入远近景识别模型,可以获取该分镜片段对应的远近景标签 (即第一属性标签),将分镜片段直接输入场景识别模型,可以获取该分镜片段对应的场景标签(即第三属性标签)。在将分镜片段输入第三网络模型之前,需要提前对人脸进行目标角色的检索,即可以提前将分镜片段的目标角色输入第二网络模型,通过该第二网络模型提取该目标角色的特征向量。因此,在确定分镜片段的角色标签时,可以在将分镜片段输入第二网络模型时,提取该分镜片段中的全部视频帧,并对这些视频帧进行人脸检测,进而可以将检测到的人脸的特征向量与上述目标角色的特征向量进行比较。若特征向量比较得到的相似度结果大于阈值,则认为该人脸是目标角色,将目标角色的角色标签作为该检测到的人脸所在的分镜片段的角色标签(即第二属性标签)。
应当理解,本申请借助于上述网络识别模型(即第一网络模型、第二网络模型和第三网络模型),可以在无需知道分镜片段的标签信息(即不需要预先知道任何图像噪声信息)的情况下,直接将分镜片段输入第一网络模型(即上述远近景识别模型),以获取该分镜片段对应的第一属性标签(即根据模型自动进行样本噪声程度预测,以给出每个分镜片段的新标签),进而可以将具有第一属性标签的分镜片段统称为第一类分镜片段。可以理解的是,本申请实施例在得到第一类分镜片段之后,还可以进一步将新标签自动反馈到后续模型中进行学习,以通过动态噪声预测及处理,来避免上述网络识别模型陷入局部最优,保证模型学习是向识别效果更佳的方向进行。另外,当使用第一网络模型对目标视频的分镜片段进行远近景识别之后,得到的第一类分镜片段,在有新的视频模板时,也可以重复使用,不需要重复进行视频帧的识别和处理,进而可以节省计算资源。
比如,本申请实施例还可以进一步将第一类分镜片段输入至第二网络模型,以使第二网络模型可以对所述第一类分镜片段中的每个分镜片段进行人脸检测和人脸识别,进而可以在这些第一类分镜片段中挑选出所有的包含目标角色的人脸的第一类分镜片段。可以理解的是,本申请实施例可以将挑选出的这些包含目标角色的人脸的第一类分镜片段统称为第二类分镜片段。此外,可以理解的是,该第二网络模型还可以用于输出每个第二分镜片段中的目标角色所属的角色标签。基于此,本申请实施例可以将这里的目标角色所属的角色标签统称为第二类分镜片段的第二属性标签。应当理解,这里的目标角色可以为目标视频中的一个或者多个角色,这里将不对目标角色的数量进行限定。进一步的,本申请实施例还可以在第一分镜片段中将除第二类分镜片段之外的其他分镜片段统称为第三类分镜片段,进而可以将第三类分镜片段输入第三网络模型(即上述场景识别模型),以获取该第三类分镜片段对应的第三属性标签。由此可见,本申请实施例通过上述训练好的网络识别模型可以实时校正分镜片段所属的标签信息,进而可以根据上述第一属性标签、第二属性标签和第三属性标签,准确得到每个分镜片段的片段属性标签。
为便于理解,请参见图6,图6是本申请实施例提供的一种提取片段属性标签的流程示意图。如图6所示的视频数据可以为上述目标视频的视频数据,则得到分镜片段的标签信息的具体过程可以描述为:对视频数据的视频序列进行视频分镜可以得到k个分镜片段,进而可以将每个分镜片段输入网络识别模型,以得到每个分镜片段在网络识别模型下的标签信息。这里的网络识别模型可以为图6所示的场景识别模型(即第三网络模型)、远近景识别模型(即第一网络模型)、人脸检测模型和人脸识别模型(即第二网络模型)。
其中,可以理解的是,如图6所示,在将k个分镜片段输入远近景识别模型之后,得到的每个分镜片段对应的远近景标签(即第一属性标签)可以为:{分镜1:x1,分镜2:x2,…,分镜k:xk},这里的x1表示分镜1对应的远近景标签为x1,这里的x2表示分镜2对应的远近景标签为x2,…,这里的xk表示分镜k对应的远近景标签为xk。其中,上述远近景标签可以包括但不限于:远景、人物近景、人物特写、物体全景、物体特写等。其中,这里的分镜1、分镜2、…、分镜k可以为上述图5所对应实施例中的分镜片段1、分镜片段2、…、分镜片段k。
其中,可以理解的是,如图6所示,在将k个分镜片段输入远近景识别模型之后,可以将k个分镜片 段输入人脸检测模型和人脸识别模型。在将k个分镜片段输入人脸检测模型和人脸识别模型之后,得到的每个分镜片段对应的角色标签(即第二属性标签)可以为:{分镜1:y1,分镜2:y2,分镜4:y4,…,分镜k-1:yk-1},这里的y1表示分镜1对应的角色标签为y1,这里的y2表示分镜2对应的角色标签为y2,这里的y4表示分镜4对应的角色标签为y4,…,这里的yk-1表示分镜k-1对应的角色标签为yk-1。其中,上述角色标签可以包括但不限于:单人、双人等;上述角色标签还可以包括但不限于:男一、男二、女一、女二、小女孩A、小男孩B等。其中,分镜3、分镜5、…、分镜k不包括角色标签。
其中,可以理解的是,如图6所示,在将k个分镜片段输入人脸检测模型和人脸识别模型之后,可以将未获得检测或识别结果(即不包括角色标签)的分镜片段输入场景识别模型。在将未获得检测或识别结果的分镜片段输入场景识别模型之后,得到的每个分镜对应的场景标签(即第三属性标签)可以为:{分镜3:z3,分镜5:z5,…,分镜k:zk},这里的z3表示分镜3对应的场景标签为z3,这里的z5表示分镜5对应的场景标签为z5,…,这里的zk表示分镜k对应的场景标签为zk。其中,上述场景标签可以包括但不限于:自然场景、室内场景、人物建筑、竹林、河边、游乐园等。
应当理解,对于k个分镜片段中的某一个分镜片段而言,可以使用该分镜片段的远近景标签和角色标签、或远近景标签和场景标签,来共同描述该分镜片段的片段属性标签。比如,对于k个分镜片段中的分镜1而言,可以使用该分镜1的远近景标签和角色标签,共同来描述该分镜1的片段属性标签(即片段属性标签1),例如分镜1对应的远近景标签为远景(即x1为远景),分镜1对应的角色标签为男一(即y1为男一),则该分镜1对应的片段属性标签1可以为:{远景、男一}。
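The cascade of the three recognition models described above (far/near-shot model, face detection and recognition, scene model) can be summarized in the following hedged sketch; shot_model, face_model, scene_model and known_faces are hypothetical stand-ins for the trained first, second and third network models and the pre-extracted target-role features:

```python
def label_storyboard_segments(segments, shot_model, face_model, scene_model, known_faces):
    """For each storyboard segment: always assign a far/near-shot tag; if a known
    target role is found by face detection + recognition, assign a role tag,
    otherwise fall back to a scene tag."""
    labelled = []
    for seg in segments:
        tags = {"shot": shot_model(seg)}                 # first attribute tag
        faces = face_model.detect(seg)
        role = face_model.match(faces, known_faces) if faces else None
        if role is not None:
            tags["role"] = role                          # second attribute tag
        else:
            tags["scene"] = scene_model(seg)             # third attribute tag
        labelled.append((seg, tags))
    return labelled
```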
步骤S102,基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成。
具体的,服务器可以获取目标用户的行为日志表,从行为日志表中提取与目标用户相关联的行为数据信息。进一步的,服务器可以对行为数据信息进行用户画像分析,得到用于表征目标用户的用户画像,基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板。其中,视频模板可以携带模板片段的模板属性标签所构成的模板标签序列,模板片段是对模板视频进行视频分析后所得到的,模板视频是由行为数据信息所确定的。所述视频模板数据库可以是在服务器上,或者在独立于服务器的其他设备上。进一步的,服务器可以获取视频模板中包括的至少一个模板片段以及至少一个模板片段的模板属性标签构成的模板标签序列。其中,可以理解的是,本申请实施例可以将服务器在目标时长内所获取到的应用客户端中的不同用户的行为日志统称为行为日志表。
其中,可以理解的是,行为数据信息用于记录目标用户每次访问应用客户端时,产生的行为交互数据(访问,浏览,搜索,点击等)。这里的行为交互数据具体可以包括目标用户访问视频的类型、浏览视频的时间、浏览视频的次数、搜索视频的记录、点击视频的次数,以及目标用户收藏的视频、推荐的视频、点赞的视频、购买的视频、投币的视频等。
为便于理解,请参见图7,图7是本申请实施例提供的一种获取视频模板的场景示意图。如图7所示的日志管理系统70中具体可以包括多个数据库,多个数据库具体可以包含图7所示的数据库70a、数据库70b、…、数据库70n。这意味着该日志管理系统70可以用于存储应用客户端中的不同用户的行为日志。比如,数据库70a可以用于存储用户Y1(未在图上示出)的行为日志,数据库70b可以用于存储用户Y2(未在图上示出)的行为日志,…,数据库70n可以用于存储用户Yn(未在图上示出)的行为日志。
其中,如图7所示,在目标用户为上述用户Y1(即目标用户)时,服务器可以在数据库10a中获取目标用户在目标时长内的行为日志表,可以进一步在行为日志表中获取行为数据信息。应当理解,服务器在获取到目标用户的行为数据信息之后,可以对目标时长内的行为数据信息进行用户画像分析,以得到用 于表征目标用户的用户画像。
其中,这里的用户画像可以包括目标用户对于某一个视频类型的喜爱程度,服务器进而可以选择这一视频类型的视频模板作为与目标用户相关联的视频模板。同理,这里的用户画像可以包括目标用户对于某一个视频的喜爱程度,服务器进而可以选择这一个视频对应的视频模板作为与目标用户相关联的视频模板。可以理解的是,这里的视频模板对应的模板数据可以为与目标视频的视频数据具有相同视频类型的数据。比如,在目标视频为动漫时,可以在动漫类的视频模板中选择与目标视频相关联的视频模板。又比如,在目标视频为真人剧时,可以在真人剧类的视频模板中选择与目标视频相关联的视频模板。这样,可以为目标视频选择最佳的视频模板,提高视频素材片段的显示效果。
其中,可以理解的是,图7所示的日志管理系统70可以在单个行为记录周期内(比如,以天为该行为记录周期的单位),为访问该应用客户端的目标用户建立一个行为日志表。例如,该日志管理系统70可以在检测到该目标用户当天首次访问该应用客户端时,为该目标用户建立一个行为日志表。此时,这个行为日志表中记录了当前访问该应用客户端的访问时间戳(例如,T1时刻)。这意味着这个行为日志表中并不存在当前T1时刻之前的任何其他行为交互数据。进一步的,该日志管理系统70可以在当前这个行为记录周期达到记录周期阈值时,将为这个目标用户建立的这个行为日志表(例如,行为日志表1)添加到对应的数据库(例如,图7所示的数据库10a)进行存储。同理,日志管理系统70可以在目标用户的访问时间戳为其他时刻(例如,T2时刻)时,将该T2时刻所对应行为日志表(例如,行为日志表2)添加到对应的数据库(例如,图7所示的数据库10a)进行存储。
应当理解,当目标用户在记录周期内访问客户端,且与该应用客户端之间产生交互行为时,则日志管理系统70可以在这个记录周期的行为日志表中,记录该目标用户与应用客户端之间的交互行为。可以理解的是,这里的目标时长具体可以包括:一个或者多个记录周期。所以,服务器在上述目标时长(即截止到本次访问应用客户端前的多个记录周期)内所获取到的目标用户的行为日志表,具体可以包括上述行为日志表1、上述行为日志表1和行为日志表2。
步骤S103,基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段。
步骤S104,按照至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为目标视频的视频素材片段。
根据本申请实施例,服务器可以基于至少一个模板片段以及所述模板标签序列,在多个视频片段中筛选满足片段匹配条件的视频片段,将满足片段匹配条件的视频片段作为目标视频的视频素材片段。
具体的,服务器可以将N个模板片段作为目标模板片段,在模板标签序列中将目标模板片段的队列位置(目标模板片段在N个模板片段形成的队列中的位置或顺序)确定为目标队列位置,将目标队列位置对应的模板属性标签确定为目标模板属性标签。其中,模板片段的数量可以为N个,这里的N可以为大于1的正整数。因此,模板标签序列可以包含N个序列位置,一个序列位置对应一个模板属性标签,且一个模板属性标签对应一个模板片段。进一步的,服务器可以在多个视频片段对应的片段属性标签中,筛选与目标模板属性标签相匹配的片段属性标签,将筛选出的片段属性标签所对应的一个或多个视频片段确定为候选视频片段。进一步的,服务器可以将候选视频片段中的每个候选视频片段与目标模板片段进行相似分析,得到每个候选视频片段与目标模板片段的相似阈值。在相似阈值中确定最大相似阈值,将最大相似阈值所对应的候选视频片段确定为目标模板片段相匹配的目标候选视频片段。进一步的,服务器可以基于目标模板片段在模板标签序列中的目标队列位置,确定目标候选视频片段对应的片段属性标签所构成的目标标签序列,将与所述目标标签序列相关联的所有目标候选视频片段进行拼接处理,作为所述视频素材片段,即,根据与目标标签序列相关联的所有目标候选视频片段,确定满足片段匹配条件的视频素材片段。其中,视 频素材片段的片段属性标签所构成的目标标签序列与模板标签序列相同。
可以理解的是,相似分析可以表示候选视频片段与目标模板片段之间的场景相似度。将候选视频片段输入第三网络模型,可以得到候选视频片段对应的候选特征向量。将目标模板片段输入第三网络模型,可以得到目标模板片段对应的目标特征向量。通过计算候选特征向量与目标特征向量之间的向量距离,可以得到候选视频片段与目标模板片段之间的相似度(即上述相似阈值)。考虑到该第三网络模型为场景识别模型,则这里的相似度可以表示场景相似度。其中,相似分析还可以表示候选视频片段与目标模板片段之间远近景相似度,相似分析还可以表示候选视频片段与目标模板片段之间人物相似度。
比如,可以将目标模板片段输入第三网络模型,得到目标模板片段的目标特征向量。假设存在2个候选视频片段,这2个候选视频片段具体可以包括:候选视频片段1和候选视频片段2,其中,将这2个候选视频片段输入第三网络模型,得到候选视频片段1的候选特征向量1以及候选视频片段2的候选特征向量2。在计算上述目标特征向量分别与2个候选特征向量之间的向量距离之后,若目标特征向量与候选特征向量2之间的距离最小,则表示目标模板片段与候选视频片段2之间的相似阈值为最大相似阈值,可以将候选特征向量2所对应的候选视频片段2作为与目标模板片段相匹配的目标候选视频片段。其中,相似分析还可以表示候选视频片段与目标模板片段之间的时长关系,本申请对相似分析的计算方法不做具体限制。
为便于理解,请参见图8A和图8B,图8A是本申请实施例提供的一种对模板视频进行视频分析的场景示意图,图8B是本申请实施例提供的一种对目标视频进行视频分析的场景示意图。对模板视频进行视频分析后可以得到图8A所示的N个模板片段,这里的N可以为大于1的正整数。例如,N等于4,则4个模板片段可以包括:模板片段80a、模板片段80b、模板片段80c和模板片段80d。其中,模板片段80a对应的模板属性标签为{远景},模板片段80b对应的模板属性标签为{人物特写},模板片段80c对应的模板属性标签为{人物近景}以及模板片段80d对应的模板属性标签为{物体近景}。对目标视频进行视频分析后可以得到图8B所示的M个视频片段,这里的M可以为大于1的正整数。例如,M等于8,则8个视频片段可以包括:视频片段800a、视频片段800b、视频片段800c、视频片段800d、视频片段800e、视频片段800f、视频片段800g和视频片段800h。其中,视频片段800a对应的片段属性标签为{远景},视频片段800b对应的片段属性标签为{人物近景},视频片段800c对应的片段属性标签为{远景},视频片段800d对应的片段属性标签为{人物近景},视频片段800e对应的片段属性标签为{人物近景},视频片段800f对应的片段属性标签为{远景},视频片段800g对应的片段属性标签为{物体近景},视频片段800h对应的片段属性标签为{人物特写}。
其中,可以理解的是,若从图8A中的4个模板片段中获取模板片段80a作为目标模板片段(例如,目标模板片段1),则该目标模板片段1的队列位置可以为位置1(即目标队列位置为位置1),该目标模板片段的模板属性标签可以为{远景}(即目标模板属性标签为{远景})。在图8B的8个视频片段中筛选出与该目标模板属性标签相匹配的片段属性标签为{远景},{远景}对应的视频片段为视频片段800a、视频片段800c和视频片段800f,则目标模板片段1对应的候选视频片段为视频片段800a、视频片段800c和视频片段800f。进一步的,在计算这3个候选视频片段与目标模板片段1之间的相似阈值之后,若视频片段800a与该目标模板片段1之间的相似阈值为最大相似阈值,则将视频片段800a确定为与该目标模板片段1相匹配的目标候选视频片段(例如,目标候选视频片段1)。
同理,可以理解的是,若从图8A中的4个模板片段中获取模板片段80b作为目标模板片段(例如,目标模板片段2),则该目标模板片段2的队列位置可以为位置2(即目标队列位置为位置2),该目标模板片段的模板属性标签可以为{人物特写}(即目标模板属性标签为{人物特写})。在图8B的8个视频片段中筛选出与该目标模板属性标签相匹配的片段属性标签为{人物特写},{人物特写}对应的视频片段为视 频片段800h,则将视频片段800h确定为与该目标模板片段2相匹配的目标候选视频片段(例如,目标候选视频片段2)。
同理,可以理解的是,若从图8A中的4个模板片段中获取模板片段80c作为目标模板片段(例如,目标模板片段3),则该目标模板片段3的队列位置可以为位置3(即目标队列位置为位置3),该目标模板片段的模板属性标签可以为{人物近景}(即目标模板属性标签为{人物近景})。在图8B的8个视频片段中筛选出与该目标模板属性标签相匹配的片段属性标签为{人物近景},{人物近景}对应的视频片段为视频片段800d和视频片段800e,则目标模板片段3对应的候选视频片段为视频片段800d和视频片段800e。进一步的,在计算这2个候选视频片段与目标模板片段3之间的相似阈值之后,若视频片段800e与该目标模板片段3之间的相似阈值为最大相似阈值,则将视频片段800e确定为与该目标模板片段3相匹配的目标候选视频片段(例如,目标候选视频片段3)。
同理,可以理解的是,若从图8A的4个模板片段中获取模板片段80d作为目标模板片段(例如,目标模板片段4),则该目标模板片段4的队列位置可以为位置4(即目标队列位置为位置4),该目标模板片段的模板属性标签可以为{物体近景}(即目标模板属性标签为{物体近景})。在图8B的8个视频片段中筛选出与该目标模板属性标签相匹配的片段属性标签为{物体近景},{物体近景}对应的视频片段为视频片段800g,则将视频片段800g确定为与该目标模板片段4相匹配的目标候选视频片段(例如,目标候选视频片段4)。
因此,位置1对应的目标候选视频片段1为视频片段800a,位置2对应的目标候选视频片段2为视频片段800h,位置3对应的目标候选视频片段3为视频片段800e,位置4对应的目标候选视频片段4为视频片段800g,则可以基于位置1、位置2、位置3和位置4,由视频片段800a、视频片段800h、视频片段800e和视频片段800g确定视频素材片段。其中,模板标签序列为模板片段对应的模板属性标签所构成的序列,这里的模板标签序列可以表示为{远景、人物特写、人物近景、物体近景};目标标签序列为与模板片段匹配的视频片段对应的片段属性标签所构成的序列,这里的目标标签序列可以表示为{远景、人物特写、人物近景、物体近景}。
其中,可以理解的是,目标模板片段1可以与目标候选视频片段1具有相似的视频播放效果,目标模板片段2可以与目标候选视频片段2具有相似的视频播放效果,目标模板片段3可以与目标候选视频片段3具有相似的视频播放效果,目标模板片段4可以与目标候选视频片段4具有相似的视频播放效果,因此,视频素材片段可以与上述模板片段具有相同的视频播放效果。
应当理解,服务器可以将与目标标签序列相关联的所有目标候选视频片段进行视频拼接处理,得到与N个模板片段相关联的拼接视频数据。进一步的,服务器可以获取与N个模板片段相关联的模板音频数据,通过音视频合成组件将模板音频数据和拼接视频数据进行音视频合并处理,得到满足片段匹配条件的视频素材片段。
其中,将每个目标候选视频片段进行视频拼接处理,以及将模板音频数据和拼接视频数据进行音视频合并处理的工具可以为同一个工具,这个工具可以为上述音视频合成组件。这里的音视频合成组件可以为ffmpeg工具,也可以为其他第三方具有视频解封装能力的软件工具。这里将不再对视频解封装组件进行一一举例。
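For the splicing and audio/video merging step, a minimal command-line sketch using ffmpeg (one possible audio/video synthesis component) could look like the following; paths, container formats and the decision to stream-copy rather than re-encode are assumptions made for brevity:

```python
import os
import subprocess
import tempfile

def splice_with_template_audio(clip_paths, template_audio, out_path="material.mp4"):
    """Concatenate the selected clips with the ffmpeg concat demuxer (assumes the
    clips share codec parameters), then mux the template audio over the spliced
    video, trimming to the shorter of the two streams."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{os.path.abspath(p)}'\n" for p in clip_paths)
        list_file = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", "spliced.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "spliced.mp4", "-i", template_audio,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
                    out_path], check=True)
    return out_path
```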
步骤S105,将视频数据以及视频素材片段推送至目标用户对应的应用客户端,以使应用客户端输出视频数据以及视频素材片段。
其中,可以理解的是,应用客户端在接收到视频数据以及视频素材片段后,可以在应用显示界面中播放该视频数据以及视频素材片段。在一实施方式中,当应用客户端播放所述视频数据的同时,该应用客户端还可以用于展示每个视频素材片段的缩略图。这里将不对应用客户端输出视频素材片段的具体实现形式 进行限定。
在本申请实施例中,服务器在获取到目标用户请求的某个视频的视频数据时,可以对该视频数据进行视频分析,以得到该视频数据的一个或者多个视频片段。其中,可以理解的是,本申请实施例所涉及的视频分析主要包括:视频分镜和属性分析。其中,视频分镜主要是指可以将该视频数据划分成一个或者多个分镜片段,这样,服务器可以进一步对每个分镜片段的片段内容进行属性分析,以得到每个分镜片段的片段属性标签,从而将具备片段属性标签的分镜片段统称为前述视频片段,应当理解,一个视频片段可以对应一个片段属性标签。进一步的,服务器可以在获取到目标用户的用户画像时,快速根据该用户画像确定出与该目标用户相关联的视频模板,进而可以在获取到这个视频模板所映射的模板片段(比如,热门短视频)以及模板片段对应的模板标签序列时,智能在视频片段中筛选满足片段匹配条件的视频片段,从而可以将筛选出的满足片段匹配条件的视频片段作为目标视频的视频素材片段。其中,可以理解的是,这里的视频素材片段的片段属性标签所构成的目标标签序列可以与模板标签序列相同,以确保该视频素材片段与上述模板片段具有相同的视频播放效果。然后,服务器可以将上述视频数据以及视频素材片段智能推送至目标用户对应的应用客户端,以使应用客户端可以输出视频数据和视频素材片段。由此可见,本申请实施例通过视频分析(例如,视频分镜和属性分析等),可以快速得到携带片段属性标签的一个或者多个视频片段。这样,对于这些视频片段而言,可以在根据用户画像智能确定出一个或者多个视频模板时,根据这些视频模板的模板标签序列来分别对这些视频片段进行筛选,以快速得到与该视频模板具有相似视频播放效果的视频片段,进而可以快速合成得到视频素材片段(比如,可以快速得到能够推送给目标用户的短视频),并且可以随着视频模板的增加和更新,重复利用这些视频片段的分镜和属性信息,减少对目标视频中视频帧的识别和处理,提高短视频的生成效率,节省针对不同用户不断生成和分发大量短视频的计算成本,节省服务器的计算资源。
进一步的,请参见图9,图9是本申请实施例提供的一种视频数据处理方法的流程示意图。如图9所示,该方法可以由应用客户端和服务器共同执行,该应用客户端可以为上述图2所对应实施例中的用户终端X中运行的应用客户端,该服务器可以为上述图2所对应实施例中的服务器。其中,该方法可以包括以下步骤:
步骤S201,应用客户端可以响应目标用户针对目标视频执行的播放操作,生成用于请求播放目标视频的视频播放请求,将视频播放请求发送给服务器;
其中,视频播放请求中可以携带目标视频的视频标识,这里的视频标识用于指示服务器获取目标用户所请求播放的目标视频的视频数据。其中,播放操作可以包括点击、长按、滑动等接触性操作,也可以包括语音、手势等非接触性操作,本申请在此不做限定。
步骤S202,服务器获取目标用户请求的目标视频的视频数据,对视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;
步骤S203,服务器基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;
步骤S204,服务器基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;
步骤S205,服务器按照至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为目标视频的视频素材片段;
步骤S206,服务器将视频数据以及视频素材片段推送至目标用户对应的应用客户端;
为便于理解,请参见图10,图10是本申请实施例提供的一种生成视频素材片段的流程示意图。如图10所示,服务器可以在获取到精彩短视频(即模板视频)时,对该精彩短视频进行视频分析,以得到该精彩短视频的一个或者多个视频片段,进而可以将该精彩短视频的一个或者多个视频片段作为模板片段。其中,可以理解的是,本申请实施例所涉及的视频分析主要包括:视频分镜和属性分析。其中,视频分镜主要是指可以将该精彩短视频的视频数据划分成一个或者多个分镜片段。这样,服务器可以进一步对每个分镜片段的片段内容进行属性分析(即分镜信息抽取),以得到每个分镜片段的模板属性标签(即图10所示的场景标签、人物标签(即角色标签)和远近镜标签),从而将具备模板属性标签的分镜片段统称为前述模板片段,从而可以基于模板属性标签确定热门集锦序列(即镜头序列记录)。应当理解,一个模板片段可以对应一个模板属性标签。其中,图10所示的集锦序列库中的热门集锦序列1可以为模板片段1对应的模板属性标签,热门集锦序列2可以为模板片段2对应的模板属性标签,热门集锦序列3可以为模板片段3对应的模板属性标签。
应当理解,本申请实施例可以将模板视频(即上述精彩短视频)的模板片段、模板片段的模板标签序列和模板音频数据(即音乐)统称为视频模板。
如图10所示,服务器可以在获取到电视剧(即目标视频)时,对该电视剧进行视频分镜和属性分析,以得到该电视剧的一个或者多个视频片段。应当理解,一个视频片段可以对应一个片段属性标签。这样,服务器可以从集锦序列库中获取一个或多个热门集锦序列(即序列采样),进而可以按照选取的热门集锦序列确定模板片段以及模板片段对应的模板标签序列,对目标视频的视频片段进行筛选和排序,以得到筛选后的视频片段(即基于素材匹配的分片段镜序列排列),进而可以根据筛选出的这些视频片段所构成的拼接视频数据以及模板片段的模板音频数据,智能生成与模板片段相似的视频素材片段。
其中,通过抽取各短视频平台中的精彩短视频,并获取这些精彩短视频对应的视频模板,可以实现视频模板的连续多日积累。将电视剧根据视频模板生成相应样式的一个或多个视频素材片段,可以丰富最终生成的视频素材片段的样式。其中,一个电视剧可以根据多个视频模板生成多种样式的视频素材片段,可以供视频推荐场景中千人千面的推荐选择,且对于每个视频模板,通过深度学习和图像分析算法可以对精彩短视频和电视剧进行视频分析与视频匹配,可以达到自动化分析的目标。此外,对于新的电视剧,只需要有限的迁移能力即可完成该电视剧的解析,使得新的电视剧的视频素材片段生成难度将低,生成视频素材片段的方法的可迁移性大。
应当理解,服务器对电视剧进行视频分镜和属性分析的具体过程,可以参见上述步骤S102的描述,这里将不再继续进行赘述。应当理解,服务器对精彩短视频进行视频分镜和属性分析的具体过程,可以参见服务器对电视剧进行视频分镜和属性分析的描述,这里将不再继续进行赘述。
步骤S207,应用客户端在应用显示界面中输出视频数据以及视频素材片段。
具体的,应用客户端可以接收服务器基于视频播放请求返回的目标视频的视频数据,以及与目标视频相关联的视频素材片段,并可以在应用客户端的应用显示界面中确定用于播放视频数据的视频播放界面,进而可以在视频播放界面中播放视频数据。进一步的,应用客户端可以响应针对应用显示界面的触发操作,在应用客户端的应用显示界面中播放相应的视频素材片段。其中,该触发操作可以包括点击、长按、滑动等接触性操作,也可以包括语音、手势等非接触性操作,本申请在此不做限定。在一实施方式中,可以理解的是,应用客户端在获取到视频素材片段之后,还可以在应用显示界面中展示每个视频素材片段的缩略图,或者在应用显示界面中动态播放每个视频素材片段的动画,这里将不对这些视频素材片段的具体展示形式进行限定。
为便于理解,请参见图11,图11是本申请实施例提供的一种前后端交互的流程示意图。可以理解的是,上述应用客户端可以运行在图11所示的前端B。目标用户针对前端B的应用客户端中的目标视频(比 如,目标用户感兴趣的视频)执行的播放操作,即为前端B输入目标视频。进而服务器(即后端)可以基于视频模板,生成与该目标视频相关联的一个或多个视频素材片段(即后端生成)。进而服务器可以将这个目标视频的视频数据和与这个目标视频相关联的一个或者多个视频素材片段(例如,这个视频的视频花絮等)返回给前端B,即在前端B的应用显示界面中显示服务器返回的视频数据和视频素材片段。应当理解,这里的视频模板可以是由该服务器基于该目标用户的用户画像所确定的。
可以理解的是,如图11所示,前端A可以为视频剪辑员对应的另一用户终端。在对前端A输入的精彩短视频进行视频分析后,该视频剪辑员可以在视频分析得到的视频片段中,选择一个或多个视频片段作为模板片段,进而可以基于这些模板片段确定视频模板(即挖掘精彩视频模板)。其中,前端A可以接收精彩短视频的输入,然后将该精彩短视频对应的视频模板(即精彩视频模板)上传给服务器保存(即后端保存)。
应当理解,上述前端B与前端A还可以为同一个用户终端,即前端B(或前端A)可以是精彩短视频的输入方,也可以是目标视频的输入方。
为便于理解,请参见图12A,图12A是本申请实施例提供的一种输出视频素材片段的场景示意图。如图12A所示,这里的应用显示界面120a可以为上述图2所对应实施例中的应用显示界面。应用显示界面120a中可以包含用于播放目标视频的视频播放界面1,还可以包括用于展示或播放视频素材片段的短视频推荐列表(例如,短视频推荐列表1)。该短视频推荐列表1中至少可以包含与该目标视频相关联的视频素材片段。这里的视频素材片段可以为上述第一业务数据库中与目标视频相关联的视频素材片段。在目标用户针对应用显示界面120a执行触发操作(例如,图12A所示的滑动操作)后,应用客户端可以在应用显示界面120b的集锦推荐部分展示或播放上述短视频推荐列表1中的视频素材片段。其中,在一实施方式中,当应用客户端在视频播放界面1中播放目标视频时,该应用客户端还可以遍历播放(或同步播放)短视频推荐列表1中的视频素材片段。如图12A所示,该视频推荐列表1中具体可以包括与该目标视频相关联的N个视频素材片段。这里的N个视频素材片段具体可以为图12A所示的3个视频素材片段。比如,这3个视频素材片段可以具体包括:视频素材片段A1、视频素材片段A2和视频素材片段A3。
在一实施方式中,在目标用户针对应用显示界面120a中的业务推荐控件执行触发操作(例如,点击操作)后,应用客户端可以在应用显示界面120b的集锦推荐部分展示或播放上述短视频推荐列表1中的视频素材片段,例如,应用显示界面120b中的视频素材片段A1、视频素材片段A2和视频素材片段A3等。
为便于理解,请参见图12B,图12B是本申请实施例提供的一种更新视频素材片段的场景示意图。如图12B所示,在目标用户针对上述图12A的视频素材片段A1执行触发操作(例如,点击操作)时,服务器可以将这个视频素材片段A1的视频数据(例如,视频数据J)和与这个视频数据J相关联的一个或者多个视频素材片段(例如,视频素材片段C1、视频素材片段C2和视频素材片段C3)返回给应用客户端,以在应用客户端中播放这个视频数据J。在一实施方式中,应用客户端还可以在播放视频素材片段A1的视频数据J时,一并显示接收到的这些视频素材片段,得到应用显示界面120c。
这里的应用显示界面120c中可以包含用于播放视频数据J的视频播放界面2,还可以包括用于展示视频素材片段的短视频推荐列表(例如,短视频推荐列表2)。该短视频推荐列表2中至少可以包含与该视频数据J关联的视频素材片段。在目标用户针对应用显示界面120c中的业务推荐控件执行触发操作(例如,图12B所示的点击操作)后,应用客户端可以在应用显示界面120d的集锦推荐部分展示或播放上述短视频推荐列表2中的视频素材片段。这里的视频素材片段可以为上述第二业务数据库中与视频素材片段A1具有同一视频模板的视频素材片段。如图12B所示,该短视频推荐列表2中具体可以包括与该视频数据J相关联的M个视频素材片段。这里的M个视频素材片段具体可以为图12B所示的3个视频素材片段。比 如,这3个视频素材片段可以具体包括:视频素材片段C1、视频素材片段C2和视频素材片段C3。
在一实施方式中,在目标用户针对应用显示界面120c执行触发操作(例如,滑动操作)后,应用客户端可以在应用显示界面120d的集锦推荐部分展示或播放上述短视频推荐列表2中的视频素材片段,例如,应用显示界面120d中的视频素材片段C1、视频素材片段C2和视频素材片段C3等。
应当理解,当目标用户在上述应用客户端中观看完上述视频素材片段A1之后,还可以智能在应用客户端中为上述目标用户遍历播放上述短视频推荐列表2中的这些视频素材片段。比如,当目标用户在上述应用客户端中观看完上述视频素材片段A1时,服务器还可以将短视频推荐列表2中的多个视频素材片段中的视频素材片段C1输出至应用客户端,以在该应用客户端中实现对该视频素材片段C1的智能播放。在一实施方式中,该应用客户端还可以在将应用客户端的视频播放界面1中所播放的视频数据更新为视频素材片段A1时,记录目标视频的当前播放进度(例如,时刻T),以在播放完视频素材片段A1后,从目标视频的时刻T开始继续对目标视频进行播放。
其中,可以理解的是,应用客户端可以根据目标视频的当前播放进度,实时动态调整视频素材片段在短视频推荐列表中的位置,以为目标用户推荐不同排序的视频素材片段。比如,若在当前播放进度之前,包括组成视频素材片段的全部视频片段,即组成视频素材片段的全部视频片段在当前时刻已经观看完成,则可以将该视频素材片段排列在短视频推荐列表的前面,即实现剧情回放。在一实施方式中,应用客户端还可以根据当前视频素材片段在其它用户终端中的应用客户端上的播放次数,来将视频推荐列表中的视频素材片段进行排序。若某个视频素材片段的播放总次数比较高,则表示这个视频素材片段的质量比较高,则可以为目标用户优先推荐该视频素材片段,即将该视频素材片段排列在短视频推荐列表的前面。
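The re-ordering logic sketched in the preceding paragraph (promote material clips whose constituent segments have already been watched, otherwise rank by how often a clip has been played on other clients) could be expressed as follows; the field names and inputs are hypothetical:

```python
def rank_material_clips(material_clips, watched_segment_ids, play_counts):
    """Material clips whose constituent video segments have all been watched are
    promoted to the front of the recommendation list (story replay); the rest are
    ordered by descending play count on other application clients."""
    def sort_key(clip):
        fully_watched = all(seg_id in watched_segment_ids for seg_id in clip["segment_ids"])
        return (0 if fully_watched else 1, -play_counts.get(clip["id"], 0))
    return sorted(material_clips, key=sort_key)
```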
由此可见,本申请实施例通过对视频数据进行视频分析(例如,视频分镜和属性分析等),可以快速得到携带片段属性标签的一个或者多个视频片段。这样,对于这些视频片段而言,可以在根据用户画像准确地确定出一个或者多个视频模板时,智能根据这些视频模板的模板标签序列来分别对这些视频片段进行筛选,以快速得到与该视频模板具有相似视频播放效果的视频片段,进而可以快速合成得到视频素材片段(比如,可以快速得到能够展示给目标用户的短视频),并且可以随着视频模板的增加和更新,重复利用这些视频片段的分镜和属性信息,减少对目标视频中视频帧的识别和处理,提高短视频的生成效率,节省针对不同用户不断生成和分发大量短视频的计算成本,节省服务器的计算资源。
进一步的,请参见图13,图13是本申请实施例提供的一种视频数据处理装置的结构示意图。视频数据处理装置1可以包括:片段生成模块30、模板获取模块40、素材确定模块50、数据发送模块60;进一步的,视频数据处理装置1还可以包括:请求接收模块10、数据查找模块20。
片段生成模块30,用于获取目标用户请求的目标视频的视频数据,对视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段。
其中,片段生成模块30包括:模型获取单元301、分镜获取单元302、标签确定单元303、片段确定单元304。
模型获取单元301,用于获取目标用户请求的目标视频的视频数据以及与视频数据相关联的网络识别模型;
分镜获取单元302,用于通过视频切分组件将视频数据对应的视频序列进行分镜处理,得到与视频序列相关联的多个分镜片段;
其中,分镜获取单元302包括:组件获取子单元3021、图像匹配子单元3022、分镜创建子单元3023、匹配完成子单元3024、分镜确定子单元3025;
组件获取子单元3021,用于通过视频切分组件在视频序列中确定用于作为聚类质心的第一视频帧,创 建第一视频帧所属的分镜簇的分镜簇信息;
图像匹配子单元3022,用于在视频序列中将除第一视频帧之外的视频帧确定为第二视频帧,基于轮询机制依次获取第二视频帧中的每个第二视频帧,确定每个第二视频帧与第一视频帧的图像相似度;
分镜创建子单元3023,用于若第一视频帧与一第二视频帧的图像相似度大于或者等于聚类阈值,则将图像相似度大于或者等于聚类阈值的第二视频帧划分到第一视频帧所属的分镜簇;
匹配完成子单元3024,用于若第一视频帧与一第二视频帧的图像相似度小于聚类阈值,则用图像相似度小于聚类阈值的第二视频帧更新第一视频帧,创建更新后的第一视频帧所属的分镜簇的分镜簇信息,将更新后的第一视频帧依次与未匹配的第二视频帧进行图像相似度匹配,直到视频序列中的视频帧均完成图像相似度匹配时,得到视频序列中的视频帧所属的分镜簇的分镜簇信息;
分镜确定子单元3025,用于基于视频序列中的视频帧所属的分镜簇的分镜簇信息,将视频序列中的视频帧组成多个分镜片段。
其中,组件获取子单元3021、图像匹配子单元3022、分镜创建子单元3023、匹配完成子单元3024以及分镜确定子单元3025的具体实现方式,可以参见上述图3所对应实施例中对步骤S101的描述,这里将不再进行赘述。
标签确定单元303,用于将多个分镜片段输入至网络识别模型,由网络识别模型基于所述多个预设的片段属性标签,对多个分镜片段进行属性分析,得到多个分镜片段对应的片段属性标签。
其中,网络识别模型至少包括:具有第一属性标签提取功能的第一网络模型、具有第二属性标签提取功能的第二网络模型和具有第三属性标签提取功能的第三网络模型。
标签确定单元303包括:第一分析子单元3031、人脸检测子单元3032、第二分析子单元3033、第三分析子单元3034、标签分析子单元3035;
第一分析子单元3031,用于将多个分镜片段输入第一网络模型,通过第一网络模型对多个分镜片段中的每个分镜片段进行远近景分析,得到多个分镜片段的远近景标签,将多个分镜片段的远近景标签作为第一网络模型输出的第一属性标签,将具有第一属性标签的分镜片段作为第一类分镜片段;
人脸检测子单元3032,用于将第一类分镜片段输入第二网络模型,由第二网络模型对第一类分镜片段中的每个分镜片段进行人脸检测,得到人脸检测结果;
第二分析子单元3033,用于若人脸检测结果指示第一类分镜片段中存在目标角色的人脸,则在第一类分镜片段中将存在目标角色的人脸所对应的分镜片段作为第二类分镜片段,通过第二网络模型确定第二类分镜片段中的目标角色所属的角色标签,将目标角色所属的角色标签确定为第二类分镜片段的第二属性标签;目标角色为目标视频中的一个或者多个角色;
第三分析子单元3034,用于在第一类分镜片段中将除第二类分镜片段之外的分镜片段,确定为第三类分镜片段,将第三类分镜片段输入第三网络模型,由第三网络模型对第一类分镜片段中的每个分镜片段进行场景检测,得到第三类分镜片段的第三属性标签;
标签分析子单元3035,用于根据第一类分镜片段的第一属性标签、第二类分镜片段的第二属性标签、以及第三类分镜片段的第三属性标签,确定多个分镜片段中的每个分镜片段对应的片段属性标签。
其中,第一分析子单元3031、人脸检测子单元3032、第二分析子单元3033、第三分析子单元3034以及标签分析子单元3035的具体实现方式,可以参见上述图3所对应实施例中对步骤S101的描述,这里将不再进行赘述。
片段确定单元304,用于将具备片段属性标签的分镜片段确定为视频数据的视频片段。
其中,模型获取单元301、分镜获取单元302、标签确定单元303以及片段确定单元304的具体实现方式,可以参见上述图3所对应实施例中对步骤S101的描述,这里将不再进行赘述。
模板获取模块40,用于基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成。
其中,模板获取模块40包括:行为提取单元401、行为分析单元402、模板分析单元403;
行为提取单元401,用于获取目标用户的行为日志表,从行为日志表中提取与目标用户相关联的行为数据信息;
行为分析单元402,用于对行为数据信息进行用户画像分析,得到用于表征目标用户的用户画像,基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板;
模板分析单元403,用于获取视频模板中预先确定的所述至少一个模板片段以及所述模板标签序列。
其中,行为提取单元401,行为分析单元402以及模板分析单元403的具体实现方式,可以参见上述图3所对应实施例中对步骤S102的描述,这里将不再进行赘述。
素材确定模块50,用于基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的一个视频片段,按照至少一个模板片段中各个模板片段的位置,将匹配的至少一个视频片段进行拼接,作为目标视频的视频素材片段。
其中,模板片段的数量为N个,N为大于1的正整数;模板标签序列包含N个序列位置,一个序列位置对应一个模板属性标签,且一个模板属性标签对应一个模板片段。
素材确定模块50包括:标签确定单元501、标签筛选单元502、片段匹配单元503、素材生成单元504;
标签确定单元501,用于将N个模板片段作为目标模板片段,在模板标签序列中将目标模板片段的队列位置确定为目标队列位置,将目标队列位置对应的模板属性标签确定为目标模板属性标签;
标签筛选单元502,用于在多个视频片段对应的片段属性标签中,筛选与目标模板属性标签相匹配的片段属性标签,将筛选出的片段属性标签所对应的一个或多个视频片段确定为候选视频片段;
片段匹配单元503,用于将候选视频片段中的每个候选视频片段与目标模板片段进行相似分析,得到每个候选视频片段与目标模板的相似阈值,在相似阈值中确定最大相似阈值,将最大相似阈值所对应的候选视频片段确定为目标模板片段相匹配的目标候选视频片段;
素材生成单元504,用于基于目标模板片段在模板标签序列中的目标队列位置,确定目标候选视频片段对应的片段属性标签所构成的目标标签序列,将与目标标签序列相关联的所有目标候选视频片段进行拼接处理,得到视频素材片段。
其中,素材生成单元504包括:视频拼接子单元5041、素材合成子单元5042;
视频拼接子单元5041,用于将与目标标签序列相关联的所有目标候选视频片段进行视频拼接处理,得到与N个模板片段相关联的拼接视频数据;
素材合成子单元5042,用于获取与N个模板片段相关联的模板音频数据,通过音视频合成组件将模板音频数据和拼接视频数据进行音视频合并处理,得到视频素材片段。
其中,视频拼接子单元5041以及素材合成子单元5042的具体实现方式,可以参见上述图3所对应实施例中对步骤S103、S104的描述,这里将不再进行赘述。
其中,标签确定单元501、标签筛选单元502、片段匹配单元503以及素材生成单元504的具体实现方式,可以参见上述图3所对应实施例中对步骤S103、S104的描述,这里将不再进行赘述。
数据发送模块60,用于将视频数据以及视频素材片段推送至目标用户对应的应用客户端,以使应用客户端输出视频数据以及视频素材片段。
在一实施方式中,请求接收模块10,用于接收应用客户端发送的视频播放请求;视频播放请求是由应用客户端响应目标用户针对目标视频执行的播放操作所生成的;
数据查找模块20,用于从视频播放请求中提取目标视频的视频标识,基于视频标识在视频业务数据库中查找目标视频对应的业务视频数据,将查找到的业务视频数据作为应用客户端中的目标视频的视频数据。
其中,片段生成模块30、模板获取模块40、素材确定模块50以及数据发送模块60的具体实现方式,可以参见上述图3所对应实施例中对步骤S101-步骤S105的描述,这里将不再进行赘述。在一实施方式中,请求接收模块10以及数据查找模块20的具体实现方式,可以参见上述图9所对应实施例中对步骤S201和步骤S207的描述,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
进一步的,请参见图14,图14是本申请实施例提供的一种视频数据处理装置的结构示意图。视频数据处理装置2可以包括:数据获取模块70、数据输出模块80;
数据获取模块70,用于响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取目标视频的视频数据,以及与目标视频相关联的视频素材片段;视频素材片段是由服务器对视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;按照至少一个模板片段中各个模板片段的模板属性标签在模板标签序列中的位置,将匹配的至少一个视频片段进行拼接得到的。
其中,数据获取模块70包括:请求发送单元701、数据接收单元702;
请求发送单元701,用于响应目标用户针对应用客户端中的目标视频执行的播放操作,生成用于请求播放目标视频的视频播放请求,将视频播放请求发送给服务器;视频播放请求中携带目标视频的视频标识;视频标识用于指示服务器获取目标用户所请求播放的目标视频的视频数据;
数据接收单元702,用于接收服务器基于视频播放请求返回的视频数据,以及与目标视频相关联的视频素材片段;视频素材片段是由服务器在根据目标用户的用户画像确定出视频模板时,根据视频模板对视频数据进行视频分析以及视频匹配后所得到的,用户画像是由目标用户在应用客户端中的用户行为信息所确定的。
其中,请求发送单元701以及数据接收单元702的具体实现方式,可以参见上述图9所对应实施例中对步骤S201的描述,这里将不再进行赘述。
数据输出模块80,用于在应用客户端的应用显示界面中输出视频数据以及视频素材片段。
其中,数据输出模块包括:视频播放单元801、素材输出单元802;
视频播放单元801,用于在应用客户端的应用显示界面中确定用于播放视频数据的视频播放界面,在视频播放界面中播放视频数据;
素材输出单元802,用于响应针对应用显示界面的触发操作,在应用显示界面中播放视频素材片段。
其中,视频播放单元801以及素材输出单元802的具体实现方式,可以参见上述图9所对应实施例中对步骤S207的描述,这里将不再进行赘述。
其中,数据获取模块70以及数据输出模块80的具体实现方式,可以参见上述图9所对应实施例中对步骤S201和步骤S207的描述,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图15,图15是本申请实施例提供的一种计算机设备的结构示意图。如图15所示,该计算机设备2000可以包括:处理器2001、网络接口2004和存储器2005,此外,上述计算机设备2000还可以包括:用户接口2003和至少一个通信总线2002。其中,通信总线2002用于实现这些组件之间的连接通信。用户 接口2003还可以包括标准的有线接口、无线接口。在一实施方式中,网络接口2004可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器2005可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。在一实施方式中,存储器2005还可以是至少一个位于远离前述处理器2001的存储装置。如图15所示,作为一种计算机可读存储介质的存储器2005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在如图15所示的计算机设备2000中,网络接口2004可提供网络通讯功能;而用户接口2003主要用于为用户提供输入的接口;而处理器2001可以用于调用存储器2005中存储的设备控制应用程序。
应当理解,本申请实施例中所描述的计算机设备2000可以为服务器或用户终端,这里将不对其进行限定。可以理解的是,该计算机设备2000可以用于执行前文图3或图9所对应实施例中对视频数据处理方法的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且计算机可读存储介质中存储有前文提及的视频数据处理装置1或视频数据处理装置2所执行的计算机程序,且计算机程序包括程序指令,当处理器执行程序指令时,能够执行前文图3或图9所对应实施例中对视频数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。
进一步的,请参见图16,图16是本申请实施例还提供一种视频数据处理系统。该视频数据处理系统3中可以包含服务器3a和用户终端3b,所述服务器3a可以为前述图13所对应实施例中的视频数据处理装置1;所述用户终端3b可以为前述图14所对应实施例中的视频数据处理装置2。可以理解的是,对采用相同方法的有益效果描述,也不再进行赘述。
此外,需要说明的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或者计算机程序可以包括计算机指令,该计算机指令可以存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器可以执行该计算机指令,使得该计算机设备执行前文图3或图9所对应实施例中对视频数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机程序产品或者计算机程序实施例中未披露的技术细节,请参照本申请方法实施例的描述。
可以理解的是,在本申请的具体实施方式中,涉及到用户相关联的行为数据、用户画像等相关的数据,当本申请以上实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,存储介质可为磁碟、光盘、只读存储存储器(Read-Only Memory,ROM)或随机存储存储器(Random Access Memory,RAM)等。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (15)

  1. 一种视频数据处理方法,由计算机设备执行,所述方法包括:
    获取目标用户请求的目标视频的视频数据,对所述视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;
    基于所述目标用户的用户画像,从视频模板数据库中确定与所述目标用户相关联的视频模板,并获取所述视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;
    基于所述至少一个模板片段的模板属性标签和所述多个视频片段对应的片段属性标签,在所述多个视频片段中筛选与所述至少一个模板片段的模板属性标签匹配的至少一个视频片段;
    按照所述至少一个模板片段中各个模板片段的模板属性标签在所述模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为所述目标视频的视频素材片段;
    将所述视频数据以及所述视频素材片段推送至所述目标用户对应的应用客户端,以使所述应用客户端输出所述视频数据以及所述视频素材片段。
  2. 根据权利要求1所述的方法,其中,在所述获取目标用户请求的目标视频的视频数据之前,所述方法还包括:
    响应于接收到应用客户端发送的针对目标视频的视频播放请求,从所述视频播放请求中提取所述目标视频的视频标识;
    基于所述视频标识,在视频业务数据库中查找所述目标视频对应的业务视频数据,将查找到的业务视频数据作为所述应用客户端中的目标视频的视频数据。
  3. 根据权利要求1所述的方法,其中,所述对所述视频数据进行视频分析得到多个视频片段,包括:
    通过视频切分组件将所述视频数据对应的视频序列进行分镜处理,得到与所述视频序列相关联的多个分镜片段;
    将所述多个分镜片段输入至网络识别模型,由所述网络识别模型基于所述多个预设的片段属性标签,对所述多个分镜片段进行属性分析,得到所述多个分镜片段对应的片段属性标签;
    将具备片段属性标签的所述多个分镜片段确定为所述视频数据的所述多个视频片段。
  4. 根据权利要求3所述的方法,其中,所述通过视频切分组件将所述视频数据对应的视频序列进行分镜处理,得到与所述视频序列相关联的多个分镜片段,包括:
    通过所述视频切分组件在所述视频序列中确定用于作为聚类质心的第一视频帧,创建所述第一视频帧所属的分镜簇的分镜簇信息;
    在所述视频序列中将除所述第一视频帧之外的视频帧确定为第二视频帧,基于轮询机制依次获取所述第二视频帧中的每个第二视频帧,确定每个第二视频帧与所述第一视频帧的图像相似度;
    若所述第一视频帧与一第二视频帧的图像相似度大于或者等于聚类阈值,则将所述图像相似度大于或者等于所述聚类阈值的第二视频帧划分到所述第一视频帧所属的分镜簇;
    若所述第一视频帧与一第二视频帧的图像相似度小于所述聚类阈值,则用所述图像相似度小于所述聚类阈值的第二视频帧更新所述第一视频帧,创建更新后的第一视频帧所属的分镜簇的分镜簇信息,将所述更新后的第一视频帧依次与未匹配的第二视频帧进行图像相似度匹配,直到所述视频序列中的视频帧均完成所述图像相似度匹配时,得到所述视频序列中的视频帧所属的分镜簇的分镜簇信息;
    基于所述视频序列中的视频帧所属的分镜簇的分镜簇信息,将所述视频序列中的视频帧组成所述多个分镜片段。
  5. 根据权利要求3所述的方法,其中,所述网络识别模型至少包括:具有第一属性标签提取功能的第一网络模型、具有第二属性标签提取功能的第二网络模型和具有第三属性标签提取功能的第三网络模型;
    所述将所述多个分镜片段输入至网络识别模型,由所述网络识别模型基于所述多个预设的片段属性标签,对所述多个分镜片段进行属性分析,得到所述多个分镜片段对应的片段属性标签,包括:
    将所述多个分镜片段输入所述第一网络模型,通过所述第一网络模型对所述多个分镜片段中的每个分镜片段进行远近景分析,得到所述多个分镜片段的远近景标签,将所述多个分镜片段的远近景标签作为所述第一网络模型输出的第一属性标签,将具有所述第一属性标签的分镜片段作为第一类分镜片段;
    将所述第一类分镜片段输入所述第二网络模型,由所述第二网络模型对所述第一类分镜片段中的每个分镜片段进行人脸检测,得到人脸检测结果;
    若所述人脸检测结果指示所述第一类分镜片段中存在目标角色的人脸,则在所述第一类分镜片段中将存在所述目标角色的人脸所对应的分镜片段作为第二类分镜片段,通过所述第二网络模型确定所述第二类分镜片段中的目标角色所属的角色标签,将所述目标角色所属的角色标签确定为所述第二类分镜片段的第二属性标签;所述目标角色为所述目标视频中的一个或者多个角色;
    在所述第一类分镜片段中将除所述第二类分镜片段之外的分镜片段,确定为第三类分镜片段,将所述第三类分镜片段输入所述第三网络模型,由所述第三网络模型对所述第一类分镜片段中的每个分镜片段进行场景检测,得到所述第三类分镜片段的第三属性标签;
    根据所述第一类分镜片段的第一属性标签、所述第二类分镜片段的第二属性标签、以及所述第三类分镜片段的第三属性标签,确定所述多个分镜片段中的每个分镜片段对应的片段属性标签。
  6. 根据权利要求1所述的方法,其中,所述基于目标用户的用户画像,从视频模板数据库中确定与所述目标用户相关联的视频模板,并获取所述视频模板中预先确定的至少一个模板片段以及模板标签序列,包括:
    获取所述目标用户的行为日志表,从所述行为日志表中提取与所述目标用户相关联的行为数据信息;
    对所述行为数据信息进行用户画像分析,得到用于表征所述目标用户的用户画像;
    基于所述目标用户的用户画像,从所述视频模板数据库中确定与所述目标用户相关联的视频模板,并获取所述视频模板中预先确定的所述至少一个模板片段以及所述模板标签序列。
  7. 根据权利要求1所述的方法,其中,所述至少一个模板片段的数量为N个,所述N为大于1的正整数;所述模板标签序列包含N个序列位置,一个序列位置对应一个模板属性标签,且一个模板属性标签对应一个模板片段;
    所述基于所述至少一个模板片段的模板属性标签和所述多个视频片段对应的片段属性标签,在所述多个视频片段中筛选与所述至少一个模板片段的模板属性标签匹配的至少一个视频片段,包括:
    将N个所述模板片段作为目标模板片段,在所述模板标签序列中将所述目标模板片段的队列位置确定为目标队列位置,将所述目标队列位置对应的模板属性标签确定为目标模板属性标签;
    在所述多个视频片段对应的片段属性标签中,筛选与所述目标模板属性标签相匹配的片段属性标签,将筛选出的片段属性标签所对应的一个或多个视频片段确定为候选视频片段;
    将所述候选视频片段中的每个候选视频片段与所述目标模板片段进行相似分析,得到所述每个候选视频片段与所述目标模板片段的相似阈值,在所述相似阈值中确定最大相似阈值,将所述最大相似阈值所对应的候选视频片段确定为所述目标模板片段相匹配的目标候选视频片段;
    所述按照所述至少一个模板片段中各个模板片段的模板属性标签在所述模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为所述目标视频的视频素材片段,包括:
    基于所述目标模板片段在所述模板标签序列中的目标队列位置,确定所述目标候选视频片段对应的片 段属性标签所构成的目标标签序列,将与所述目标标签序列相关联的所有目标候选视频片段进行拼接处理,得到所述视频素材片段。
  8. 根据权利要求7所述的方法,其中,所述将与所述目标标签序列相关联的所有目标候选视频片段进行拼接处理,得到所述视频素材片段,包括:
    将与所述目标标签序列相关联的所有目标候选视频片段进行视频拼接处理,得到与所述N个模板片段相关联的拼接视频数据;
    获取与所述N个模板片段相关联的模板音频数据,通过音视频合成组件将所述模板音频数据和所述拼接视频数据进行音视频合并处理,得到所述视频素材片段。
  9. 一种视频数据处理方法,包括:
    响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取所述目标视频的视频数据,以及与所述目标视频相关联的视频素材片段;所述视频素材片段是由所述服务器对所述视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;按照所述至少一个模板片段中各个模板片段的模板属性标签在所述模板标签序列中的位置,将匹配的至少一个视频片段进行拼接得到的;
    在所述应用客户端的应用显示界面中输出所述视频数据以及所述视频素材片段。
  10. 根据权利要求9所述的方法,其中,所述响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取所述目标视频的视频数据,以及与所述目标视频相关联的视频素材片段,包括:
    响应目标用户针对应用客户端中的目标视频执行的播放操作,生成用于请求播放所述目标视频的视频播放请求,将所述视频播放请求发送给服务器;所述视频播放请求中携带所述目标视频的视频标识;所述视频标识用于指示所述服务器获取所述目标用户所请求播放的目标视频的视频数据;
    接收所述服务器基于所述视频播放请求返回的所述视频数据,以及与所述目标视频相关联的视频素材片段;所述视频素材片段是由所述服务器在根据所述目标用户的用户画像确定出视频模板时,根据所述视频模板对所述视频数据进行视频分析以及视频匹配后所得到的,所述用户画像是由所述目标用户在所述应用客户端中的用户行为信息所确定的。
  11. 根据权利要求9所述的方法,其中,所述在所述应用客户端的应用显示界面中输出所述视频数据以及所述视频素材片段,包括:
    在所述应用客户端的应用显示界面中确定用于播放所述视频数据的视频播放界面,在所述视频播放界面中播放所述视频数据;
    响应针对所述应用显示界面的触发操作,在所述应用显示界面中播放所述视频素材片段。
  12. 一种视频数据处理装置,包括:
    片段生成模块,用于获取目标用户请求的目标视频的视频数据,对所述视频数据进行视频分析得到多个视频片段,其中,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;
    模板获取模块,用于基于所述目标用户的用户画像,从视频模板数据库中确定与所述目标用户相关联的视频模板,并获取所述视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;
    素材确定模块,用于基于所述至少一个模板片段的模板属性标签和所述多个视频片段对应的片段属性标签,在所述多个视频片段中筛选与所述至少一个模板片段的模板属性标签匹配的至少一个视频片段,按照所述至少一个模板片段中各个模板片段的模板属性标签在所述模板标签序列中的位置,将匹配的至少一个视频片段进行拼接,作为所述目标视频的视频素材片段;
    数据发送模块,用于将所述视频数据以及所述视频素材片段推送至所述目标用户对应的应用客户端,以使所述应用客户端输出所述视频数据以及所述视频素材片段。
  13. 一种视频数据处理装置,包括:
    数据获取模块,用于响应目标用户针对应用客户端中的目标视频执行的播放操作,从服务器上获取所述目标视频的视频数据,以及与所述目标视频相关联的视频素材片段;所述视频素材片段是由所述服务器对所述视频数据进行视频分析得到多个视频片段,所述视频分析包括分镜处理和基于多个预设的片段属性标签的属性分析,所述多个视频片段中的每一个视频片段对应一个片段属性标签和一个分镜片段;基于目标用户的用户画像,从视频模板数据库中确定与目标用户相关联的视频模板,并获取视频模板中预先确定的至少一个模板片段以及模板标签序列,所述模板标签序列由所述至少一个模板片段的模板属性标签构成;基于至少一个模板片段的模板属性标签和多个视频片段对应的片段属性标签,在多个视频片段中筛选与至少一个模板片段的模板属性标签匹配的至少一个视频片段;按照所述至少一个模板片段中各个模板片段的模板属性标签在所述模板标签序列中的位置,将匹配的至少一个视频片段进行拼接得到的;
    数据输出模块,用于在所述应用客户端的应用显示界面中输出所述视频数据以及所述视频素材片段。
  14. 一种计算机设备,包括:处理器、存储器、网络接口;
    所述处理器与存储器、网络接口相连,其中,网络接口用于提供数据通信功能,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序,以执行权利要求1-11任一项所述的方法。
  15. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,该计算机程序适于由处理器加载并执行权利要求1-11任一项所述的方法。
PCT/CN2021/133035 2020-12-02 2021-11-25 一种视频数据处理方法、装置、设备以及介质 WO2022116888A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/951,621 US20230012732A1 (en) 2020-12-02 2022-09-23 Video data processing method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011390109.2A CN112565825B (zh) 2020-12-02 2020-12-02 一种视频数据处理方法、装置、设备以及介质
CN202011390109.2 2020-12-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/951,621 Continuation US20230012732A1 (en) 2020-12-02 2022-09-23 Video data processing method and apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2022116888A1 true WO2022116888A1 (zh) 2022-06-09

Family

ID=75047852

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133035 WO2022116888A1 (zh) 2020-12-02 2021-11-25 一种视频数据处理方法、装置、设备以及介质

Country Status (3)

Country Link
US (1) US20230012732A1 (zh)
CN (1) CN112565825B (zh)
WO (1) WO2022116888A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086783A (zh) * 2022-06-28 2022-09-20 北京奇艺世纪科技有限公司 一种视频生成方法、装置及电子设备

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112565825B (zh) * 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、设备以及介质
CN113259708A (zh) * 2021-04-06 2021-08-13 阿里健康科技(中国)有限公司 基于短视频介绍商品的方法、计算机设备和介质
CN115278296A (zh) * 2021-04-29 2022-11-01 汉海信息技术(上海)有限公司 视频生成方法、装置、电子设备
CN113762040A (zh) * 2021-04-29 2021-12-07 腾讯科技(深圳)有限公司 视频识别方法、装置、存储介质及计算机设备
CN115525780A (zh) * 2021-06-24 2022-12-27 北京字跳网络技术有限公司 一种模板推荐方法、装置、设备及存储介质
CN113852858A (zh) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 视频处理方法及电子设备
CN113794930B (zh) * 2021-09-10 2023-11-24 中国联合网络通信集团有限公司 视频生成方法、装置、设备及存储介质
CN114095668A (zh) * 2021-10-08 2022-02-25 深圳市景阳科技股份有限公司 一种视频播放方法、装置、设备及计算机存储介质
CN113691836B (zh) * 2021-10-26 2022-04-01 阿里巴巴达摩院(杭州)科技有限公司 视频模板生成方法、视频生成方法、装置和电子设备
CN114463673B (zh) * 2021-12-31 2023-04-07 深圳市东信时代信息技术有限公司 素材推荐方法、装置、设备及存储介质
CN114496173A (zh) * 2021-12-31 2022-05-13 北京航天长峰股份有限公司 短视频手术报告生成方法、装置、计算机设备及存储介质
CN114666657B (zh) * 2022-03-18 2024-03-19 北京达佳互联信息技术有限公司 一种视频剪辑方法、装置、电子设备及存储介质
CN114928753A (zh) * 2022-04-12 2022-08-19 广州阿凡提电子科技有限公司 一种视频拆分处理方法、系统及装置
CN114465737B (zh) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 一种数据处理方法、装置、计算机设备及存储介质
CN115086760A (zh) * 2022-05-18 2022-09-20 阿里巴巴(中国)有限公司 直播视频剪辑方法、装置及设备
CN115278306A (zh) * 2022-06-20 2022-11-01 阿里巴巴(中国)有限公司 视频剪辑方法及装置
CN115119050B (zh) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 一种视频剪辑方法和装置、电子设备和存储介质
CN115659027B (zh) * 2022-10-28 2023-06-20 广州彩蛋文化传媒有限公司 一种基于短视频数据标签的推荐方法、系统及云平台
CN116304179B (zh) * 2023-05-19 2023-08-11 北京大学 一种获取目标视频的数据处理系统
CN116866498B (zh) * 2023-06-15 2024-04-05 天翼爱音乐文化科技有限公司 一种视频模板生成方法、装置、电子设备及存储介质
CN117156079B (zh) * 2023-11-01 2024-01-23 北京美摄网络科技有限公司 视频处理方法、装置、电子设备和可读存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140101551A1 (en) * 2012-10-05 2014-04-10 Google Inc. Stitching videos into an aggregate video
CN110855904B (zh) * 2019-11-26 2021-10-01 Oppo广东移动通信有限公司 视频处理方法、电子装置和存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593464A (zh) * 2013-11-25 2014-02-19 华中科技大学 基于视觉特征的视频指纹检测及视频序列匹配方法及系统
WO2018107914A1 (zh) * 2016-12-16 2018-06-21 中兴通讯股份有限公司 一种视频分析平台、匹配方法、精准投放广告方法及系统
US20200322684A1 (en) * 2017-12-07 2020-10-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Video recommendation method and apparatus
CN110139159A (zh) * 2019-06-21 2019-08-16 上海摩象网络科技有限公司 视频素材的处理方法、装置及存储介质
CN111105819A (zh) * 2019-12-13 2020-05-05 北京达佳互联信息技术有限公司 剪辑模板的推荐方法、装置、电子设备及存储介质
CN111866585A (zh) * 2020-06-22 2020-10-30 北京美摄网络科技有限公司 一种视频处理方法及装置
CN112565825A (zh) * 2020-12-02 2021-03-26 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、设备以及介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086783A (zh) * 2022-06-28 2022-09-20 北京奇艺世纪科技有限公司 一种视频生成方法、装置及电子设备
CN115086783B (zh) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 一种视频生成方法、装置及电子设备

Also Published As

Publication number Publication date
CN112565825A (zh) 2021-03-26
CN112565825B (zh) 2022-05-13
US20230012732A1 (en) 2023-01-19

Similar Documents

Publication Publication Date Title
WO2022116888A1 (zh) 一种视频数据处理方法、装置、设备以及介质
WO2021238631A1 (zh) 物品信息的显示方法、装置、设备及可读存储介质
WO2020119350A1 (zh) 视频分类方法、装置、计算机设备和存储介质
US8457368B2 (en) System and method of object recognition and database population for video indexing
US10296534B2 (en) Storing and searching fingerprints derived from media content based on a classification of the media content
WO2022184117A1 (zh) 基于深度学习的视频剪辑方法、相关设备及存储介质
US10104345B2 (en) Data-enhanced video viewing system and methods for computer vision processing
US20170065888A1 (en) Identifying And Extracting Video Game Highlights
US20140324840A1 (en) System and method for linking multimedia data elements to web pages
CN111062871A (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
US20130259399A1 (en) Video recommendation system and method thereof
JP2001155169A (ja) ビデオ画像の分割、分類、および要約のための方法およびシステム
CN113010703B (zh) 一种信息推荐方法、装置、电子设备和存储介质
CN103426003A (zh) 增强现实交互的实现方法和系统
CN109408672B (zh) 一种文章生成方法、装置、服务器及存储介质
US20220107978A1 (en) Method for recommending video content
KR101949881B1 (ko) 이미지 및 영상의 등록, 검색, 재생을 모바일 디바이스 및 서버에서 분할하여 수행하는 컨벌루션 인공신경망 기반 인식 시스템
CN113766299B (zh) 一种视频数据播放方法、装置、设备以及介质
CN110796098A (zh) 内容审核模型的训练及审核方法、装置、设备和存储介质
WO2021007846A1 (zh) 一种视频相似检测的方法、装置及设备
CN114339360B (zh) 一种视频处理的方法、相关装置及设备
CN114845149B (zh) 视频片段的剪辑方法、视频推荐方法、装置、设备及介质
Jin et al. Network video summarization based on key frame extraction via superpixel segmentation
JP4995770B2 (ja) 画像辞書生成装置,画像辞書生成方法,および画像辞書生成プログラム
CN116980665A (zh) 一种视频处理方法、装置、计算机设备、介质及产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899921

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899921

Country of ref document: EP

Kind code of ref document: A1