WO2022033534A1 - Method, apparatus, server and medium for generating a target video - Google Patents

Method, apparatus, server and medium for generating a target video

Info

Publication number
WO2022033534A1
Authority
WO
WIPO (PCT)
Prior art keywords
metric value
video
interaction
data
preset
Prior art date
Application number
PCT/CN2021/112140
Other languages
English (en)
French (fr)
Inventor
兰枫
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 filed Critical 北京字节跳动网络技术有限公司
Priority to EP21855585.2A (EP4178135A4)
Priority to JP2023507247A (JP2023535989A)
Publication of WO2022033534A1
Priority to US17/819,507 (US11750898B2)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L65/613Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio for the control of the source by the destination
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80Responding to QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2621Cameras specially adapted for the electronic generation of special effects during image pickup, e.g. digital cameras, camcorders, video cameras having integrated special effects capability

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a method, an apparatus, a server, and a medium for generating a target video.
  • a related method is usually to first save the live streaming data as a long video file, and then manually intercept the required segments from the long video file to generate a short video.
  • the embodiments of the present application propose a method, an apparatus, a server, and a medium for generating a target video.
  • an embodiment of the present application provides a method for generating a target video, the method comprising: acquiring live streaming data, wherein the live streaming data includes at least one of voice data, live interaction data, and video data; processing the live streaming data, and generating at least one of a corresponding voice metric value, interaction metric value, and video metric value according to the target object included in the processing result; generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, interaction metric value, and video metric value; and, in response to determining that the comprehensive metric value of the live streaming data meets a preset condition, generating a target video based on the live streaming data.
  • the above-mentioned processing of the live streaming data and generating the corresponding video metric value according to the target object included in the processing result includes: performing image recognition on the video frames in the video data, and respectively determining the number of images belonging to the first preset category and the number of images belonging to the second preset category; and generating the video metric value according to the determined numbers of images belonging to the first preset category and images belonging to the second preset category.
  • generating the video metric value according to the determined numbers of images belonging to the first preset category and images belonging to the second preset category includes: acquiring the preset image weight values corresponding to the first preset category images and the second preset category images respectively; and performing a weighted summation of the determined numbers of images belonging to the first preset category and images belonging to the second preset category and the corresponding preset image weight values to generate the above video metric value.
  • the above-mentioned live streaming data includes voice data; and the above-mentioned processing of the live streaming data and generating a corresponding voice metric value according to the target object included in the processing result includes: performing speech recognition on the voice data to generate speech recognition text; respectively determining the numbers of texts included in the speech recognition text that belong to the first preset category text and to the second preset category text; and generating the voice metric value according to the determined numbers of texts belonging to the first preset category text and the second preset category text.
  • the above-mentioned live streaming data includes live interaction data; and the above-mentioned processing of the live streaming data and generating a corresponding interaction metric value according to the target object included in the processing result includes: determining the number of target interaction behaviors indicated by the live interaction data; and generating the interaction metric value according to the determined number of target interaction behaviors.
  • the target interaction behavior includes at least two of a first preset interaction behavior, a second preset interaction behavior, and a third preset interaction behavior; and the above-mentioned generating of the interaction metric value according to the determined number of target interaction behaviors includes: performing a weighted summation according to the determined numbers of target interaction behaviors and the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior to generate the interaction metric value.
  • generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, interaction metric value, and video metric value includes: acquiring a preset metric weight corresponding to at least one of the generated voice metric value, interaction metric value, and video metric value; normalizing at least one of the generated voice metric value, interaction metric value, and video metric value; and performing a weighted summation on at least one of the normalized voice metric value, interaction metric value, and video metric value to generate the comprehensive metric value of the live streaming data.
  • the above preset condition includes: the number of live stream slices that satisfy a comprehensive metric value condition in the live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to the live stream slice is smaller than the comprehensive metric value of the live streaming data.
  • the above-mentioned live streaming data includes voice data; and the above-mentioned generating of the target video based on the live streaming data includes: determining the start and end positions for editing the live streaming data according to the sentence integrity of the recognized text corresponding to the voice data; and generating the target video based on the edited live streaming data.
  • generating the target video based on the edited live streaming data includes: adding special effects to the edited live streaming data to generate the target video.
  • an embodiment of the present application provides an apparatus for generating a target video, the apparatus comprising: an acquisition unit configured to acquire live streaming data, wherein the live streaming data includes at least one of voice data, live interaction data, and video data; a processing unit configured to process the live streaming data, and generate at least one of a corresponding voice metric value, interaction metric value, and video metric value according to the target object included in the processing result; a first generating unit configured to generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, interaction metric value, and video metric value; and a second generating unit configured to generate a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition.
  • the above-mentioned processing unit includes: a first identification subunit, configured to perform image recognition on video frames in the video data, and respectively determine the number of images belonging to the first preset category and the number of images belonging to the second preset category; and a first generating subunit, configured to generate the video metric value according to the determined numbers of images belonging to the first preset category and images belonging to the second preset category.
  • the above-mentioned first generating subunit includes: an acquisition module, configured to acquire the preset image weight values corresponding to the first preset category image and the second preset category image respectively; and a generation module, configured to perform a weighted summation of the determined numbers of images belonging to the first preset category and images belonging to the second preset category and the corresponding preset image weight values to generate the video metric value.
  • the above-mentioned live streaming data includes speech data; and the above-mentioned processing unit includes: a second recognition subunit, configured to perform speech recognition on the speech data and generate speech recognition text; a first determination subunit, configured to respectively determine the numbers of texts included in the speech recognition text that belong to the first preset category text and to the second preset category text; and a second generating subunit, configured to generate the voice metric value according to the determined numbers of texts belonging to the first preset category text and the second preset category text.
  • the live streaming data includes live interaction data; and the processing unit includes: a second determination subunit, configured to determine the number of target interaction behaviors indicated by the live interaction data; and a third generation subunit, configured to generate an interaction metric value according to the determined number of target interaction behaviors.
  • the target interaction behavior includes at least two of a first preset interaction behavior, a second preset interaction behavior, and a third preset interaction behavior; and the third generation subunit is further configured to: perform a weighted summation according to the determined numbers of target interaction behaviors and the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior to generate the interaction metric value.
  • the above-mentioned first generating unit includes: an obtaining subunit, configured to obtain a preset metric weight corresponding to at least one of the generated voice metric value, interaction metric value, and video metric value; a normalization subunit, configured to normalize at least one of the generated voice metric value, interaction metric value, and video metric value; and a fourth generating subunit, configured to perform a weighted summation on at least one of the normalized voice metric value, interaction metric value, and video metric value to generate the comprehensive metric value of the live streaming data.
  • the above preset condition includes: the number of live stream slices that satisfy a comprehensive metric value condition in the live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to the live stream slice is smaller than the comprehensive metric value of the live streaming data.
  • the above-mentioned live streaming data includes voice data; and the above-mentioned second generating unit includes: a third determination subunit, configured to determine the start and end positions for editing the live streaming data according to the sentence integrity of the recognized text corresponding to the voice data; and a fifth generating subunit, configured to generate the target video based on the edited live streaming data.
  • the above-mentioned fifth generating subunit is further configured to: add special effects to the edited live streaming data to generate the target video.
  • an embodiment of the present application provides a server, the server including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of the implementations of the first aspect.
  • an embodiment of the present application provides a computer-readable medium on which a computer program is stored, which, when executed by a processor, implements the method described in any implementation manner of the first aspect.
  • embodiments of the present application further provide a computer program product, including computer program instructions, where the computer program instructions cause a computer to execute the method described in any implementation manner of the first aspect.
  • the embodiments of the present application further provide a computer program, when the computer program runs on the computer, the computer can execute the method described in any one of the implementation manners of the first aspect.
  • the method, apparatus, server, and medium for generating a target video provided by the embodiments of the present application process at least one of the voice data, live interaction data, and video data included in the acquired live streaming data, generate a comprehensive metric value from at least one of the resulting voice metric value, interaction metric value, and video metric value, and finally generate the target video.
  • On the one hand, automatic generation of the target video is realized; on the other hand, the basis for generating the target video is selected comprehensively from at least one of voice, live interaction, and video, improving the quality and generation efficiency of the generated target video.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
  • FIG. 2 is a flowchart of one embodiment of a method for generating a target video according to the present application
  • FIG. 3 is a schematic diagram of an application scenario of a method for generating a target video according to an embodiment of the present application
  • FIG. 4 is a flowchart of yet another embodiment of a method for generating a target video according to the present application
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a target video according to the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present application.
  • FIG. 1 shows an exemplary architecture 100 to which the method for generating a target video or the apparatus for generating a target video of the present application may be applied.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, and 103 interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, text editing applications, video live-streaming applications, etc.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • If the terminal devices 101, 102, and 103 are hardware, they can be various electronic devices having a display screen and supporting audio and video transmission, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers.
  • If the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as a plurality of software or software modules (eg, software or software modules for providing distributed services), or as a single software or software module. There is no specific limitation here.
  • the server 105 may be a server that provides various services, such as a background server that provides support for live video applications on the terminal devices 101 , 102 , and 103 .
  • the background server can analyze and process the received live streaming data, and feed back the processing results (such as target video) to the terminal device.
  • live streaming data can also be directly stored locally on the server 105, and the server 105 can directly extract the locally stored live streaming data and process it.
  • In this case, the above-mentioned exemplary architecture may not include the terminal devices 101, 102, 103 and the network 104.
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers, or can be implemented as a single server.
  • If the server is software, it can be implemented as a plurality of software or software modules (for example, software or software modules for providing distributed services), or can be implemented as a single software or software module. There is no specific limitation here.
  • the method for generating a target video provided by the embodiments of the present application is generally executed by the server 105 , and accordingly, the apparatus for generating the target video is generally set in the server 105 .
  • the terminals 101, 102 and 103 can be used to execute the method for generating the target video; the terminal 101 can also collect live stream data and send the collected live stream data to the server 105, so that the server 105 executes the method for generating the target video.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • a flow 200 of one embodiment of a method for generating a target video according to the present application is shown.
  • the method for generating a target video includes the following steps:
  • Step 201: Acquire live streaming data.
  • the execution body of the method for generating the target video may acquire live streaming data through a wired connection or a wireless connection.
  • the above-mentioned live streaming data may include at least one of voice data, live interaction data, and video data. Therefore, the above-mentioned live streaming data may include voice data and video data, may include live interaction data and video data, or may include voice data, live interaction data, and video data.
  • the above-mentioned voice data is usually time-synchronized with the above-mentioned video data.
  • the above-mentioned live broadcast interaction data may include data used to record the interaction between the host and the audience in the live broadcast.
  • the above-mentioned live interaction data may include, but is not limited to, at least one of the following: the number of barrages in a preset time period (for example, per minute), the number of endorsements (such as likes or gifts) in the preset time period, and the number of comments and messages in the preset time period.
  • the above-mentioned execution body may acquire live streaming data in real time from an electronic device (eg, the terminal device shown in FIG. 1 ) that is communicatively connected to it.
  • the above-mentioned execution body may acquire live streaming data pre-stored locally.
  • the above-mentioned live streaming data may be pre-stored and obtained by performing video slicing on historical live streaming data.
  • the above-mentioned video slice may also correspond to the start time and the end time in the above-mentioned historical live streaming data.
  • Step 202: Process the live streaming data, and generate at least one of a corresponding voice metric value, an interaction metric value, and a video metric value according to the target object included in the processing result.
  • the above-mentioned execution body can process the live streaming data obtained in the above-mentioned step 201 in various ways; and according to the target object included in the processing result, the above-mentioned execution body can generate at least one of a corresponding voice metric value, interaction metric value, and video metric value.
  • the above-mentioned execution subject may extract acoustic features from the voice data in the acquired live streaming data in various ways.
  • the above acoustic features may include but are not limited to at least one of the following: Mel Frequency Cepstrum Coefficient (MFCC), Linear Prediction Cepstrum Coefficient (LPCC), pitch, timbre, and loudness.
  • The above-mentioned execution body can generate the speech metric value corresponding to the extracted acoustic features in various ways.
  • the above-mentioned executive body may use a pre-trained artificial neural network to generate the speech metric value corresponding to the above-mentioned extracted acoustic feature.
  • the above artificial neural network can be trained by using the voice data corresponding to the highlight clips of the historical live streaming data as positive samples and the voice data corresponding to ordinary clips as negative samples.
  • the above-mentioned speech metric value may be a value between 0 and 1.
  • the execution body may compare the extracted acoustic features with preset thresholds corresponding to the respective features, and then generate the speech metric value according to the number of features that exceed their corresponding preset thresholds.
  • the above-mentioned execution body may process the live broadcast interaction data in the acquired live broadcast stream data in various ways to generate a corresponding interaction metric value.
  • the above-mentioned executive body may use the number of time periods in which the number of barrages or comments exceeds a preset threshold as the above-mentioned interaction metric value.
  • As an example, suppose the above live streaming data includes 5 minutes of data: the number of barrages in the 0th to 1st minute is 15, in the 1st to 2nd minute is 28, in the 2nd to 3rd minute is 85, in the 3rd to 4th minute is 66, and in the 4th to 5th minute is 32. Assuming that the above-mentioned preset threshold is 50, the above-mentioned executive body may determine the above-mentioned interaction metric value to be 2.
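  • As a minimal sketch of the worked example above (assuming the per-minute barrage counts are already available as a list), the interaction metric value is the number of one-minute periods whose barrage count exceeds the preset threshold:

```python
# Worked example from the text: five per-minute barrage counts and a
# preset threshold of 50. The interaction metric value is the number of
# one-minute periods whose count exceeds the threshold.
barrage_counts_per_minute = [15, 28, 85, 66, 32]
PRESET_THRESHOLD = 50

interaction_metric = sum(1 for count in barrage_counts_per_minute
                         if count > PRESET_THRESHOLD)
print(interaction_metric)  # 2 (the 2nd-3rd and 3rd-4th minute periods exceed 50)
```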
  • the above-mentioned execution body may process video data in the acquired live streaming data in various ways to generate corresponding video metric values.
  • the above-mentioned executive body may determine the number of video frames including the target image in the above-mentioned live streaming data.
  • the video metric value corresponding to the live streaming data is generated according to the determined number of video frames including the target image.
  • the above-mentioned execution body may process the live streaming data according to the following steps, and generate a corresponding video metric value according to the target object included in the processing result:
  • image recognition is performed on the video frames in the video data, and the number of images belonging to the first preset category and the number of images belonging to the second preset category are respectively determined.
  • the above-mentioned execution body can use various image recognition methods to perform image recognition on the video frames in the above-mentioned video data, and respectively determine the number of images belonging to the first preset category and the number of images belonging to the second preset category.
  • the above-mentioned first preset category image and second preset category image may include preset images associated with application scenarios.
  • As an example, the first preset category image may be an image of a dunk, and the second preset category image may be an image of a shot from outside the three-point line.
  • the above-mentioned execution body may extract frames from the above-mentioned video data, for example, extracting 1 frame every 10 frames, and then perform image recognition only on the extracted frames, thereby saving computing resources.
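  • A minimal sketch of this frame-sampling step, assuming the video has already been decoded into a list of frames (the frame list and the step size of 10 are illustrative):

```python
# Keep every 10th frame so that image recognition runs on only a
# fraction of the video, saving computing resources.
def sample_frames(video_frames: list, step: int = 10) -> list:
    return video_frames[::step]

decoded_frames = list(range(100))          # stand-in for 100 decoded frames
print(len(sample_frames(decoded_frames)))  # 10
```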
  • the above-mentioned first preset category image may include images used to represent the sale of commodities, such as commodity images, price tags, and the like.
  • the above-mentioned second preset category image may include a preset person image.
  • the above-mentioned preset character may be, for example, a host. Therefore, the above method can provide a technical basis for identifying highlight moments in live video sales from the perspective of image recognition.
  • a video metric value is generated according to the determined numbers of images belonging to the first preset category and images belonging to the second preset category.
  • the above-mentioned executive body may generate the video metric value in various ways.
  • the above-mentioned execution subject may select the larger value of the above-determined number of images belonging to the first preset category and the number of images belonging to the second preset category as the video metric value.
  • the above-mentioned execution body may also use the ratio of the above-mentioned larger value to the number of video frames used for image recognition as the above-mentioned video metric value.
  • the above-mentioned execution body may also first obtain the preset image weight values (eg, 1 and 0.5) corresponding to the above-mentioned first preset category image and second preset category image respectively. Then, the executive body may perform a weighted summation of the determined numbers of images belonging to the first preset category and images belonging to the second preset category and the corresponding preset image weight values to generate the video metric value.
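  • A minimal sketch of this weighted summation, using the example preset image weight values (1 and 0.5) mentioned above; the image counts themselves are illustrative assumptions:

```python
# Video metric value as a weighted sum of the per-category image counts.
first_category_count = 12   # images recognized as the first preset category
second_category_count = 30  # images recognized as the second preset category
FIRST_IMAGE_WEIGHT = 1.0    # example weight from the text
SECOND_IMAGE_WEIGHT = 0.5   # example weight from the text

video_metric = (first_category_count * FIRST_IMAGE_WEIGHT
                + second_category_count * SECOND_IMAGE_WEIGHT)
print(video_metric)  # 12 * 1.0 + 30 * 0.5 = 27.0
```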
  • the execution body may process the live stream data according to the following steps, and generate a corresponding voice metric value according to the target object included in the processing result:
  • the first step is to perform speech recognition on the speech data to generate speech recognition text.
  • the above-mentioned execution subject may recognize the speech data included in the live streaming data obtained in the above step 201 through various speech recognition technologies, and generate corresponding speech recognition text.
  • The second step is to respectively determine the number of texts included in the speech recognition text that belong to the first preset category text and the number that belong to the second preset category text.
  • the above-mentioned execution subject may determine the number of texts included in the speech recognition text that belong to the first preset category of text and the number of texts that belong to the second preset category, respectively.
  • the above-mentioned first preset category text and second preset category text may include preset texts associated with application scenarios.
  • As an example, the above-mentioned first preset category text may include preset descriptors such as "strike", "beautiful", and "really wonderful"; the above-mentioned second preset category text may include preset prompt words such as "let's take a look" and "note that".
  • the above-mentioned first preset category text may include commodity description information.
  • the above-mentioned commodity description information may include commodity name, commodity evaluation information (for example, "it's delicious", “easy to use and not expensive”, etc.) and the like.
  • the above-mentioned second preset category text may include preset sales keywords.
  • the above-mentioned preset sales keywords may include, for example, "up link", "come to buy” and the like. Therefore, the above method can provide a technical basis for identifying highlight moments in live video sales from the perspective of speech recognition.
  • the third step is to generate a speech metric value according to the determined numbers of texts belonging to the first preset category and texts belonging to the second preset category.
  • the above-mentioned executive body may generate the speech metric value in various ways.
  • the above-mentioned executive body may select the larger value of the above-determined number of texts belonging to the first preset category and the number of texts belonging to the second preset category as the voice metric value.
  • the above-mentioned execution body may also use the ratio of the above-mentioned larger value to the number of words included in the recognized text as the above-mentioned speech metric value.
  • the above-mentioned execution body may first obtain the text preset weight values corresponding to the above-mentioned first preset category text and second preset category text respectively. Then, the execution body may perform a weighted sum of the determined number of texts belonging to the first preset category and the texts belonging to the second preset category and the corresponding preset text weights to generate the speech metric value.
  • Wherein, the preset text weight value (for example, 1) corresponding to the first preset category text is usually smaller than the preset text weight value (for example, 5) corresponding to the second preset category text.
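  • The voice metric value follows the same weighted-sum pattern; a sketch using the example text weight values (1 and 5) mentioned above, with illustrative text counts:

```python
# Voice metric value as a weighted sum of the per-category text counts.
counts = {"first_category_text": 8, "second_category_text": 3}       # illustrative
weights = {"first_category_text": 1.0, "second_category_text": 5.0}  # example weights

voice_metric = sum(counts[name] * weights[name] for name in counts)
print(voice_metric)  # 8 * 1.0 + 3 * 5.0 = 23.0
```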
  • the execution body may process the live streaming data according to the following steps, and generate a corresponding interaction metric value according to the target object included in the processing result:
  • the first step is to determine the number of target interaction behaviors indicated by the live interaction data.
  • the above-mentioned execution body may determine the number of target interaction behaviors indicated by the live broadcast interaction data included in the live broadcast stream data obtained in the above step 201 according to various methods.
  • the above-mentioned target interaction behaviors may include, but are not limited to, at least one of the following: sending a barrage, expressing approval of the host's behavior (such as liking, sending gifts), sending comments, and writing messages.
  • an interaction metric value is generated according to the determined number of target interaction behaviors.
  • the execution subject may generate the interaction metric value in various ways.
  • the above-mentioned execution body may directly use the above-determined number of target interaction behaviors as the above-mentioned interaction metric value.
  • the above-mentioned execution body may also use the ratio of the above-determined number of target interaction behaviors to a preset value as the above-mentioned interaction metric value.
  • the above-mentioned target interaction behavior may include at least two of a first preset interaction behavior, a second preset interaction behavior, and a third preset interaction behavior.
  • As an example, the above-mentioned execution body may perform a weighted summation of the determined numbers of target interaction behaviors and the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior to generate the above interaction metric value.
  • the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior may include preset interaction behaviors associated with application scenarios.
  • the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior may respectively include the appearance of commodity links in the live broadcast screen, the generation of orders from commodity links provided by the above-mentioned live streaming data, and the sending of barrages. Therefore, the above method can provide a technical basis for identifying highlight moments in live video sales from the perspective of interactive behavior.
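  • A sketch of the weighted summation over the three preset interaction behaviors in the live-sales scenario above; the behavior counts and preset interaction weights are illustrative assumptions:

```python
# Interaction metric value as a weighted sum over preset interaction behaviors.
behavior_counts = {               # illustrative counts per behavior type
    "commodity_link_shown": 4,
    "order_generated": 9,
    "barrage_sent": 120,
}
behavior_weights = {              # illustrative preset interaction weights
    "commodity_link_shown": 2.0,
    "order_generated": 5.0,
    "barrage_sent": 0.1,
}

interaction_metric = sum(behavior_counts[b] * behavior_weights[b]
                         for b in behavior_counts)
print(interaction_metric)  # 4*2.0 + 9*5.0 + 120*0.1 = 65.0
```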
  • Step 203: Generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, the interaction metric value, and the video metric value.
  • the above-mentioned executive body may generate a comprehensive metric value of the live streaming data in various ways.
  • the above-mentioned executive body may select a maximum value from at least one of the generated voice metric value, interaction metric value, and video metric value as the comprehensive metric value of the above-mentioned live streaming data.
  • the above-mentioned execution body may generate a comprehensive metric value of the live streaming data according to the following steps:
  • In the first step, a preset metric weight corresponding to at least one of the generated voice metric value, interaction metric value, and video metric value is acquired.
  • the above-mentioned executive body may first obtain preset metric weights corresponding to at least one of the generated speech metric value, interaction metric value, and video metric value.
  • the above-mentioned preset metric weight may be, for example, 0.3, 0.3, or 0.4.
  • In the second step, at least one of the generated voice metric value, interaction metric value, and video metric value is normalized.
  • The execution body may normalize at least one of the voice metric value, interaction metric value, and video metric value generated in the first step, so that the normalized metric values belong to the same order of magnitude.
  • In the third step, a weighted summation is performed on at least one of the normalized voice metric value, interaction metric value, and video metric value to generate a comprehensive metric value of the live streaming data.
  • The above-mentioned executive body may perform a weighted summation of at least one of the normalized voice metric value, interaction metric value, and video metric value obtained in the second step to generate the comprehensive metric value of the live streaming data.
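  • Putting the three steps together, a minimal sketch with the example preset metric weights (0.3, 0.3, 0.4) mentioned above; min-max normalization and the raw metric values and ranges are illustrative assumptions, since the text does not fix a particular normalization:

```python
# Comprehensive metric value: normalize each metric value, then perform
# a weighted summation with the preset metric weights.
raw_metrics = {"voice": 23.0, "interaction": 65.0, "video": 27.0}   # illustrative
metric_weights = {"voice": 0.3, "interaction": 0.3, "video": 0.4}   # example weights
# Assumed per-metric ranges, e.g. observed over historical live stream slices.
metric_ranges = {"voice": (0.0, 50.0), "interaction": (0.0, 100.0), "video": (0.0, 40.0)}

def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Map `value` into [0, 1] so all metrics share the same order of magnitude."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

comprehensive_metric = sum(
    metric_weights[name] * min_max_normalize(value, *metric_ranges[name])
    for name, value in raw_metrics.items()
)
print(round(comprehensive_metric, 3))  # 0.3*0.46 + 0.3*0.65 + 0.4*0.675 = 0.603
```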
  • Step 204: In response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generate a target video based on the live streaming data.
  • the above-mentioned execution body may generate the target video based on the live stream data in various ways.
  • the foregoing preset condition may include that the comprehensive metric value of the foregoing live streaming data is greater than a preset metric value threshold.
  • the above-mentioned execution body may directly use the above-mentioned live streaming data as the above-mentioned target video.
  • the above-mentioned execution body may perform post-processing on the above-mentioned live streaming data, so as to obtain the above-mentioned target video.
  • the above-mentioned post-processing may include, for example, adding filters, adjusting brightness, adjusting contrast, and the like.
  • the above preset conditions may include: the number of live stream slices that satisfy the comprehensive metric value condition in the live stream slice set associated with the above live stream data is greater than the target number.
  • the above-mentioned comprehensive metric value condition may include that the comprehensive metric value corresponding to the live stream slice is smaller than the above-mentioned comprehensive metric value of the live stream data.
  • the above target number can be any number pre-specified according to actual application requirements.
  • the above-mentioned target number may also be a number determined according to a rule, such as a number obtained by multiplying the number of live stream slices included in the above-mentioned associated live stream slice set by a preset ratio.
  • the above-mentioned live stream slice set associated with the above-mentioned live stream data may include time-segmented live stream data slices obtained from the same live stream information source (eg, a live room id) corresponding to the same live stream. As an example, assume that the live stream slice set associated with the above live stream data includes 10 live stream slices and the above target number is 6. If more than 6 of these live stream slices have comprehensive metric values smaller than the comprehensive metric value of the live stream data, the above preset condition is satisfied.
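  • A sketch of this preset-condition check, mirroring the example above (10 slices in the associated set, target number 6); the per-slice comprehensive metric values are illustrative:

```python
# Preset condition: more than TARGET_NUMBER slices in the associated set
# must have a comprehensive metric value smaller than that of the current
# live stream data.
slice_metrics = [0.21, 0.35, 0.18, 0.42, 0.30,
                 0.55, 0.12, 0.48, 0.27, 0.39]  # 10 associated slices
current_metric = 0.50                           # current live stream data
TARGET_NUMBER = 6

satisfies = sum(1 for m in slice_metrics if m < current_metric) > TARGET_NUMBER
print(satisfies)  # True (9 of the 10 slices are smaller than 0.50)
```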
  • FIG. 3 is a schematic diagram of an application scenario of the method for generating a target video according to an embodiment of the present application.
  • the user 301 uses the terminal device 302 to perform live broadcasting.
  • the terminal device 302 sends the live streaming data 303 to the background server 304 .
  • the above-mentioned live streaming data may include voice data and video data.
  • the background server 304 processes the live streaming data 303, and generates a voice metric value of 80 and a video metric value of 70 (shown as 305 in FIG. 3) according to the included objects that characterize the degree of brilliance (such as the voice "strike" and the dunk image).
  • the background server 304 averages the generated voice metric value and video metric value to generate a comprehensive metric value of 75 (shown as 306 in FIG. 3). Afterwards, since the comprehensive metric value 75 is greater than a preset threshold (for example, 70), the background server 304 may generate a highlight clip video 307 based on the live streaming data 303. Optionally, the background server 304 may also send the generated highlight video 307 to the terminal device 302.
  • one of the existing technologies usually saves the live streaming data as a long video file, and then manually intercepts the required segments from the long video file to generate a short video, which requires a lot of labor costs.
  • the solution described in the above embodiments of the present application generates at least one of a voice metric value, an interaction metric value, and a video metric value, obtains a comprehensive metric value from the generated metric values, and finally generates the target video.
  • the solution provided by the present application realizes the automatic generation of the target video and effectively reduces the labor cost.
  • the solution provided by the present application comprehensively selects the basis for generating the target video from at least one of voice, live interaction, and video, improving the quality and generation efficiency of the generated target video.
  • the process 400 of the method for generating a target video includes the following steps:
  • Step 401: Acquire live streaming data.
  • the above-mentioned live streaming data may include voice data and video data.
  • Step 402: Process the live streaming data, and generate at least one of a corresponding voice metric value, an interaction metric value, and a video metric value according to the target object included in the processing result.
  • Step 403: Generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, the interaction metric value, and the video metric value.
  • steps 401, 402, and 403 are respectively consistent with steps 201, 202, and 203 and their optional implementations in the foregoing embodiment; the above descriptions of steps 201, 202, and 203 and their optional implementations also apply to step 401, step 402, and step 403, and details are not repeated here.
  • Step 404: In response to determining that the comprehensive metric value of the live streaming data meets a preset condition, determine the start and end positions for editing the live streaming data according to the sentence integrity of the recognized text corresponding to the voice data, and generate a target video based on the edited live streaming data.
  • the execution body of the method for generating the target video may generate the target video according to the following steps:
  • In the first step, the start and end positions for clipping the live streaming data are determined.
  • the above-mentioned execution body may first determine the completeness of the sentence according to the recognized text corresponding to the speech data. Then, according to the determined completeness of the sentence, the above-mentioned execution body may determine the start and end positions of the clipping of the above-mentioned live streaming data in various ways. Wherein, the start and end positions of the clip may include the start position and the end position of the clip. As an example, in response to determining that the recognized text sentence corresponding to the voice data is complete (for example, "XX is really delicious"), the execution subject may determine the start and end positions of the voice data as the start and end positions of the clip.
  • As an example, for a sentence having only its second half, the above-mentioned execution body may determine the end position of that sentence as the start position of the clip; for a sentence having only its first half, it may determine the start position of that sentence as the end position of the clip.
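  • A minimal sketch of this boundary rule, assuming the speech recognizer yields time-stamped sentence segments with completeness flags (the segment structure is a hypothetical illustration, not part of the patent):

```python
# Choose clip start/end so the clipped data contains only complete sentences:
# drop a leading sentence that has only its second half, and a trailing
# sentence that has only its first half.
def clip_boundaries(segments: list) -> tuple:
    start_idx, end_idx = 0, len(segments) - 1
    if segments[start_idx]["only_second_half"]:
        start_idx += 1   # clip starts where the broken leading sentence ends
    if segments[end_idx]["only_first_half"]:
        end_idx -= 1     # clip ends where the broken trailing sentence starts
    return segments[start_idx]["start"], segments[end_idx]["end"]

segments = [
    {"start": 0.0, "end": 1.2, "only_second_half": True,  "only_first_half": False},
    {"start": 1.2, "end": 3.5, "only_second_half": False, "only_first_half": False},
    {"start": 3.5, "end": 4.0, "only_second_half": False, "only_first_half": True},
]
print(clip_boundaries(segments))  # (1.2, 3.5)
```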
  • In the second step, a target video is generated based on the edited live streaming data.
  • the above-mentioned execution body may generate the target video based on the edited live stream data in various ways.
  • the foregoing executive body may directly determine the edited live stream data as the foregoing target video.
  • the execution body may perform post-processing on the edited live stream data, and generate a target video according to the post-processed live stream data.
  • the above-mentioned execution body may also add special effects to the edited video stream data to generate the target video.
  • the above special effects may include, but are not limited to, at least one of the following: subtitles, stickers, and transition effects.
  • the process 400 of the method for generating a target video in this embodiment embodies the step of determining the start and end positions for clipping the live streaming data according to the sentence integrity of the recognized text corresponding to the speech data, and the step of generating the target video based on the edited live streaming data. Therefore, the solution described in this embodiment can generate the target video according to the sentence integrity of the recognized text corresponding to the speech data, thereby ensuring the integrity of the sentences in the target video.
  • the present application provides an embodiment of an apparatus for generating a target video, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 or FIG. 4 .
  • the device can be specifically applied to various electronic devices.
  • the apparatus 500 for generating a target video includes an acquiring unit 501 , a processing unit 502 , a first generating unit 503 and a second generating unit 504 .
  • the obtaining unit 501 is configured to obtain live streaming data, wherein the live streaming data includes at least one of voice data, live interactive data and video data;
  • the processing unit 502 is configured to process the live streaming data, and generate at least one of the corresponding voice metric value, interaction metric value, and video metric value according to the target object included in the processing result;
  • the first generating unit 503 is configured to generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value, interaction metric value, and video metric value;
  • the second generating unit 504 is configured to generate a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition.
  • For the specific processing of these units and the beneficial effects brought thereby, reference may be made to the relevant descriptions of step 201, step 202, step 203, and step 204 in the embodiment corresponding to FIG. 2, which are not repeated here.
  • the foregoing processing unit 502 may include a first identifying subunit (not shown in the figure) and a first generating subunit (not shown in the figure).
  • the above-mentioned first identification subunit may be configured to perform image identification on video frames in the video data, and determine the number of images belonging to the first preset category and images belonging to the second preset category, respectively.
  • the above-mentioned first generating subunit may be configured to generate a video metric value according to the determined number of images belonging to the first preset category and images belonging to the second preset category.
  • the above-mentioned first generating subunit may include an acquiring module (not shown in the figure) and a generating module (not shown in the figure).
  • the above obtaining module may be configured to obtain preset image weight values corresponding to the first preset category image and the second preset category image respectively.
  • the above generating module may be configured to perform a weighted summation of the determined numbers of images belonging to the first preset category and images belonging to the second preset category and the corresponding preset image weight values to generate the video metric value.
  • the foregoing live streaming data may include voice data.
  • the above-mentioned processing unit 502 may include a second identification subunit (not shown in the figure), a first determination subunit (not shown in the figure), and a second generation subunit (not shown in the figure).
  • the above-mentioned second recognition subunit may be configured to perform speech recognition on speech data to generate speech recognition text.
  • the above-mentioned first determining subunit may be configured to respectively determine the number of texts included in the speech recognition text that belong to the first preset category of texts and the number of texts that belong to the second preset category of texts.
  • the above-mentioned second generating subunit may be configured to generate a speech metric value according to the determined number of texts belonging to the first preset category and texts belonging to the second preset category.
  • the foregoing live streaming data may include live streaming interaction data.
  • the above-mentioned processing unit 502 may include a second determination subunit (not shown in the figure) and a third generation subunit (not shown in the figure).
  • the above-mentioned second determination subunit may be configured to determine the number of target interaction behaviors indicated by the live interaction data.
  • the above-mentioned third generating subunit may be configured to generate an interaction metric value according to the determined number of target interaction behaviors.
  • the above-mentioned target interaction behavior may include at least two of a first preset interaction behavior, a second preset interaction behavior, and a third preset interaction behavior.
  • the above-mentioned third generating subunit may be further configured to: perform a weighted summation according to the determined numbers of target interaction behaviors and the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior, and the third preset interaction behavior to generate the interaction metric value.
  • the foregoing first generating unit 503 may include an acquiring subunit (not shown in the figure), a normalizing subunit (not shown in the figure), and a fourth generating subunit (not shown in the figure).
  • the obtaining subunit may be configured to obtain preset metric weights corresponding to at least one of the generated voice metric value, interaction metric value, and video metric value.
  • the above-mentioned normalization subunit may be configured to normalize at least one of the generated speech metric value, interaction metric value, and video metric value.
  • the above-mentioned fourth generating subunit may be configured to perform weighted summation of at least one of the normalized voice metric value, the interaction metric value, and the video metric value to generate a comprehensive metric value of the live streaming data.
  • the above-mentioned preset conditions may include: the number of live stream slices that satisfy the comprehensive metric value condition in the live stream slice set associated with the live stream data is greater than the target number, wherein the comprehensive metric The value condition includes that the comprehensive metric value corresponding to the live stream slice is smaller than the comprehensive metric value of the live stream data.
  • the foregoing live streaming data may include voice data.
  • the above-mentioned second generation unit 504 may include a third determination subunit (not shown in the figure) and a fifth generation subunit (not shown in the figure).
  • the above-mentioned third determination subunit may be configured to determine the start and end positions of the clipping of the live streaming data according to the sentence integrity of the recognized text corresponding to the speech data.
  • the above-mentioned fifth generating subunit may be configured to generate the target video based on the edited live stream data.
  • the fifth generating subunit may be further configured to add special effects to the edited video stream data to generate the target video.
  • the apparatus provided by the above embodiment of the present application acquires live streaming data through the acquiring unit 501.
  • the live streaming data includes at least one item of voice data and live interaction data, and video data.
  • the processing unit 502 processes the live streaming data, and generates, according to the target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value.
  • the first generating unit 503 generates a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value.
  • the second generating unit 504 generates a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies the preset condition.
  • thereby, on the one hand, automatic generation of the target video is realized; on the other hand, the basis for generating the target video is selected comprehensively from at least one of voice and live interaction, as well as video, which improves the quality and the generation efficiency of the generated target video.
  • Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device (e.g., the server in FIG. 1) 600 suitable for implementing embodiments of the present application.
  • Terminal devices in the embodiments of the present application may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (portable Android devices, i.e., tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (such as in-vehicle navigation terminals), as well as stationary terminals such as digital TVs, desktop computers, and the like.
  • the server shown in FIG. 6 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, and the like; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage device 608 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 shows the electronic device 600 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
  • embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present application are executed.
  • the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned server; or may exist alone without being assembled into the server.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the server, the server is caused to: acquire live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; process the live streaming data, and generate, according to the target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value; and in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generate a target video based on the live streaming data.
  • the present disclosure also provides a computer program, the computer program causing a computer to execute the method for generating a target video provided by the above embodiments.
  • Computer program code for performing the operations of the embodiments of the present application may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language, Python, or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present application may be implemented in a software manner, and may also be implemented in a hardware manner.
  • the described unit may also be provided in a processor, for example, it may be described as: a processor, including an acquiring unit, a processing unit, a first generating unit, and a second generating unit.
  • the names of these units do not, in some cases, constitute a limitation on the units themselves.
  • the acquisition unit can also be described as "a unit for acquiring live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data".

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Devices (AREA)

Abstract

Embodiments of the present application disclose a method, apparatus, server, and medium for generating a target video. A specific implementation of the method includes: acquiring live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; processing the live streaming data, and generating, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; generating a comprehensive metric value of the live streaming data according to the generated metric values; and in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generating a target video based on the live streaming data. This implementation realizes automatic generation of the target video, and selects the basis for generating the target video comprehensively from multiple aspects including at least one of voice and live interaction, as well as video, thereby improving the quality and generation efficiency of the generated target video.

Description

METHOD, APPARATUS, SERVER AND MEDIUM FOR GENERATING A TARGET VIDEO
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to the Chinese patent application No. 202010806612.5, filed with the Chinese Patent Office on August 12, 2020 and entitled "Method, Apparatus, Server and Medium for Generating a Target Video", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
Embodiments of the present application relate to the field of computer technology, and in particular to a method, apparatus, server and medium for generating a target video.
BACKGROUND
With the rapid development of Internet technology, live video streaming is used more and more widely.
A related approach is usually to first save live streaming data as a long video file, and then manually extract the desired segments from the long video file to generate short videos.
SUMMARY
Embodiments of the present application propose a method, apparatus, server and medium for generating a target video.
In a first aspect, an embodiment of the present application provides a method for generating a target video, the method including: acquiring live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; processing the live streaming data, and generating, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value; and in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generating a target video based on the live streaming data.
In some embodiments, processing the live streaming data and generating a corresponding video metric value according to the target object included in the processing result includes: performing image recognition on video frames in the video data, and determining the numbers of images belonging to a first preset category and images belonging to a second preset category, respectively; and generating the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category.
In some embodiments, generating the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category includes: acquiring preset image weight values respectively corresponding to the first preset category of images and the second preset category of images; and performing a weighted summation of the determined numbers of images belonging to the first preset category and belonging to the second preset category with the respectively corresponding preset image weight values, to generate the video metric value.
In some embodiments, the live streaming data includes voice data; and processing the live streaming data and generating a corresponding voice metric value according to the target object included in the processing result includes: performing speech recognition on the voice data to generate a speech recognition text; determining the numbers of texts included in the speech recognition text that belong to a first preset category of text and to a second preset category of text, respectively; and generating the voice metric value according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text.
In some embodiments, the live streaming data includes live interaction data; and processing the live streaming data and generating a corresponding interaction metric value according to the target object included in the processing result includes: determining the number of target interaction behaviors indicated by the live interaction data; and generating the interaction metric value according to the determined number of target interaction behaviors.
In some embodiments, the target interaction behaviors include at least two of a first preset interaction behavior, a second preset interaction behavior and a third preset interaction behavior; and generating the interaction metric value according to the determined number of target interaction behaviors includes: performing a weighted summation of the determined numbers of target interaction behaviors with the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior, to generate the interaction metric value.
In some embodiments, generating the comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value includes: acquiring preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value; normalizing at least one of the generated voice metric value and interaction metric value, and the video metric value; and performing a weighted summation of at least one of the normalized voice metric value and interaction metric value, and the video metric value, to generate the comprehensive metric value of the live streaming data.
In some embodiments, the preset condition includes: the number of live stream slices satisfying a comprehensive metric value condition in a live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to a live stream slice is smaller than the comprehensive metric value of the live streaming data.
In some embodiments, the live streaming data includes voice data; and generating the target video based on the live streaming data includes: determining clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data; and generating the target video based on the clipped live streaming data.
In some embodiments, generating the target video based on the clipped live streaming data includes: adding special effects to the clipped video stream data to generate the target video.
In a second aspect, an embodiment of the present application provides an apparatus for generating a target video, the apparatus including: an acquiring unit configured to acquire live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; a processing unit configured to process the live streaming data and generate, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; a first generating unit configured to generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value; and a second generating unit configured to generate a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition.
In some embodiments, the processing unit includes: a first recognition subunit configured to perform image recognition on video frames in the video data and determine the numbers of images belonging to a first preset category and belonging to a second preset category, respectively; and a first generation subunit configured to generate the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category.
In some embodiments, the first generation subunit includes: an acquiring module configured to acquire preset image weight values respectively corresponding to the first preset category of images and the second preset category of images; and a generating module configured to perform a weighted summation of the determined numbers of images belonging to the first preset category and belonging to the second preset category with the respectively corresponding preset image weight values, to generate the video metric value.
In some embodiments, the live streaming data includes voice data; and the processing unit includes: a second recognition subunit configured to perform speech recognition on the voice data to generate a speech recognition text; a first determination subunit configured to determine the numbers of texts included in the speech recognition text that belong to a first preset category of text and to a second preset category of text, respectively; and a second generation subunit configured to generate the voice metric value according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text.
In some embodiments, the live streaming data includes live interaction data; and the processing unit includes: a second determination subunit configured to determine the number of target interaction behaviors indicated by the live interaction data; and a third generation subunit configured to generate the interaction metric value according to the determined number of target interaction behaviors.
In some embodiments, the target interaction behaviors include at least two of a first preset interaction behavior, a second preset interaction behavior and a third preset interaction behavior; and the third generation subunit is further configured to: perform a weighted summation of the determined numbers of target interaction behaviors with the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior, to generate the interaction metric value.
In some embodiments, the first generating unit includes: an acquiring subunit configured to acquire preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value; a normalization subunit configured to normalize at least one of the generated voice metric value and interaction metric value, and the video metric value; and a fourth generation subunit configured to perform a weighted summation of at least one of the normalized voice metric value and interaction metric value, and the video metric value, to generate the comprehensive metric value of the live streaming data.
In some embodiments, the preset condition includes: the number of live stream slices satisfying a comprehensive metric value condition in the live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to a live stream slice is smaller than the comprehensive metric value of the live streaming data.
In some embodiments, the live streaming data includes voice data; and the second generating unit includes: a third determination subunit configured to determine clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data; and a fifth generation subunit configured to generate the target video based on the clipped live streaming data.
In some embodiments, the fifth generation subunit is further configured to: add special effects to the clipped video stream data to generate the target video.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method described in any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application further provides a computer program product including computer program instructions that cause a computer to execute the method described in any implementation of the first aspect.
In a sixth aspect, an embodiment of the present application further provides a computer program that, when run on a computer, causes the computer to execute the method described in any implementation of the first aspect.
According to the method, apparatus, server and medium for generating a target video provided by the embodiments of the present application, at least one item of voice data and live interaction data, and video data included in the acquired live streaming data are processed separately to generate a comprehensive metric value obtained from at least one of a voice metric value and an interaction metric value, and a video metric value, and a target video is finally generated. On the one hand, automatic generation of the target video is thus realized; on the other hand, the basis for generating the target video is selected comprehensively from multiple aspects including at least one of voice and live interaction, as well as video, which improves the quality and the generation efficiency of the generated target video.
BRIEF DESCRIPTION OF THE DRAWINGS
Other features, objects and advantages of the present application will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flowchart of an embodiment of a method for generating a target video according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the method for generating a target video according to an embodiment of the present application;
FIG. 4 is a flowchart of yet another embodiment of the method for generating a target video according to the present application;
FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a target video according to the present application;
FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present application.
DETAILED DESCRIPTION
The present application will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely used to explain the present disclosure, rather than to limit it. It should also be noted that, for ease of description, only the parts related to the present disclosure are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary architecture 100 to which the method for generating a target video or the apparatus for generating a target video of the present application may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
The terminal devices 101, 102, 103 interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, text editing applications and live video streaming applications.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices that have a display screen and support audio and video transmission, including but not limited to smartphones, tablet computers, laptop portable computers, desktop computers and so on. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 105 may be a server that provides various services, for example, a backend server that provides support for the live video streaming applications on the terminal devices 101, 102, 103. The backend server may analyze and otherwise process the received live streaming data, and feed the processing result (such as the target video) back to the terminal devices.
It should be noted that the live streaming data may also be directly stored locally on the server 105, and the server 105 may directly extract and process the locally stored live streaming data; in this case, the terminal devices 101, 102, 103 and the network 104 may be absent.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
It should be noted that the method for generating a target video provided by the embodiments of the present application is generally executed by the server 105, and accordingly, the apparatus for generating a target video is generally provided in the server 105.
It should be noted that the terminals 101, 102, 103 may be used to execute the method for generating a target video; the terminal 101 may also collect live streaming data and send the collected live streaming data to the server 105, so that the server 105 executes the method for generating a target video.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to FIG. 2, a flow 200 of an embodiment of a method for generating a target video according to the present application is shown. The method for generating a target video includes the following steps:
Step 201: acquiring live streaming data.
In this embodiment, the execution body of the method for generating a target video (such as the server 105 shown in FIG. 1) may acquire the live streaming data through a wired or wireless connection. The live streaming data may include at least one item of voice data and live interaction data, and video data. Thus, the live streaming data may include voice data and video data, may include live interaction data and video data, or may include voice data, live interaction data and video data. The voice data is usually synchronized in time with the video data. The live interaction data may include data used to record interactions between the streamer and the audience during the live stream, and may include, but is not limited to, at least one of the following: the number of bullet-screen comments (danmaku) in a preset period (for example, per minute), the number of actions expressing approval of the streamer (for example, likes or gifts) in a preset period (for example, per minute), and the number of comments and messages in a preset period (for example, per minute).
As an example, the execution body may acquire live streaming data in real time from an electronic device communicatively connected to it (for example, the terminal device shown in FIG. 1). As yet another example, the execution body may acquire pre-stored local live streaming data, which may be video slices obtained in advance by slicing historical live streaming data. Each video slice may also correspond to a start time and an end time in the historical live streaming data.
Step 202: processing the live streaming data, and generating, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value.
In this embodiment, the execution body may process the live streaming data acquired in step 201 in various ways, and according to the target object included in the processing result, generate at least one of a corresponding voice metric value and an interaction metric value, and a video metric value.
In this embodiment, as an example, the execution body may extract acoustic features from the voice data in the acquired live streaming data in various ways. The acoustic features may include, but are not limited to, at least one of the following: Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), pitch, timbre and loudness. The execution body may then generate a voice metric value corresponding to the extracted acoustic features in various ways. For example, the execution body may use a pre-trained artificial neural network to generate the voice metric value corresponding to the extracted acoustic features, where the artificial neural network may be trained with the voice data corresponding to highlight segments of historical live streaming data as positive samples and the voice data corresponding to ordinary segments as negative samples; the voice metric value may be a value between 0 and 1. As another example, the execution body may compare each extracted acoustic feature with a preset threshold corresponding to that acoustic feature, and then generate the voice metric value corresponding to the extracted acoustic features according to the number of features that are greater than their corresponding preset thresholds.
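As an illustrative aid only (not part of the original disclosure), the threshold-comparison variant described above can be sketched in Python. The feature names, threshold values, and the input dictionaries are hypothetical assumptions; only the counting rule follows the text.

```python
# Minimal sketch of the threshold-comparison variant: the voice metric value
# is the count of extracted acoustic features exceeding their preset thresholds.
# Feature names and threshold values here are hypothetical examples.

def voice_metric_by_thresholds(acoustic_features: dict, thresholds: dict) -> int:
    return sum(
        1
        for name, value in acoustic_features.items()
        if value > thresholds.get(name, float("inf"))
    )

features = {"mfcc_energy": 0.8, "pitch": 220.0, "loudness": 0.65}
thresholds = {"mfcc_energy": 0.5, "pitch": 300.0, "loudness": 0.6}
print(voice_metric_by_thresholds(features, thresholds))  # 2 features exceed their thresholds
```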
In this embodiment, as an example, the execution body may process the live interaction data in the acquired live streaming data in various ways to generate a corresponding interaction metric value. For example, the execution body may take as the interaction metric value the number of periods in which the number of bullet-screen comments or comments exceeds a preset threshold. For example, suppose the live streaming data includes 5 minutes of data, with 15 bullet-screen comments in minute 0-1, 28 in minute 1-2, 85 in minute 2-3, 66 in minute 3-4 and 32 in minute 4-5. Assuming the preset threshold is 50, the execution body may determine the interaction metric value to be 2.
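As an illustrative aid only (not part of the original disclosure), the per-period counting in the worked example above can be sketched in Python; the variable names are hypothetical:

```python
# Interaction metric as the number of one-minute periods whose bullet-screen
# (danmaku) count exceeds a preset threshold, reproducing the example above.
danmaku_per_minute = [15, 28, 85, 66, 32]
THRESHOLD = 50

interaction_metric = sum(1 for count in danmaku_per_minute if count > THRESHOLD)
print(interaction_metric)  # 2 (minutes 2-3 and 3-4 exceed the threshold)
```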
In this embodiment, as an example, the execution body may process the video data in the acquired live streaming data in various ways to generate a corresponding video metric value. For example, the execution body may determine the number of video frames in the live streaming data that include a target image, and generate the video metric value corresponding to the live streaming data according to the determined number of video frames including the target image.
In some optional implementations of this embodiment, the execution body may process the live streaming data and generate a corresponding video metric value according to the target object included in the processing result as follows:
First, performing image recognition on video frames in the video data, and determining the numbers of images belonging to a first preset category and belonging to a second preset category, respectively.
In these implementations, the execution body may use various image recognition methods to perform image recognition on the video frames in the video data, and determine the numbers of images belonging to the first preset category and belonging to the second preset category, respectively. The first preset category of images and the second preset category of images may include preset images associated with the application scenario. As an example, the first preset category of images may be slam-dunk images, and the second preset category of images may be images of shots taken from outside the three-point line.
Optionally, the execution body may sample frames from the video data, for example extracting 1 frame out of every 10, and then perform image recognition on the extracted video frames, thereby saving computing resources.
Optionally, the first preset category of images may include images representing the sale of goods, such as product images and price tags. The second preset category of images may include images of a preset person, where the preset person may be, for example, the streamer. Thus, the method can provide a technical basis, from the perspective of image recognition, for identifying highlight moments in live selling videos.
Second, generating the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category.
In these implementations, the execution body may generate the video metric value in various ways according to the determined numbers of images belonging to the first preset category and belonging to the second preset category. As an example, the execution body may select the larger of the determined numbers of images belonging to the first preset category and belonging to the second preset category as the video metric value. As yet another example, the execution body may take as the video metric value the ratio of the selected larger value to the number of video frames on which image recognition is performed.
Optionally, the execution body may also first acquire preset image weight values respectively corresponding to the first preset category of images and the second preset category of images (for example, 1 and 0.5), and then perform a weighted summation of the determined numbers of images belonging to the first preset category and belonging to the second preset category with the respectively corresponding preset image weight values, to generate the video metric value.
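As an illustrative aid only (not part of the original disclosure), the optional weighted summation can be sketched in Python; the frame counts are hypothetical, while the weights follow the example values of 1 and 0.5 above:

```python
# Weighted summation of per-category image counts into a video metric value.
# Counts are hypothetical; weights follow the example values above (1 and 0.5).
num_first_category = 12   # e.g., frames recognized as slam-dunk images
num_second_category = 30  # e.g., frames recognized as three-point-shot images
WEIGHT_FIRST, WEIGHT_SECOND = 1.0, 0.5

video_metric = num_first_category * WEIGHT_FIRST + num_second_category * WEIGHT_SECOND
print(video_metric)  # 27.0
```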
In some optional implementations of this embodiment, based on the voice data included in the live streaming data, the execution body may process the live streaming data and generate a corresponding voice metric value according to the target object included in the processing result as follows:
First, performing speech recognition on the voice data to generate a speech recognition text.
In these implementations, the execution body may recognize the voice data included in the live streaming data acquired in step 201 through various speech recognition technologies to generate the corresponding speech recognition text.
Second, determining the numbers of texts included in the speech recognition text that belong to a first preset category of text and to a second preset category of text, respectively.
In these implementations, the execution body may determine, in respective ways, the numbers of texts included in the speech recognition text that belong to the first preset category of text and to the second preset category of text. The first preset category of text and the second preset category of text may include preset texts associated with the application scenario. As an example, the first preset category of text may include preset descriptive words such as "nice shot", "beautiful" and "wonderful"; the second preset category of text may include preset prompt words such as "everyone, take a look" and "notice".
Optionally, the first preset category of text may include product description information, where the product description information may include product names and product evaluation information (for example, "really delicious", "useful and inexpensive"). The second preset category of text may include preset selling keywords, which may include, for example, "link is up" and "come and buy". Thus, the method can provide a technical basis, from the perspective of speech recognition, for identifying highlight moments in live selling videos.
Third, generating the voice metric value according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text.
In these implementations, the execution body may generate the voice metric value in various ways according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text. As an example, the execution body may select the larger of the determined numbers as the voice metric value. As yet another example, the execution body may take as the voice metric value the ratio of the selected larger value to the number of words included in the recognized text.
Optionally, the execution body may also first acquire preset text weight values respectively corresponding to the first preset category of text and the second preset category of text, and then perform a weighted summation of the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text with the respectively corresponding preset text weight values, to generate the voice metric value.
Optionally, where the first preset category of text includes product description information and the second preset category of text includes preset selling keywords, the preset text weight value corresponding to the first preset category of text (for example, 1) is usually smaller than the preset text weight value corresponding to the second preset category of text (for example, 5).
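As an illustrative aid only (not part of the original disclosure), this weighted summation over text categories can be sketched in Python; the counts are hypothetical, while the weights follow the example values of 1 and 5 above:

```python
# Weighted summation of recognized-text counts into a voice metric value.
# The counts are hypothetical; the weights follow the example above, where
# selling keywords (weight 5) matter more than product descriptions (weight 1).
num_description_texts = 8  # e.g., occurrences of product description phrases
num_selling_keywords = 3   # e.g., occurrences of "link is up", "come and buy"
WEIGHT_DESCRIPTION, WEIGHT_KEYWORD = 1, 5

voice_metric = (num_description_texts * WEIGHT_DESCRIPTION
                + num_selling_keywords * WEIGHT_KEYWORD)
print(voice_metric)  # 23
```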
In some optional implementations of this embodiment, based on the live interaction data included in the live streaming data, the execution body may process the live streaming data and generate a corresponding interaction metric value according to the target object included in the processing result as follows:
First, determining the number of target interaction behaviors indicated by the live interaction data.
In these implementations, the execution body may determine, in various ways, the number of target interaction behaviors indicated by the live interaction data included in the live streaming data acquired in step 201. The target interaction behaviors may include, but are not limited to, at least one of the following: sending a bullet-screen comment, behavior expressing approval of the streamer (for example, liking or sending a gift), posting a comment and writing a message.
Second, generating the interaction metric value according to the determined number of target interaction behaviors.
In these implementations, the execution body may generate the interaction metric value in various ways according to the number of target interaction behaviors determined in the first step. As an example, the execution body may directly take the determined number of target interaction behaviors as the interaction metric value. As yet another example, the execution body may take as the interaction metric value the ratio of the determined number of target interaction behaviors to a preset value.
Optionally, the target interaction behaviors may include at least two of a first preset interaction behavior, a second preset interaction behavior and a third preset interaction behavior. According to the determined numbers of target interaction behaviors, the execution body may perform a weighted summation of the determined numbers of target interaction behaviors with the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior, to generate the interaction metric value. The first, second and third preset interaction behaviors may include preset interaction behaviors associated with the application scenario.
Optionally, the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior may respectively include a product link appearing in the live stream picture, an order being generated through the product link provided by the live streaming data, and sending a bullet-screen comment. Thus, the method can provide a technical basis, from the perspective of interaction behaviors, for identifying highlight moments in live selling videos.
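As an illustrative aid only (not part of the original disclosure), the weighted summation over the three preset interaction behaviors can be sketched in Python; the behavior names, counts, and weight values are hypothetical assumptions chosen to match the live-selling example above:

```python
# Weighted summation over three preset interaction behaviors. The counts and
# weights are hypothetical; in the live-selling example the behaviors are:
# product link shown, order generated via the link, bullet-screen comment sent.
counts = {"link_shown": 4, "order_generated": 10, "danmaku_sent": 120}
weights = {"link_shown": 2.0, "order_generated": 5.0, "danmaku_sent": 0.1}

interaction_metric = sum(counts[b] * weights[b] for b in counts)
print(interaction_metric)  # 4*2.0 + 10*5.0 + 120*0.1 = 70.0
```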
Step 203: generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value.
In this embodiment, according to at least one of the generated voice metric value and interaction metric value, and the video metric value, the execution body may generate the comprehensive metric value of the live streaming data in various ways. As an example, the execution body may select the maximum among at least one of the generated voice metric value and interaction metric value, and the video metric value as the comprehensive metric value of the live streaming data.
In some optional implementations of this embodiment, according to at least one of the generated voice metric value and interaction metric value, and the video metric value, the execution body may generate the comprehensive metric value of the live streaming data as follows:
First, acquiring preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value.
In these implementations, the execution body may first acquire the preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value. The preset metric weights may be, for example, 0.3, 0.3 and 0.4.
Second, normalizing at least one of the generated voice metric value and interaction metric value, and the video metric value.
In these implementations, the execution body may normalize at least one of the voice metric value and interaction metric value generated in the first step, and the video metric value, so that the normalized metric values are of the same order of magnitude.
Third, performing a weighted summation of at least one of the normalized voice metric value and interaction metric value, and the video metric value, to generate the comprehensive metric value of the live streaming data.
In these implementations, the execution body may perform a weighted summation of at least one of the normalized voice metric value and interaction metric value obtained in the second step, and the video metric value, to generate the comprehensive metric value of the live streaming data.
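As an illustrative aid only (not part of the original disclosure), step 203 can be sketched in Python. The metric values and reference maxima are hypothetical, and the normalization scheme (scaling each metric by a per-metric reference maximum) is an assumption; the text above only requires that the values end up on the same order of magnitude. The weights follow the example values 0.3 / 0.3 / 0.4.

```python
# Minimal sketch of step 203: normalize each metric value, then take a weighted
# sum with the example weights 0.3 / 0.3 / 0.4. The normalization scheme
# (scaling by a reference maximum per metric) is an assumption.
metrics = {"voice": 23.0, "interaction": 70.0, "video": 27.0}
reference_max = {"voice": 50.0, "interaction": 200.0, "video": 60.0}
weights = {"voice": 0.3, "interaction": 0.3, "video": 0.4}

normalized = {k: metrics[k] / reference_max[k] for k in metrics}
comprehensive = sum(normalized[k] * weights[k] for k in metrics)
print(round(comprehensive, 3))  # 0.3*0.46 + 0.3*0.35 + 0.4*0.45 = 0.423
```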
Step 204: in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generating a target video based on the live streaming data.
In this embodiment, in response to determining that the comprehensive metric value of the live streaming data satisfies the preset condition, the execution body may generate the target video based on the live streaming data in various ways. As an example, the preset condition may include that the comprehensive metric value of the live streaming data is greater than a preset metric value threshold. As an example, the execution body may directly take the live streaming data as the target video. As yet another example, the execution body may post-process the live streaming data to obtain the target video, where the post-processing may include, for example, adding filters, adjusting brightness and adjusting contrast.
In some optional implementations of this embodiment, the preset condition may include: the number of live stream slices satisfying a comprehensive metric value condition in the live stream slice set associated with the live streaming data is greater than a target number. The comprehensive metric value condition may include that the comprehensive metric value corresponding to a live stream slice is smaller than the comprehensive metric value of the live streaming data. The target number may be any number pre-specified according to actual application requirements, or a number determined according to a rule, for example the number of live stream slices included in the associated live stream slice set multiplied by a preset ratio.
As an example, the live stream slice set associated with the live streaming data may include time-divided live streaming data slices acquired from the same live stream information source (for example, the same live room id). Suppose the live stream slice set associated with the live streaming data includes 10 live stream slices and the target number is 6. If more than 6 of the live stream slices in the set have comprehensive metric values smaller than the comprehensive metric value of the live streaming data, the preset condition is satisfied.
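As an illustrative aid only (not part of the original disclosure), the preset condition in the example above can be sketched in Python; the slice metric values are hypothetical:

```python
# Minimal sketch of the preset condition in the example above: the condition
# holds when more than `target_number` slices in the associated set have a
# comprehensive metric value smaller than that of the current live stream data.
slice_metrics = [0.31, 0.28, 0.45, 0.12, 0.39, 0.22, 0.18, 0.43, 0.09, 0.35]
current_metric = 0.42
target_number = 6

num_smaller = sum(1 for m in slice_metrics if m < current_metric)
satisfies_condition = num_smaller > target_number
print(num_smaller, satisfies_condition)  # 8 True
```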
Continuing to refer to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for generating a target video according to an embodiment of the present application. In the application scenario of FIG. 3, a user 301 performs a live stream using a terminal device 302. The terminal device 302 sends live streaming data 303 to a backend server 304, where the live streaming data may include voice data and video data. The backend server 304 processes the live streaming data 303 and, according to the included objects representing the degree of excitement (for example, the speech "nice shot" and a "slam-dunk image"), generates a voice metric value of 80 and a video metric value of 70 (as shown by 305 in FIG. 3). The backend server 304 then averages the generated voice metric value and video metric value to generate a comprehensive metric value of 75 (as shown by 306 in FIG. 3). Afterwards, since the comprehensive metric value of 75 is greater than a preset threshold (for example, 70), the backend server 304 may generate a highlight clip video 307 based on the live streaming data 303. Optionally, the backend server 304 may also send the generated highlight clip video 307 to the terminal device 302.
At present, one of the existing technologies usually first saves live streaming data as a long video file and then manually extracts the desired segments from the long video file to generate short videos, which requires a large labor cost. In contrast, with the method provided by the above embodiment of the present application, at least one item of voice data and live interaction data, and video data included in the acquired live streaming data are processed separately to generate a comprehensive metric value obtained from at least one of a voice metric value and an interaction metric value, and a video metric value, and the target video is finally generated. Compared with generating the target video by manual extraction, the solution provided by the present application realizes automatic generation of the target video and effectively reduces labor costs. Compared with methods that generate the target video based on only a single dimension such as audio or video, the solution provided by the present application selects the basis for generating the target video comprehensively from multiple aspects including at least one of voice and live interaction, as well as video, which improves the quality and the generation efficiency of the generated target video.
With further reference to FIG. 4, a flow 400 of yet another embodiment of the method for generating a target video is shown. The flow 400 of the method for generating a target video includes the following steps:
Step 401: acquiring live streaming data.
In this embodiment, the live streaming data may include voice data and video data.
Step 402: processing the live streaming data, and generating, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value.
Step 403: generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value.
Steps 401, 402 and 403 are respectively consistent with steps 201, 202 and 203 in the foregoing embodiment and their optional implementations; the above descriptions of steps 201, 202 and 203 and their optional implementations also apply to steps 401, 402 and 403, and are not repeated here.
Step 404: in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, determining clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data, and generating a target video based on the clipped live streaming data.
In this embodiment, in response to determining that the comprehensive metric value of the live streaming data satisfies the preset condition, the execution body of the method for generating a target video (for example, the server 105 shown in FIG. 1) may generate the target video according to the following steps:
First, determining clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data.
In this embodiment, the execution body may first determine sentence completeness according to the recognized text corresponding to the voice data. Then, according to the determined sentence completeness, the execution body may determine the clipping start and end positions of the live streaming data in various ways, where the clipping start and end positions include a clipping start position and a clipping end position. As an example, in response to determining that the recognized text corresponding to the voice data consists of complete sentences (for example, "XX is really delicious"), the execution body may determine the start and end positions of the voice data as the clipping start and end positions. As yet another example, in response to determining that the recognized text corresponding to the voice data contains incomplete sentences (for example, "...at shot was wonderful" or "next, please pay attention to the..."), the execution body may determine the end position of a sentence having only its second half as the clipping start position, and the start position of a sentence having only its first half as the clipping end position.
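As an illustrative aid only (not part of the original disclosure), the selection of clipping positions from sentence completeness can be sketched in Python. The data structure, timestamps, and the `complete` flag are hypothetical assumptions for illustration:

```python
# Minimal sketch of choosing clipping start/end positions from ASR sentences.
# Each sentence carries timestamps and a completeness flag; the data structure
# and flags are assumptions for illustration, not part of the patent text.
sentences = [
    {"text": "...at shot was wonderful", "start": 0.0, "end": 2.1, "complete": False},
    {"text": "The score is now 98 to 95.", "start": 2.1, "end": 5.8, "complete": True},
    {"text": "What a great game tonight.", "start": 5.8, "end": 9.0, "complete": True},
    {"text": "Next, please pay attention to the...", "start": 9.0, "end": 11.4, "complete": False},
]

# Start clipping after a leading incomplete sentence (only its second half is
# present), and stop before a trailing incomplete sentence (only its first half).
clip_start = sentences[0]["end"] if not sentences[0]["complete"] else sentences[0]["start"]
clip_end = sentences[-1]["start"] if not sentences[-1]["complete"] else sentences[-1]["end"]
print(clip_start, clip_end)  # 2.1 9.0
```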
Second, generating the target video based on the clipped live streaming data.
In this embodiment, the execution body may generate the target video based on the clipped live streaming data in various ways. As an example, the execution body may directly determine the clipped live streaming data as the target video. As yet another example, the execution body may post-process the clipped live streaming data and generate the target video according to the post-processed live streaming data.
In some optional implementations of this embodiment, the execution body may also add special effects to the clipped video stream data to generate the target video. The special effects may include, but are not limited to, at least one of the following: subtitles, stickers and transition effects.
As can be seen from FIG. 4, the flow 400 of the method for generating a target video in this embodiment embodies the step of determining the clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data, and the step of generating the target video based on the clipped live streaming data. Therefore, the solution described in this embodiment can generate the target video according to the sentence completeness of the recognized text corresponding to the voice data, thereby ensuring the completeness of the sentences in the target video.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating a target video. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2 or FIG. 4, and the apparatus may be specifically applied to various electronic devices.
As shown in FIG. 5, the apparatus 500 for generating a target video provided by this embodiment includes an acquiring unit 501, a processing unit 502, a first generating unit 503 and a second generating unit 504. The acquiring unit 501 is configured to acquire live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; the processing unit 502 is configured to process the live streaming data and generate, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; the first generating unit 503 is configured to generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value; and the second generating unit 504 is configured to generate a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition.
In this embodiment, for the specific processing of the acquiring unit 501, the processing unit 502, the first generating unit 503 and the second generating unit 504 of the apparatus 500 for generating a target video and the technical effects brought about by them, reference may be made to the relevant descriptions of step 201, step 202, step 203 and step 204 in the embodiment corresponding to FIG. 2, respectively, which are not repeated here.
In some optional implementations of this embodiment, the processing unit 502 may include a first recognition subunit (not shown in the figure) and a first generation subunit (not shown in the figure). The first recognition subunit may be configured to perform image recognition on video frames in the video data and determine the numbers of images belonging to a first preset category and belonging to a second preset category, respectively. The first generation subunit may be configured to generate the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category.
In some optional implementations of this embodiment, the first generation subunit may include an acquiring module (not shown in the figure) and a generating module (not shown in the figure). The acquiring module may be configured to acquire preset image weight values respectively corresponding to the first preset category of images and the second preset category of images. The generating module may be configured to perform a weighted summation of the determined numbers of images belonging to the first preset category and belonging to the second preset category with the respectively corresponding preset image weight values, to generate the video metric value.
In some optional implementations of this embodiment, the live streaming data may include voice data. The processing unit 502 may include a second recognition subunit (not shown in the figure), a first determination subunit (not shown in the figure) and a second generation subunit (not shown in the figure). The second recognition subunit may be configured to perform speech recognition on the voice data to generate a speech recognition text. The first determination subunit may be configured to determine the numbers of texts included in the speech recognition text that belong to a first preset category of text and to a second preset category of text, respectively. The second generation subunit may be configured to generate the voice metric value according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text.
In some optional implementations of this embodiment, the live streaming data may include live interaction data. The processing unit 502 may include a second determination subunit (not shown in the figure) and a third generation subunit (not shown in the figure). The second determination subunit may be configured to determine the number of target interaction behaviors indicated by the live interaction data. The third generation subunit may be configured to generate the interaction metric value according to the determined number of target interaction behaviors.
In some optional implementations of this embodiment, the target interaction behaviors may include at least two of a first preset interaction behavior, a second preset interaction behavior and a third preset interaction behavior. The third generation subunit may be further configured to: perform a weighted summation of the determined numbers of target interaction behaviors with the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior, to generate the interaction metric value.
In some optional implementations of this embodiment, the first generating unit 503 may include an acquiring subunit (not shown in the figure), a normalization subunit (not shown in the figure) and a fourth generation subunit (not shown in the figure). The acquiring subunit may be configured to acquire preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value. The normalization subunit may be configured to normalize at least one of the generated voice metric value and interaction metric value, and the video metric value. The fourth generation subunit may be configured to perform a weighted summation of at least one of the normalized voice metric value and interaction metric value, and the video metric value, to generate the comprehensive metric value of the live streaming data.
In some optional implementations of this embodiment, the preset condition may include: the number of live stream slices satisfying a comprehensive metric value condition in the live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to a live stream slice is smaller than the comprehensive metric value of the live streaming data.
In some optional implementations of this embodiment, the live streaming data may include voice data. The second generating unit 504 may include a third determination subunit (not shown in the figure) and a fifth generation subunit (not shown in the figure). The third determination subunit may be configured to determine the clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data. The fifth generation subunit may be configured to generate the target video based on the clipped live streaming data.
In some optional implementations of this embodiment, the fifth generation subunit may be further configured to add special effects to the clipped video stream data to generate the target video.
According to the apparatus provided by the above embodiment of the present application, the acquiring unit 501 acquires live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data. Then, the processing unit 502 processes the live streaming data and generates, according to the target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value. Next, the first generating unit 503 generates a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value. Finally, the second generating unit 504 generates a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies the preset condition. Thereby, on the one hand, automatic generation of the target video is realized; on the other hand, the basis for generating the target video is selected comprehensively from multiple aspects including at least one of voice and live interaction, as well as video, which improves the quality and the generation efficiency of the generated target video.
Referring now to FIG. 6, a schematic structural diagram of an electronic device (for example, the server in FIG. 1) 600 suitable for implementing embodiments of the present application is shown. Terminal devices in the embodiments of the present application may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The server shown in FIG. 6 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in FIG. 6, the electronic device 600 may include a processing device (for example, a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, etc.; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
In particular, according to the embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present application are executed.
It should be noted that the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus or device. In the embodiments of the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
The above-mentioned computer-readable medium may be included in the above-mentioned server, or may exist alone without being assembled into the server. The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the server, the server is caused to: acquire live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data; process the live streaming data, and generate, according to a target object included in the processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value; generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value; and in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generate a target video based on the live streaming data.
The present disclosure also provides a computer program that causes a computer to execute the method for generating a target video provided by the above embodiments.
Computer program code for executing the operations of the embodiments of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language, Python or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquiring unit, a processing unit, a first generating unit and a second generating unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit for acquiring live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data".
The above description is merely a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in the embodiments of the present application is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present application.

Claims (15)

  1. A method for generating a target video, comprising:
    acquiring live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data;
    processing the live streaming data, and generating, according to a target object included in a processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value;
    generating a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value;
    in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition, generating a target video based on the live streaming data.
  2. The method according to claim 1, wherein processing the live streaming data and generating a corresponding video metric value according to the target object included in the processing result comprises:
    performing image recognition on video frames in the video data, and determining the numbers of images belonging to a first preset category and images belonging to a second preset category, respectively;
    generating the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category.
  3. The method according to claim 2, wherein generating the video metric value according to the determined numbers of images belonging to the first preset category and belonging to the second preset category comprises:
    acquiring preset image weight values respectively corresponding to the first preset category of images and the second preset category of images;
    performing a weighted summation of the determined numbers of images belonging to the first preset category and belonging to the second preset category with the respectively corresponding preset image weight values, to generate the video metric value.
  4. The method according to any one of claims 1-3, wherein the live streaming data includes voice data; and
    processing the live streaming data and generating a corresponding voice metric value according to the target object included in the processing result comprises:
    performing speech recognition on the voice data to generate a speech recognition text;
    determining the numbers of texts included in the speech recognition text that belong to a first preset category of text and to a second preset category of text, respectively;
    generating the voice metric value according to the determined numbers of texts belonging to the first preset category of text and belonging to the second preset category of text.
  5. The method according to any one of claims 1-4, wherein the live streaming data includes live interaction data; and
    processing the live streaming data and generating a corresponding interaction metric value according to the target object included in the processing result comprises:
    determining the number of target interaction behaviors indicated by the live interaction data;
    generating the interaction metric value according to the determined number of target interaction behaviors.
  6. The method according to claim 5, wherein the target interaction behaviors include at least two of a first preset interaction behavior, a second preset interaction behavior and a third preset interaction behavior; and
    generating the interaction metric value according to the determined number of target interaction behaviors comprises:
    performing a weighted summation of the determined numbers of target interaction behaviors with the preset interaction weights corresponding to the first preset interaction behavior, the second preset interaction behavior and the third preset interaction behavior, to generate the interaction metric value.
  7. The method according to any one of claims 1-6, wherein generating the comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value comprises:
    acquiring preset metric weights respectively corresponding to at least one of the generated voice metric value and interaction metric value, and the video metric value;
    normalizing at least one of the generated voice metric value and interaction metric value, and the video metric value;
    performing a weighted summation of at least one of the normalized voice metric value and interaction metric value, and the video metric value, to generate the comprehensive metric value of the live streaming data.
  8. The method according to any one of claims 1-7, wherein the preset condition includes: the number of live stream slices satisfying a comprehensive metric value condition in a live stream slice set associated with the live streaming data is greater than a target number, wherein the comprehensive metric value condition includes that the comprehensive metric value corresponding to a live stream slice is smaller than the comprehensive metric value of the live streaming data.
  9. The method according to any one of claims 1-8, wherein the live streaming data includes voice data; and
    generating the target video based on the live streaming data comprises:
    determining clipping start and end positions of the live streaming data according to the sentence completeness of the recognized text corresponding to the voice data;
    generating the target video based on the clipped live streaming data.
  10. The method according to claim 9, wherein generating the target video based on the clipped live streaming data comprises:
    adding special effects to the clipped live streaming data to generate the target video.
  11. An apparatus for generating a target video, comprising:
    an acquiring unit configured to acquire live streaming data, wherein the live streaming data includes at least one item of voice data and live interaction data, and video data;
    a processing unit configured to process the live streaming data and generate, according to a target object included in a processing result, at least one of a corresponding voice metric value and an interaction metric value, and a video metric value;
    a first generating unit configured to generate a comprehensive metric value of the live streaming data according to at least one of the generated voice metric value and interaction metric value, and the video metric value;
    a second generating unit configured to generate a target video based on the live streaming data in response to determining that the comprehensive metric value of the live streaming data satisfies a preset condition.
  12. A server, comprising:
    one or more processors;
    a storage device on which one or more programs are stored;
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-10.
  13. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-10.
  14. A computer program product, comprising computer program instructions, wherein the computer program instructions, when executed by a computer, cause the computer to implement the method according to any one of claims 1-10.
  15. A computer program, wherein the computer program, when executed by a computer, causes the computer to implement the method according to any one of claims 1-10.
PCT/CN2021/112140 2020-08-12 2021-08-11 Method, apparatus, server and medium for generating a target video WO2022033534A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21855585.2A EP4178135A4 (en) 2020-08-12 2021-08-11 METHOD OF GENERATION OF TARGET VIDEO, DEVICE, SERVER AND MEDIUM
JP2023507247A JP2023535989A (ja) 2020-08-12 2021-08-11 Method, apparatus, server and medium for generating a target video
US17/819,507 US11750898B2 (en) 2020-08-12 2022-08-12 Method for generating target video, apparatus, server, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010806612.5 2020-08-12
CN202010806612.5A CN111935155B (zh) 2020-08-12 2020-08-12 Method, apparatus, server and medium for generating a target video

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/819,507 Continuation US11750898B2 (en) 2020-08-12 2022-08-12 Method for generating target video, apparatus, server, and medium

Publications (1)

Publication Number Publication Date
WO2022033534A1 true WO2022033534A1 (zh) 2022-02-17

Family

ID=73311216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112140 2020-08-12 2021-08-11 Method, apparatus, server and medium for generating a target video WO2022033534A1 (zh)

Country Status (5)

Country Link
US (1) US11750898B2 (zh)
EP (1) EP4178135A4 (zh)
JP (1) JP2023535989A (zh)
CN (1) CN111935155B (zh)
WO (1) WO2022033534A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111935155B (zh) 2020-08-12 2021-07-30 北京字节跳动网络技术有限公司 Method, apparatus, server and medium for generating a target video
CN112584224B * 2020-12-08 2024-01-02 北京字节跳动网络技术有限公司 Information display and processing method, apparatus, device and medium
CN113766282B * 2021-10-20 2023-10-27 上海哔哩哔哩科技有限公司 Live video processing method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721165B1 (en) * 2015-11-13 2017-08-01 Amazon Technologies, Inc. Video microsummarization
CN107172482A * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 Method and apparatus for generating graphics interchange format (GIF) pictures
CN109326310A * 2017-07-31 2019-02-12 西梅科技(北京)有限公司 Automatic editing method and apparatus, and electronic device
US20190188320A1 (en) * 2017-12-14 2019-06-20 Facebook, Inc. Systems and methods for providing ephemeral content items created from live stream videos
CN110650374A * 2019-08-16 2020-01-03 咪咕文化科技有限公司 Editing method, electronic device and computer-readable storage medium
CN111050191A * 2019-12-30 2020-04-21 腾讯科技(深圳)有限公司 Video generation method and apparatus, computer device and storage medium
CN111935155A * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Method, apparatus, server and medium for generating a target video

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5121367B2 (ja) * 2007-09-25 2013-01-16 株式会社東芝 Apparatus, method and system for outputting video
CN102427507B (zh) * 2011-09-30 2014-03-05 北京航空航天大学 Event model-based automatic synthesis method for soccer video highlights
CN103605719A (zh) * 2013-11-13 2014-02-26 苏州天擎电子通讯有限公司 Video software capable of retrieving highlight segments of sports events
US11012719B2 (en) * 2016-03-08 2021-05-18 DISH Technologies L.L.C. Apparatus, systems and methods for control of sporting event presentation based on viewer engagement
US20180082313A1 (en) * 2016-09-22 2018-03-22 MyChannel Inc. Systems and methods for prioritizing user reactions to content for response on a social-media platform
US10440431B1 (en) * 2016-11-28 2019-10-08 Amazon Technologies, Inc. Adaptive and automatic video scripting
CN108650531A (zh) * 2018-07-17 2018-10-12 北京引领海逛科技有限公司 Method and system for quickly matching products to video content
CN110856013A (zh) * 2019-11-19 2020-02-28 珠海格力电器股份有限公司 Method, system and storage medium for identifying key segments in a video

Also Published As

Publication number Publication date
CN111935155A (zh) 2020-11-13
CN111935155B (zh) 2021-07-30
EP4178135A1 (en) 2023-05-10
EP4178135A4 (en) 2023-07-05
US11750898B2 (en) 2023-09-05
US20220385996A1 (en) 2022-12-01
JP2023535989A (ja) 2023-08-22

Similar Documents

Publication Publication Date Title
CN109460513B (zh) Method and apparatus for generating a click-through rate prediction model
WO2022033534A1 (zh) Method, apparatus, server and medium for generating a target video
CN108989882B (zh) Method and apparatus for outputting a music segment in a video
CN109976997B (zh) Test method and apparatus
CN111800671B (zh) Method and apparatus for aligning paragraphs and video
WO2020000876A1 (zh) Method and apparatus for generating a model
CN111314733A (zh) Method and apparatus for evaluating video definition
CN109582825B (zh) Method and apparatus for generating information
CN109214501B (zh) Method and apparatus for identifying information
CN112530408A (zh) Method, apparatus, electronic device and medium for speech recognition
CN109862100B (zh) Method and apparatus for pushing information
US20200137429A1 (en) Video media content analysis
US20240061899A1 (en) Conference information query method and apparatus, storage medium, terminal device, and server
CN112509562A (zh) Method, apparatus, electronic device and medium for text post-processing
CN111897950A (zh) Method and apparatus for generating information
CN115801980A (zh) Video generation method and apparatus
CN111883139A (zh) Method, apparatus, device and medium for screening target speech
CN111125502B (zh) Method and apparatus for generating information
CN108664610B (zh) Method and apparatus for processing data
CN111862933A (zh) Method, apparatus, device and medium for generating synthesized speech
WO2022042398A1 (zh) Method and apparatus for determining an object adding mode, electronic device and medium
CN114613350A (zh) Test method and apparatus, electronic device and storage medium
CN111797273B (zh) Method and apparatus for adjusting parameters
CN111949860B (zh) Method and apparatus for generating a relevance determination model
CN113704541A (zh) Method and apparatus for acquiring training data and pushing video, medium and electronic device

Legal Events

ENP (Entry into the national phase): Ref document number: 2023507247; Country of ref document: JP; Kind code of ref document: A

ENP (Entry into the national phase): Ref document number: 2021855585; Country of ref document: EP; Effective date: 20230131

NENP (Non-entry into the national phase): Ref country code: DE